Enterprises relentlessly try to turn their ever-increasing streams of data to their advantage. IDC predicts that the world's data will grow from 33 ZB in 2018 (1 zettabyte = 1 million petabytes) to 175 ZB by 2025. IT departments are expected to build fast, resilient, fault-tolerant data pipelines that can autonomously manage tasks, such as scaling up infrastructure, to provide just enough processing power to consistently meet business SLAs. They are also expected to allocate storage dynamically to meet growing data needs while keeping the overall budget in mind.
This frees data engineers to concentrate on writing the pipeline code itself, given that their primary goal is to find ways to leverage their enterprise's data assets to improve decision-making, open new growth opportunities, contribute to customer acquisition/retention strategies, and better serve ever-changing data requirements.
This blog provides an overview of the agile data processing options available to enterprises on the Google Cloud Platform (GCP) for data analytics. It covers data ingestion at scale, building reliable data pipelines, architecting modern data warehouses or data lakes, and providing platforms for analytics. While the intent is not to cover the entire list of options, it touches upon the most relevant services to help the reader develop strategies for migrating or building a streaming analytics pipeline, from ingestion through consumption.
As mentioned above, the data processing options available on GCP are part of its broader data analytics offering. Let us take a look at each high-level component of the pipeline.
The first step is to get the data into the platform from various disparate sources, generally referred to as data ingestion. Once in the pipeline, the data is enriched and transformed for analytical purposes. The processed data is typically stored in a modern data warehouse or data lake, where it is finally consumed by enterprise users for operational and analytical reporting, machine learning, advanced analytics, and AI use cases. The whole pipeline can be automated using an orchestrator like Cloud Composer.
In a serverless (fully managed) data analytics platform, unlike a traditional one, infrastructure and platform concerns such as monitoring, tuning, utilization, resource provisioning, scaling, and reliability are moved away from the application developer and handled by the platform provider. Let's take a deeper dive into each of the stages and the corresponding relevant services.
Cloud Pub/Sub: An event-driven data ingestion and data movement service that follows a publish/subscribe pattern for reliable, asynchronous messaging. Scalable up to 100 GB/sec with consistent performance, it can satisfy the scale of almost any enterprise. Data can be retained for several days; retention is set to 7 days by default. The service is deeply integrated with the other components of the GCP analytics platform.
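To make the pattern concrete, here is a minimal sketch, using the google-cloud-pubsub Python client, of publishing one event and pulling it back through a subscription. The project, topic, and subscription names are placeholders and are assumed to already exist.

```python
# Minimal Pub/Sub publish/subscribe sketch (placeholder project/topic/subscription).
import json
from concurrent import futures

from google.cloud import pubsub_v1

PROJECT_ID = "my-gcp-project"        # placeholder
TOPIC_ID = "clickstream-events"      # placeholder
SUBSCRIPTION_ID = "clickstream-sub"  # placeholder

# Publish one event; publish() returns a future that resolves to a message ID.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
event = {"user_id": 42, "action": "page_view"}
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print("Published message:", future.result())

# Pull the event back with a streaming subscriber callback.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def callback(message):
    print("Received:", message.data)
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen for 30 seconds, then stop
except futures.TimeoutError:
    streaming_pull.cancel()
```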
Cloud Dataflow: It is used for processing streaming data in real time. The traditional approach to building data pipelines was to maintain separate codebases (Lambda architecture patterns) for batch, micro-batch, and stream processing. Cloud Dataflow provides a unified programming model so that users can process all of these workloads with the same codebase, which also simplifies operations and management.
Apache Beam: An open-source unified model with a set of SDKs for defining and executing data pipelines. It gives users the flexibility to develop their code in languages like Java, Scala, and Python. Dataflow itself is built on the Apache Beam SDKs, so building your codebase in Beam also allows developers to run or port the same code to other processing engines like Spark, Flink, and Dataflow.
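For illustration, here is a small Beam sketch in the Python SDK, assuming a simple CSV input whose file paths are placeholders. The same transform graph runs locally on the DirectRunner or on Dataflow by changing only the pipeline options, which is the essence of the unified model.

```python
# A minimal Apache Beam pipeline (Python SDK): sum an amount column from a CSV.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DirectRunner",  # switch to "DataflowRunner" (plus project/region/temp_location) to run on Dataflow
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadOrders" >> beam.io.ReadFromText("orders.csv")              # placeholder input
        | "ParseAmount" >> beam.Map(lambda line: float(line.split(",")[2]))  # assumes amount in 3rd column
        | "SumAmounts" >> beam.CombineGlobally(sum)
        | "Write" >> beam.io.WriteToText("order_totals")                  # placeholder output prefix
    )
```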
Dataproc: A fully managed Apache Hadoop and Spark service that lets users run all the familiar open-source Hadoop tools, like Spark, Hadoop, Hive, Tez, Presto, and Jupyter, tightly integrated with the services in the GCP ecosystem. It gives the flexibility to rapidly define clusters and to choose the machine types used for master and worker nodes.
There are two types of Dataproc clusters that can be provisioned on GCP. The first is the ephemeral cluster: the cluster is defined when a job is submitted, scaled up or down as the job requires, and deleted once the job is completed. The second is the long-standing cluster, where the user creates a cluster (comparable to an on-premise cluster) with a defined minimum and maximum number of nodes; jobs execute within those constraints, and when they complete, the cluster scales down to the minimum. Depending on the use case and the processing power needed, this gives the flexibility to define the right type of cluster (a minimal provisioning sketch follows below).
Dataproc is an enterprise-ready service with high availability and high scalability. It allows both horizontal scaling (to the tune of thousands of nodes per cluster) and vertical scaling (configurable compute machine types, GPUs, solid-state drives, and persistent disks).
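To illustrate the ephemeral pattern mentioned above, here is a hedged sketch using the google-cloud-dataproc Python client: create a small cluster, run the job, and tear the cluster down afterwards. The project, region, cluster name, and machine types are placeholder assumptions.

```python
# Sketch of an ephemeral-style Dataproc cluster lifecycle (placeholder names/sizes).
from google.cloud import dataproc_v1

PROJECT_ID = "my-gcp-project"  # placeholder
REGION = "us-central1"         # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": "ephemeral-etl-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# Create the cluster and wait for the long-running operation to finish.
operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
print("Cluster created:", operation.result().cluster_name)

# ...submit Spark/Hive jobs here (e.g., via dataproc_v1.JobControllerClient)...

# Delete the cluster once the job completes, so you only pay while it runs.
client.delete_cluster(
    request={
        "project_id": PROJECT_ID,
        "region": REGION,
        "cluster_name": "ephemeral-etl-cluster",
    }
).result()
```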
BigQuery: A modern, ANSI SQL compliant data warehouse offered as part of the Google Cloud Platform. It is a fully managed, serverless, petabyte-scale data warehouse in which data is securely encrypted and durably stored. BigQuery natively supports real-time streaming as well as machine learning through BigQuery ML.
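As a quick illustration, the sketch below runs a standard SQL query against a BigQuery public dataset using the google-cloud-bigquery Python client; the project and credentials are assumed to come from the environment.

```python
# Minimal BigQuery query sketch against a public sample dataset.
from google.cloud import bigquery

client = bigquery.Client()  # project and credentials picked up from the environment

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# query() starts the job; iterating over result() waits for completion.
for row in client.query(sql).result():
    print(row.name, row.total)
```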
Cloud Data Fusion: A fully managed enterprise data integration service for building and managing data pipelines. Developers, analysts, and data scientists can use it to visually create, test, debug, and deploy data pipelines. It also natively supports everyday data engineering tasks like data cleansing, matching, de-duplication, blending, and transformation, and it helps run and operationalize data pipelines at scale on GCP.
Cloud Composer: A fully managed workflow orchestration service. It is built on the Apache Airflow open-source project and enables users to author, schedule, and monitor end-to-end data pipelines. Composer provides a graphical representation of each workflow, which makes workflows easier to manage. Pipelines are configured as directed acyclic graphs (DAGs) using Python. Cloud Composer integrates natively with all GCP data and analytics services and can connect pipelines through a single orchestration tool regardless of where the workflow resides, on-premises or in the cloud.
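As a rough illustration of such a DAG, the sketch below defines a three-task Airflow workflow in Python. The bash commands are placeholders; in a real Composer environment they would typically be replaced with the GCP provider operators (Dataflow, BigQuery, and so on).

```python
# A minimal Airflow 2.x DAG sketch with placeholder tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_analytics_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo 'ingest raw data'")
    transform = BashOperator(task_id="transform", bash_command="echo 'run Dataflow job'")
    load = BashOperator(task_id="load", bash_command="echo 'load into BigQuery'")

    # Task dependencies define the directed acyclic graph.
    ingest >> transform >> load
```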
A typical architecture for implementing a batch or streaming pipeline would include, but is not limited to, the following components.
All of these components are fully managed services on GCP, and one can use them to design and implement pipelines swiftly. Below is an example of a data pipeline using Cloud Dataflow.
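As a code-level companion to that example, here is a hedged sketch of a streaming Beam/Dataflow pipeline that reads JSON events from Pub/Sub, applies fixed windows, and writes to BigQuery. The subscription, table, and schema are placeholder assumptions.

```python
# Streaming sketch: Pub/Sub -> Dataflow (Beam) -> BigQuery, with placeholder names.
import json

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Supply --runner=DataflowRunner, --project, --region, --temp_location, etc. on the command line.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-gcp-project/subscriptions/clickstream-sub"  # placeholder
        )
        | "Parse" >> beam.Map(json.loads)
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute fixed windows
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-gcp-project:analytics.clickstream",              # placeholder table
            schema="user_id:INTEGER,action:STRING",              # placeholder schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```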
Another variant of this, as explained above, is for environments with dependencies on Hadoop/Spark tools; there, it is recommended to build the data pipelines using Dataproc or Dataflow.
Picking the right options for data processing and analytical requirements depends on a variety of factors, including talent, cost, time to market, processing volumes, and future product capabilities. Below are a few general selection criteria to consider when making decisions around real-time data processing needs.