Every large enterprise is accelerating its digital transformation strategy to engage with customers in a more personalized, relevant, and dynamic way. The ability to perform analytics on data as it is created and collected (a.k.a. real-time data streams) and to generate immediate insights for faster decision making gives organizations a competitive edge.
Organizations are increasingly building low-latency, data-driven applications, automations, and intelligence from real-time data streams. Use cases like fraud detection, network threat analysis, manufacturing intelligence, commerce optimization, real-time offers, instantaneous loan approvals, and more are now possible by moving the data processing components up the stream to address these real-time needs.
Cloudera Stream Processing (CSP) enables customers to turn streams into data products by providing capabilities to analyze streaming data for complex patterns and gain actionable intel. For example, a large biotech company uses CSP to manufacture devices to exact specifications by analyzing and alerting on out-of-spec resolution color imbalance. A number of large financial services companies use CSP to power their global fraud processing pipelines and prevent users from exploiting race conditions in the loan approval process.
In 2015, Cloudera became one of the first vendors to provide enterprise support for Apache Kafka, which marked the genesis of the Cloudera Stream Processing (CSP) offering. Over the last seven years, Cloudera’s Stream Processing product has evolved to meet the changing streaming analytics needs of our 700+ enterprise customers and their diverse use cases. Today, CSP is powered by Apache Flink and Kafka and provides a complete, enterprise-grade stream management and stateful processing solution. The combination of Kafka as the streaming storage substrate, Flink as the core in-stream processing engine, and first-class support for industry-standard interfaces like SQL and REST allows developers, data analysts, and data scientists to easily build real-time data pipelines that power data products, dashboards, business intelligence apps, microservices, and data science notebooks.
CSP was recently recognized as a leader in the 2022 GigaOm Radar for Streaming Data Platforms report.
This blog aims to answer two questions, as illustrated in the diagram below.
Figure 1: The evolution of the Cloudera Stream Processing offering based on customers’ evolving streaming use cases and requirements.
As customers started to build data lakes and lakehouses (before the term even existed) for multifunction analytics, a common set of desired outcomes emerged around data ingestion:
These desired outcomes begat the need for a distributed streaming storage substrate optimized for ingesting and processing streaming data in real time. Apache Kafka was purpose-built for this need, and Cloudera was one of the earliest vendors to offer enterprise support for it. The combination of Cloudera Stream Processing and DataFlow, powered by Apache Kafka and NiFi respectively, has helped hundreds of customers build real-time ingestion pipelines that achieve the desired outcomes above with architectures like the following.
Figure 2: Draining Streams Into Lakes: Apache Kafka is used to power microservices and application integration, and to enable real-time ingestion into various data-at-rest analytics services.
As Kafka became the standard streaming storage substrate within the enterprise, Kafka blindness set in. What is Kafka blindness? Who does it affect? Kafka blindness is the enterprise’s struggle to monitor, troubleshoot, heal, govern, secure, and provide disaster recovery for its Apache Kafka clusters.
The blindness doesn’t discriminate; it affects different teams in different ways. For a platform operations team, it is the lack of visibility at the cluster and broker level, including the effects of a broker on the infrastructure it runs on and vice versa. A DevOps/app dev team is primarily interested in the entities associated with its applications: the topics, producers, and consumers. It wants to know how data flows between these entities and to understand their key performance metrics (KPMs). For governance and security teams, the questions revolve around chain of custody, audit, metadata, access control, and lineage. Site availability teams are focused on meeting the strict recovery time objective (RTO) of their disaster recovery cluster.
Cloudera Stream Processing has cured Kafka blindness for our customers by providing a comprehensive set of enterprise management capabilities addressing schema governance, management and monitoring, disaster recovery, simple data movement, intelligent rebalancing, self-healing, and robust access control and audit.
Figure 3: Cloudera Stream Processing offers a comprehensive set of enterprise management services for Apache Kafka.
By 2018, we saw the majority of our customers adopt Apache Kafka as a key part of their streaming ingestion, application integration, and microservice architectures. Customers also began to understand that to better serve their own customers and maintain a competitive edge, analytics had to be done in real time: not in days or hours, but within seconds or faster.
The vice president of architecture and engineering at one of the largest insurance providers in Canada summed it up well in a recent customer meeting:
“We can’t wait for the data to persist and run jobs later; we need real-time insight as the data flows through our pipeline. We had to build the streaming data pipeline that new data has to move through before it can be persisted, and then give business teams access to that pipeline so they can build data products.”
In other words, Kafka provided a mechanism to ingest streaming data faster, but traditional data-at-rest analytics was too slow for real-time use cases; analysis had to happen as close to data origination as possible.
In 2020, to address this need, Apache Flink was added to the Cloudera Stream Processing offering. Apache Flink is a distributed processing engine for stateful computations that is ideally suited for real-time, event-driven applications. Building real-time data analytics pipelines is a complex problem, and we saw customers struggle with processing frameworks such as Apache Storm, Spark Streaming, and Kafka Streams.
Apache Flink was added to address the hard problems our customers faced when building production-grade streaming analytics applications, including:
Apache Kafka is critical as the streaming storage substrate for stream processing, and Apache Flink is the best-in-breed compute engine for processing those streams. The combination of Kafka and Flink is essential as customers move from data-at-rest analytics to data-in-motion analytics that power low-latency, real-time data products.
Figure 4: For real-time use cases that require low latency, Apache Flink enables in-stream analytics without first persisting the data.
While Apache Flink adds powerful capabilities to the CSP offering with a simple high-level API in multiple languages, stream processing constructs like stateful processing, exactly-once semantics, windowing, watermarking, and the subtleties between event time and system time are new concepts for most developers and novel ones for data analysts, DBAs, and data scientists.
Meet Laila, a very opinionated practitioner of Cloudera Stream Processing. She is a smart data analyst and former DBA working at a planet-scale manufacturing company. She needs to measure streaming telemetry metadata from multiple manufacturing sites for capacity planning to prevent disruptions. Laila wants to use CSP but doesn’t have time to brush up on her Java or learn Scala; she does, however, know SQL really well.
In 2021, SQL Stream Builder (SSB) was added to CSP to address the needs of Laila and the many users like her. SSB provides a comprehensive interactive user interface for developers, data analysts, and data scientists to write streaming applications with industry-standard SQL. Using SQL, the user simply declares expressions that filter, aggregate, route, and mutate streams of data. When the streaming SQL is executed, the SSB engine converts it into optimized Flink jobs.
Figure 5: SQL Stream Builder (SSB) is a comprehensive interactive user interface for creating stateful stream processing jobs using SQL.
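To make this concrete, here is a minimal sketch of the kind of streaming SQL Laila could write in SSB. The table and column names (machine_telemetry, site_id, temperature, event_time) are hypothetical; the sketch assumes a Kafka-backed virtual table with an event-time watermark has already been defined (the DDL for such a table is shown in a later example).

```sql
-- Hypothetical example: count out-of-spec temperature readings per site
-- over one-minute tumbling event-time windows.
SELECT
  site_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS out_of_spec_readings
FROM machine_telemetry
WHERE temperature > 85.0
GROUP BY
  site_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE);
```

Because the query runs continuously as a Flink job, each window’s counts are emitted as soon as the watermark passes the end of the window, with no batch job to schedule.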
During a customer workshop, Laila, a seasoned former DBA, offered a comment that we often hear from our customers:
“Streaming data has little value unless I can easily integrate, join, and mesh those streams with the other data sources I have in my warehouse, relational databases, and data lake. Without context, streaming data is useless.”
SSB enables users to configure data providers using out-of-the-box connectors or their own connector to any data source. Once the data providers are created, the user can easily create virtual tables using DDL. Complex integration between multiple streams and batch data sources becomes easier, as in the example below.
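As a sketch of what that can look like, the DDL below declares the hypothetical machine_telemetry virtual table over a Kafka topic and then joins the stream against a site_capacity dimension table, which is assumed to have been registered through a data lake or relational database provider. All names and connector properties are illustrative.

```sql
-- Hypothetical DDL: a virtual table over a Kafka topic, with a
-- watermark so event-time windowing works downstream.
CREATE TABLE machine_telemetry (
  site_id     STRING,
  machine_id  STRING,
  temperature DOUBLE,
  event_time  TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'machine.telemetry',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

-- Enrich the stream with batch context: flag readings that exceed
-- each site's rated limit from the `site_capacity` dimension table.
SELECT t.site_id,
       t.machine_id,
       t.temperature,
       c.max_rated_temperature
FROM machine_telemetry AS t
JOIN site_capacity AS c
  ON t.site_id = c.site_id
WHERE t.temperature > c.max_rated_temperature;
```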
Another common need from our users is to make it simple to serve up the results of the streaming analytics pipeline into the data products they are creating. These data products can be web applications, dashboards, alerting systems, or even data science notebooks.
SSB can materialize the results of a streaming SQL query into a persistent view of the data that can be read via a REST API. This highly consumable dataset is called a materialized view (MV), and BI tools and applications can use the MV REST endpoint to query streams of data without a dependency on other systems. The combination of Kafka as the streaming storage substrate, Flink as the core in-stream processing engine, SQL to build data applications faster, and MVs to make the streaming results universally consumable enables the hybrid streaming data pipelines described below.
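As an illustration, a continuous query like the one below (reusing the hypothetical machine_telemetry table) can back a materialized view in SSB; once the MV is created, clients read its latest results over a generated REST endpoint. The URL in the comment is purely illustrative.

```sql
-- Continuous query whose latest results back a materialized view.
-- A dashboard, app, or notebook could then read it over REST, e.g.:
--   GET https://<ssb-host>/api/v1/query/<mv-id>/capacity_summary?key=<api-key>
-- (illustrative URL; the real endpoint is generated when the MV is created)
SELECT site_id,
       TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
       AVG(temperature) AS avg_temperature,
       MAX(temperature) AS peak_temperature
FROM machine_telemetry
GROUP BY site_id,
         TUMBLE(event_time, INTERVAL '5' MINUTE);
```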
So did we make Laila successful? Once Laila started using SSB, she quickly put her SQL skills to work, parsing and processing complex streams of telemetry metadata from Kafka together with contextual information from her company’s manufacturing data lakes in the data center and in the cloud, creating a hybrid streaming pipeline. She then used a materialized view to build a Grafana dashboard that provides a real-time view of capacity planning needs at each manufacturing site.
Cloudera Stream Processing has evolved from enabling real-time ingestion into lakes to providing complex in-stream analytics, all while remaining accessible to the Lailas of the world. As Laila so accurately put it, “without context, streaming data is useless.” With the help of CSP, you can ensure your data pipelines connect across data sources, so real-time streaming data is considered within the context of the data that lives in your data warehouses, lakes, lakehouses, operational databases, and so on. Better yet, it works in any cloud environment. And because CSP relies on industry-standard SQL, you can be confident that your existing resources have the know-how to deploy it successfully.
Not in the manufacturing space? Not to worry. In subsequent blogs, we’ll deep dive into use cases across a number of verticals and discuss how they were implemented using CSP.
Cloudera Stream Processing is available to run on your private cloud or in the public cloud on AWS, Azure, and GCP. Check out our new Cloudera Stream Processing interactive product tour to create an end-to-end hybrid streaming data pipeline on AWS.
What’s the fastest way to learn more about Cloudera Stream Processing and take it for a spin? First, visit our new Cloudera Stream Processing home page. Then download the Cloudera Stream Processing Community Edition on your desktop or development node, and within five minutes, deploy your first streaming pipeline and experience your a-ha moment.