X

Cloudera Tutorials

Optimize your time with detailed tutorials that clearly explain the best way to deploy, use, and manage Cloudera products. Login or register below to access all Cloudera tutorials.

Cloudera named a leader in 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems Get the report

Ready to Get Started?

 

NOTICE

 

As of January 31, 2021, this tutorial references legacy products that no longer represent Cloudera’s current product offerings.

Please visit recommended tutorials:

 

Objective

To understand the concepts behind messaging systems in distributed systems, and how to use them to pass on information between the producer (publisher, sender) to the consumer (subscriber, receiver). In this example, you will learn about Kafka.

Outline

What is a Messaging System?

Messaging systems transfer data between client applications. One application produces the data, such as reading from sensors embedded on vehicles and the other application receives the data, processes it to be ready to be visualized to show the characteristics about the drivers driving behavior who drive those vehicles. As you can see each application developer can focus on writing code to analyze the data and not worry about how to share the data. There are two messaging systems used in this scenario, point to point and publish subscribe. The system most often used is publish subscribe, but we will look at both.

Point to Point System

Point to Point are messages transmitted into a queue

Point to Point

Key Characteristics of generic figure above:

  • producer sends messages into the queue, each message is read by only one consumer
  • once the message is consumed, it vanishes
  • multiple consumers can read messages from the queue

Publish-Subscribe System

Publish-Subscribe are messages transmitted into a topic

Publish Subscribe

  • message producers are known as publishers
  • message consumers are known as subscribers

How does the Pub-Sub messaging system work?

  • publisher sends messages into 1 or more topics
  • subscribers can arrange to receive 1 or more topics, then consume all the messages

What is Kafka

Apache Kafka is an open source publish-subscribe based messaging system responsible for transferring data from one application to another.

Architectural Overview

At a high level, our data pipeline looks as follows:

Kafka Arch High Level

NiFi Producer

Produces a continuous real-time data feed from truck sensors and traffic information that are separately published into two Kafka topics using a NiFi Processor implemented as a Kafka Producer.

To learn more about the Kafka Producer API Sample Code, visit Developing Kafka Producers

Kafka Cluster

Has 1 or more topics for supporting 1 or multiple categories of messages that are managed by Kafka brokers, which create replicas of each topic (category queue) for durability.

Storm Consumer

Reads messages from Kafka Cluster and emits them into the Apache Storm Topology to be processed.

To learn more about the Kafka Consumer API Sample Code, visit Developing Kafka Consumers

Benefits of Kafka

Reliability

  • Distributed, partitioned, replicated and fault tolerant

Scalability

  • Messaging system scales easily without down time

Durability

  • "Distributed commit log" which allows messages to continue to exist on disk even after the processes that created that data have ended

Performance

  • High throughput for publishing and subscribing messages
  • Maintains stable performance for many terabytes stored

Next: Kafka in Action

Now that we've become familiar with how Kafka will be used in our use case, let's move onto seeing Kafka in action when running the demo application.



Your form submission has failed.

This may have been caused by one of the following:

  • Your request timed out
  • A plugin/browser extension blocked the submission. If you have an ad blocking plugin please disable it and close this message to reload the page.