CDH 6 includes Apache Kafka as part of the core package. The documentation includes improved content on how to set up, install, and administer your Kafka ecosystem. For more information, see the Cloudera Enterprise 6.0.x Apache Kafka Guide. We look forward to your feedback on both the existing and new documentation.

Using Apache Kafka with Apache Spark Streaming

For information on how to configure Apache Spark Streaming to receive data from Apache Kafka, see the appropriate version of the Spark Streaming + Kafka Integration Guide: 1.6.0 or 2.3.0.

In CDH 5.7 and higher, the Spark connector to Kafka works only with Kafka 2.0 and higher.

Validating Kafka Integration with Spark Streaming

To validate your Kafka integration with Spark Streaming, run the KafkaWordCount example.

If you installed Spark using parcels, use the following command:
/opt/cloudera/parcels/CDH/lib/spark/bin/run-example streaming.KafkaWordCount <zkQuorum> <group> <topics> <numThreads>

If you installed Spark using packages, use the following command:

/usr/lib/spark/bin/run-example streaming.KafkaWordCount <zkQuorum> <group> <topics> <numThreads>

Replace the variables as follows:
  • <zkQuorum> - ZooKeeper quorum URI used by Kafka (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181).
  • <group> - Consumer group used by the application.
  • <topics> - Kafka topic, or comma-separated list of topics, containing the data for the application.
  • <numThreads> - Number of consumer threads reading the data. If this is higher than the number of partitions in the Kafka topic, some threads will be idle.
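
As an illustration of the substitutions above, the following sketch assembles the parcel-based command from example values and estimates how many consumer threads would sit idle. The group name, topic name, thread count, and partition count are hypothetical placeholders, not values from your cluster, and the idle-thread arithmetic assumes a single topic. The script only prints the command; it does not launch Spark.

```shell
#!/bin/sh
# Hypothetical example values; substitute the ones for your cluster.
ZK_QUORUM="zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181"
GROUP="wordcount-group"      # hypothetical consumer group name
TOPICS="wordcount-input"     # hypothetical topic name (single topic assumed)
NUM_THREADS=4                # consumer threads requested
PARTITIONS=3                 # assumed partition count of the topic

# Parcel install path from this guide; use /usr/lib/spark for package installs.
SPARK_DIR="/opt/cloudera/parcels/CDH/lib/spark"

# Threads beyond the partition count have no partition to read from.
if [ "$NUM_THREADS" -gt "$PARTITIONS" ]; then
  IDLE_THREADS=$((NUM_THREADS - PARTITIONS))
else
  IDLE_THREADS=0
fi
echo "Idle consumer threads expected: $IDLE_THREADS"

# Assemble the command in the documented argument order and print it.
CMD="$SPARK_DIR/bin/run-example streaming.KafkaWordCount $ZK_QUORUM $GROUP $TOPICS $NUM_THREADS"
echo "$CMD"
```

To run the example for real, execute the printed command on a host where Spark is installed. Setting the thread count equal to the topic's partition count avoids idle threads.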