CDS Powered by Apache Spark Overview

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala.

For detailed API information, see the Apache Spark project site.

CDS Powered by Apache Spark is an add-on service for CDH, distributed as a parcel and a custom service descriptor. It consists of Apache Spark 2 core and several related projects:
  • Spark SQL: Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
  • Spark Streaming: API that allows you to build scalable fault-tolerant streaming applications.
  • MLlib: API that implements common machine learning algorithms.

Cloudera products include these versions of Apache Spark: 1.6, 2.0, 2.1, 2.2, 2.3, and 2.4.

Spark 1.6 is included as part of CDH 5 in Cloudera Enterprise 5.7.x and higher. Its latest documentation is available in the Cloudera Enterprise documentation.

This document describes the separately released CDS 2.4 Powered by Apache Spark. It ships separately so that customers can install and upgrade Apache Spark 2 without a full upgrade of the CDH cluster.

On CDH 5, a Spark 1.6 service can coexist with a Spark 2 service. The configurations of the two services do not conflict, and both services use the same YARN service. The Spark History Server listens on port 18088 for Spark 1.6 and on port 18089 for Spark 2.
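Because the two History Servers differ only in their port, a script that inspects completed applications must pick the port that matches the Spark version. The following minimal Python sketch builds the URL for either service. It assumes that the ports are left at the values given above and that the History Server is served over plain HTTP. The helper name history_server_url, the service labels, and the host historyserver.example.com are hypothetical. The default path, /api/v1/applications, is the application-list endpoint of the Spark monitoring REST API.

```python
from urllib.parse import urlunsplit

# History Server ports stated above for the two coexisting services.
# The keys are labels chosen for this sketch, not Cloudera identifiers.
HISTORY_SERVER_PORTS = {"spark1.6": 18088, "spark2": 18089}


def history_server_url(host: str, service: str,
                       path: str = "/api/v1/applications") -> str:
    """Build an HTTP URL for the History Server of the given Spark service."""
    try:
        port = HISTORY_SERVER_PORTS[service]
    except KeyError:
        raise ValueError(f"unknown Spark service: {service!r}") from None
    # Assumes plain HTTP; a TLS-enabled cluster would need "https" instead.
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


if __name__ == "__main__":
    # Placeholder host; replace it with the host that runs the History Server.
    for service in HISTORY_SERVER_PORTS:
        print(service, history_server_url("historyserver.example.com", service))
```

Keeping the ports in a single mapping means that a cluster with non-default ports requires a change in only one place.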

Unsupported Features

Consult CDS Powered by Apache Spark Known Issues for a comprehensive list of features that are not supported with CDS Powered by Apache Spark.