Managing Spark

Apache Spark is a general framework for distributed computing that offers very high performance for both iterative and interactive processing. Spark exposes APIs for Java, Python, and Scala. Spark consists of Spark core and several related projects:
  • Spark SQL - module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
  • Spark Streaming - API that allows you to build scalable fault-tolerant streaming applications.
  • MLlib - library that implements common machine learning algorithms.
  • GraphX - API for graphs and graph-parallel computation.

Cloudera supports Spark core and Spark Streaming. Cloudera does not currently offer commercial support for Spark SQL (including DataFrames), MLlib, or GraphX.

To run applications distributed across a cluster, Spark requires a cluster manager. Cloudera supports two cluster managers: Spark Standalone and YARN. Cloudera does not support running Spark applications on Mesos. On Spark Standalone, Spark application processes run on Spark Master and Worker roles. On YARN, Spark application processes run on YARN ResourceManager and NodeManager roles.
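As an illustration of how the choice of cluster manager shows up when launching an application, the following minimal sketch maps each supported cluster manager to the value passed to the `--master` option of `spark-submit`. The helper name `master_url` and the host name are hypothetical. The sketch assumes the Standalone Master listens on the default port 7077. On older Spark releases the YARN master is written as `yarn-client` or `yarn-cluster` rather than `yarn`, so check the form your release expects.

```python
def master_url(cluster_manager, host=None, port=7077):
    """Return a value for spark-submit's --master option.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    if cluster_manager == "yarn":
        # On YARN, the ResourceManager address comes from the Hadoop
        # client configuration, so no host or port is given here.
        return "yarn"
    if cluster_manager == "standalone":
        if not host:
            raise ValueError("Spark Standalone needs the Master host name")
        # Standalone master URLs have the form spark://HOST:PORT.
        return "spark://{}:{}".format(host, port)
    # Mesos and anything else is rejected, matching the support statement.
    raise ValueError("unsupported cluster manager: {}".format(cluster_manager))


if __name__ == "__main__":
    print(master_url("yarn"))
    print(master_url("standalone", host="master.example.com"))
```

The resulting string would be passed as `spark-submit --master <value> ...`.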

In CDH 5, Cloudera recommends running Spark applications on a YARN cluster manager instead of on a Spark Standalone cluster manager, for the following benefits:
  • You can dynamically share and centrally configure the same pool of cluster resources among all frameworks that run on YARN.
  • You can use all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
  • You choose the number of executors to use; in contrast, Spark Standalone requires each application to run an executor on every host in the cluster.
  • Spark can run against Kerberos-enabled Hadoop clusters and use secure authentication between its processes.
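To make the point about choosing the number of executors concrete, here is a minimal sketch that assembles a `spark-submit` command line for YARN. The function name `yarn_submit_command`, the application file name, and the sizing values are assumptions chosen for illustration. `--num-executors`, `--executor-memory`, and `--executor-cores` are `spark-submit` options for YARN, but confirm the master string and option behavior for your Spark release.

```python
import shlex


def yarn_submit_command(app, num_executors, executor_memory="2g", executor_cores=1):
    """Build a spark-submit argument list for a YARN cluster.

    Hypothetical helper for illustration only; the default sizes are
    placeholders, not recommendations.
    """
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")
    return [
        "spark-submit",
        "--master", "yarn",
        # On YARN the application chooses how many executors to request.
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app,
    ]


if __name__ == "__main__":
    cmd = yarn_submit_command("my_app.py", num_executors=4)
    # Quote each argument so the line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning a list rather than a single string keeps each argument separate, so the result can be handed to `subprocess.run` without shell quoting concerns.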