Managing Spark

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing.

To run applications distributed across a cluster, Spark requires a cluster manager. Cloudera supports two cluster managers: YARN and Spark Standalone. When run on YARN, Spark application processes are managed by the YARN ResourceManager and NodeManager roles. When run on Spark Standalone, Spark application processes are managed by Spark Master and Worker roles.

In CDH 5, Cloudera recommends running Spark applications on the YARN cluster manager rather than on the Spark Standalone cluster manager. Running on YARN has the following benefits:
  • You can dynamically share and centrally configure the same pool of cluster resources among all frameworks that run on YARN.
  • You can use all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
  • You choose the number of executors that each application uses; in contrast, Spark Standalone requires each application to run an executor on every host in the cluster.
  • Spark can run against Kerberos-enabled Hadoop clusters and use secure authentication between its processes.
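
As an illustration of the executor and scheduler points above, the following minimal sketch assembles a spark-submit command line for an application run on YARN. It only builds and prints the command; it does not launch anything. The helper name build_spark_submit_args, the JAR name, the class name, the queue name, and the resource values are hypothetical placeholders. The --num-executors and --queue options are spark-submit options that apply only when running on YARN.

```python
import shlex


def build_spark_submit_args(app_jar, main_class, num_executors,
                            queue="default", executor_memory="2g",
                            executor_cores=2):
    """Return the argument list for a spark-submit run on YARN.

    Hypothetical helper for illustration; it is not part of Spark or CDH.
    """
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")
    return [
        "spark-submit",
        "--master", "yarn",            # use YARN as the cluster manager
        "--deploy-mode", "cluster",    # run the driver inside the cluster
        "--class", main_class,
        "--queue", queue,              # YARN scheduler queue for this workload
        "--num-executors", str(num_executors),  # you pick the executor count
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app_jar,
    ]


# Placeholder application details (assumptions, not taken from the guide).
args = build_spark_submit_args(
    "my-app.jar", "com.example.MyApp", num_executors=4, queue="analytics"
)

# Quote each argument so the printed line can be pasted into a shell.
command_line = " ".join(shlex.quote(a) for a in args)
print(command_line)
```

Building the arguments as a list keeps each option and its value separate, so the same list could be passed to a process launcher without additional shell quoting.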