This is the documentation for Cloudera Manager 4.8.5.
Documentation for other versions is available at Cloudera Documentation.

Installing Spark with Cloudera Manager

Apache Spark is a fast and general-purpose cluster computing system with support for in-memory computation. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.

Continue reading:

  1. Ports Used by Spark
  2. Limitations
  3. Installing the Spark Parcel
  4. Configuring and Starting the Spark Service
  5. Testing Your Spark Setup

Ports Used by Spark

  • 7077 – Default Master RPC port
  • 7078 – Default Worker RPC port
  • 18080 – Default Master web UI port
  • 18081 – Default Worker web UI port

For more information, see the Apache Spark 0.9 documentation.
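
Once the Master and Worker roles are running (see Configuring and Starting the Spark Service below), one quick way to confirm that the web UI ports are reachable is with curl. The host names master_fqdn and worker_fqdn are placeholders for your own hosts:

curl -s -o /dev/null -w "%{http_code}\n" http://master_fqdn:18080    # Master web UI
curl -s -o /dev/null -w "%{http_code}\n" http://worker_fqdn:18081    # Worker web UI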

Limitations

For the Spark 0.9 release, the following limitations apply:
  • Spark does not work with secure HDFS
  • Running Spark on YARN is not supported

Installing the Spark Parcel

  1. Ensure that CDH was installed using parcels. If it was installed from packages, migrate to a parcel-based installation before proceeding.
  2. In the Cloudera Manager Admin Console, select Administration > Settings.
  3. Click the Parcels category.
  4. Find the Remote Parcel Repository URLs property and add the location of the parcel repository.
    1. Click the plus sign (+) to open a new field.
    2. Enter the URL of the location of the parcel to install (typically http://archive.cloudera.com/spark/parcels/latest).
    3. Click Save Changes to save your changes.
  5. From the Hosts page, click the Parcels tab. The Spark parcel should appear in the set of parcels available for download.
  6. Download, distribute, and activate the parcel. For general information about installing parcels with Cloudera Manager, see Using Parcels.
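
Once the parcel is activated, its contents are unpacked under /opt/cloudera/parcels on every host. As a quick sanity check, confirm that the Spark directory referenced in the next section exists:

ls -d /opt/cloudera/parcels/SPARK/lib/spark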

Configuring and Starting the Spark Service

The following steps must be performed as the root user from the command line on the host that will run the Spark Master role.
  1. Edit /etc/spark/conf/spark-env.sh (a sample appears after these steps):
    • Set the environment variable STANDALONE_SPARK_MASTER_HOST to the fully qualified domain name of the master host.
    • Set the environment variable DEFAULT_HADOOP_HOME to the location of the Hadoop installation, which is /opt/cloudera/parcels/CDH/lib/hadoop for a parcel installation.
    • Optionally, set the Spark Master's RPC port and web UI port with the environment variables SPARK_MASTER_PORT and SPARK_MASTER_WEBUI_PORT respectively.
  2. Edit /etc/spark/conf/slaves and enter the fully qualified domain names of all Spark worker hosts, one per line.
  3. Sync the contents of /etc/spark/conf to all worker hosts, as shown in the sketch after these steps.
  4. Start the Master role on the Spark Master host. The Master role is responsible for coordinating different Spark applications (Spark contexts).
    /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-master.sh
  5. If you have passwordless SSH configured for root, run the start-slaves.sh script to start all the worker roles:
    /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-slaves.sh
    Otherwise, run the following command on every worker host as root:
    /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master_fqdn:master_port
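
For reference, here is a minimal sketch of the files from steps 1 and 2 and one way to perform the sync in step 3. The host names (master.example.com, worker1.example.com, worker2.example.com) and the use of rsync are examples only; substitute the values and tooling appropriate for your cluster.

# /etc/spark/conf/spark-env.sh (excerpt)
export STANDALONE_SPARK_MASTER_HOST=master.example.com
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export SPARK_MASTER_PORT=7077           # optional; default shown
export SPARK_MASTER_WEBUI_PORT=18080    # optional; default shown

# /etc/spark/conf/slaves
worker1.example.com
worker2.example.com

# Sync the configuration to every worker host (run as root on the master host)
for host in $(cat /etc/spark/conf/slaves); do
  rsync -a /etc/spark/conf/ $host:/etc/spark/conf/
done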

Testing Your Spark Setup

To test your Spark setup, start spark-shell on one of the hosts.
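
One way to launch the shell against the standalone Master is to pass the master URL through the MASTER environment variable, as in Spark 0.9's standalone mode; master_fqdn is a placeholder and 7077 is the default Master RPC port listed above:

MASTER=spark://master_fqdn:7077 /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-shell

Once the shell is up, you can, for example, run a word count: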

// Word count: read a file from HDFS, count the occurrences of each word, and write the result back to HDFS
val file = sc.textFile("hdfs://NameNode:8020/path/to/file")
val counts = file.flatMap(line => line.split(" ")).
                 map(word => (word, 1)).
                 reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://NameNode:8020/output")
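
The saveAsTextFile call writes the result as a set of part files under the output directory. One way to inspect them from the command line (the /output path mirrors the placeholder used above, and the part file names are typical, not guaranteed):

hadoop fs -ls /output
hadoop fs -cat /output/part-00000 | head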

You can monitor the application in the Spark Master web UI, by default at http://spark-master:18080, which shows the Spark shell application, its executors, and its logs.