Installing Spark with Cloudera Manager
Apache Spark is a fast and general-purpose cluster computing system with support for in-memory computation. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
- Ports Used by Spark
- Installing the Spark Parcel
- Configuring and Starting the Spark Service
- Testing Your Spark Setup
Ports Used by Spark
- 7077 – Default Master RPC port
- 7078 – Default Worker RPC port
- 18080 – Default Master web UI port
- 18081 – Default Worker web UI port
For more information, see the Apache Spark 0.9 documentation.
Note the following limitations:
- Spark does not work with secure (Kerberos-enabled) HDFS
- Running Spark on YARN is not supported
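Assuming the default ports listed above, a quick reachability check can confirm that the Master and Worker roles are listening once they have been started. The hostnames below are placeholders for your own; substitute the actual Master and Worker hosts.

```shell
# Sanity checks against the default Spark ports.
# "spark-master" and "spark-worker1" are placeholder hostnames.

# The Master RPC port (7077) should accept TCP connections once the Master is running:
nc -z spark-master 7077 && echo "Master RPC port open"

# The Master and Worker web UIs should respond over HTTP (expect a 200):
curl -s -o /dev/null -w "%{http_code}\n" http://spark-master:18080
curl -s -o /dev/null -w "%{http_code}\n" http://spark-worker1:18081
```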
Installing the Spark Parcel
- Ensure that CDH was installed using parcels. If not, upgrade CDH using parcels.
- In the Cloudera Manager Admin Console, from the Administration tab, select Settings, then go to the Parcels category.
- Find the Remote Parcel Repository URLs property and click to add a new field.
- In the new field, enter the URL of the parcel repository for Spark (typically http://archive.cloudera.com/spark/parcels/latest).
- Click Save Changes.
- From the Hosts page, click the Parcels tab. The parcel for the external application should appear in the set of parcels available for download.
- Download, distribute, and activate the parcel. For general information about installing parcels with Cloudera Manager, see Using Parcels.
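Before adding the URL, you can optionally confirm from the command line that it points at a valid parcel repository: a parcel repository serves a manifest.json file describing the parcels it hosts. This check requires network access to the repository host.

```shell
# Optional sanity check: a valid parcel repository serves a manifest.json
# describing the parcels it hosts.
curl -s http://archive.cloudera.com/spark/parcels/latest/manifest.json | head
```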
Configuring and Starting the Spark Service
The following steps must be performed as the root user from the command line on the host that will run the Spark Master role.
- Edit /etc/spark/conf/spark-env.sh:
- Set the environment variable STANDALONE_SPARK_MASTER_HOST to the fully qualified domain name of the master host.
- Set the environment variable DEFAULT_HADOOP_HOME to the Hadoop installation, which is /opt/cloudera/parcels/CDH/lib/hadoop for a parcel installation.
- Optionally set the Spark Master's port and Web UI port with SPARK_MASTER_PORT and SPARK_MASTER_WEBUI_PORT respectively.
- Edit /etc/spark/conf/slaves and enter the fully qualified domain names of all Spark worker nodes, one name per line.
- Sync the contents of /etc/spark/conf to all nodes.
- Start the Master role on the host that will act as the Spark Master. The Master role is responsible for coordinating different Spark applications (Spark contexts).
- If you have passwordless SSH configured for root, run the start-slaves.sh script to start all the worker roles:
/opt/cloudera/parcels/SPARK/lib/spark/sbin/start-slaves.sh
Otherwise, run the following on every worker node as root:
/opt/cloudera/parcels/SPARK/lib/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<master_ip>:<master_port>
Testing Your Spark Setup
To test your Spark setup, start spark-shell on one of the nodes. For example, run a word count:

val file = sc.textFile("hdfs://namenode:8020/path/to/file")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://namenode:8020/output")
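To point the shell at the standalone Master rather than running in local mode, set the MASTER environment variable before launching it. The hostname below is a placeholder, and the path assumes the parcel layout shown earlier.

```shell
# Launch spark-shell against the standalone Master.
# master.example.com is a placeholder; 7077 is the default Master RPC port.
export MASTER=spark://master.example.com:7077
/opt/cloudera/parcels/SPARK/lib/spark/bin/spark-shell
```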
The application appears in the Spark Master web UI, by default at http://spark-master:18080, where you can see the Spark Shell application, its executors, and logs.