This is the documentation for CDH 5.1.x.
Documentation for other versions is available at Cloudera Documentation.

Running Crunch with Spark

The blog post How-to: Run a Simple Apache Spark App in CDH 5 provides a tutorial on writing, compiling and running a Spark application. Taking that article as a starting point, do the following to run Crunch with Spark.
  1. Add both the crunch-core and crunch-spark dependencies to your Maven project, along with the other dependencies shown in the blog post.
  2. Use SparkPipeline (org.apache.crunch.impl.spark.SparkPipeline) wherever you would have used an MRPipeline instance in the declaration of your Crunch pipeline. The SparkPipeline constructor takes either a String containing the connection string for the Spark master (local for local mode, yarn-client for YARN) or an existing JavaSparkContext instance.
  3. Update the SPARK_SUBMIT_CLASSPATH:
    export SPARK_SUBMIT_CLASSPATH=./commons-codec-1.4.jar:$SPARK_HOME/assembly/lib/*:./myapp-jar-with-dependencies.jar
    Important: The commons-codec-1.4 dependency must come before the SPARK_HOME dependencies.

  4. Start the pipeline by passing your Crunch application's jar-with-dependencies file to the spark-submit script, just as you would for a regular Spark application.
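
Steps 3 and 4 can be combined into a small launch script. The following is a minimal sketch rather than a definitive procedure. It assumes a default SPARK_HOME of /usr/lib/spark when the variable is not already set, a hypothetical main class named com.example.MyCrunchApp, and the yarn-client master mentioned in step 2. Adjust all three for your environment. The script only prints the spark-submit command, so it is safe to run on a machine without Spark installed.

```shell
# Assumed install location (hypothetical default); an existing SPARK_HOME wins.
SPARK_HOME="${SPARK_HOME:-/usr/lib/spark}"

# Application jar from step 3; the main class name is hypothetical.
APP_JAR=./myapp-jar-with-dependencies.jar
MAIN_CLASS=com.example.MyCrunchApp

# commons-codec must come before the Spark assembly jars on the classpath.
# The * is kept literal inside the quotes so the JVM expands it, not the shell.
export SPARK_SUBMIT_CLASSPATH="./commons-codec-1.4.jar:$SPARK_HOME/assembly/lib/*:$APP_JAR"

# Build the submit command and print it instead of executing it.
SUBMIT_CMD="spark-submit --class $MAIN_CLASS --master yarn-client $APP_JAR"
echo "$SUBMIT_CMD"
```

To launch the pipeline for real, run the printed command in the same shell session so that the exported SPARK_SUBMIT_CLASSPATH is still in effect.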