This is the documentation for Cloudera 5.4.x. Documentation for other versions is available at Cloudera Documentation.

Running a Crunch Application with Spark

The blog post How-to: Run a Simple Apache Spark App in CDH 5 provides a tutorial on writing, compiling, and running a Spark application. Taking that article as a starting point, do the following to run a Crunch application with Spark.
  1. Add both the crunch-core and crunch-spark dependencies to your Maven project, along with the other dependencies shown in the blog post.
  2. In the declaration of your Crunch pipeline, use SparkPipeline (org.apache.crunch.impl.spark.SparkPipeline) where you would otherwise have used MRPipeline. SparkPipeline requires either a String containing the connection string for the Spark master (local for local mode, yarn-client for YARN) or an existing JavaSparkContext instance.
  3. Set the SPARK_SUBMIT_CLASSPATH environment variable:
    export SPARK_SUBMIT_CLASSPATH=./commons-codec-1.4.jar:$SPARK_HOME/assembly/lib/*:./myapp-jar-with-dependencies.jar
      Important: The commons-codec-1.4 dependency must come before the SPARK_HOME dependencies.
  4. Start the pipeline by passing your Crunch application's jar-with-dependencies file to the spark-submit script, just as you would for a regular Spark pipeline.
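
The classpath and spark-submit steps above can be sketched as a small launch script. This is only an illustration under stated assumptions: the main class com.example.MyCrunchApp, the /opt/spark fallback for SPARK_HOME, and the choice of yarn-client as the master are hypothetical placeholders, and the script prints the spark-submit command instead of running it so that it can be checked before use.

```shell
#!/bin/sh
# Sketch only: MAIN_CLASS and the SPARK_HOME fallback are hypothetical placeholders.

# Fall back to an assumed install location when SPARK_HOME is not set.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

APP_JAR="./myapp-jar-with-dependencies.jar"
MAIN_CLASS="com.example.MyCrunchApp"   # hypothetical main class of the Crunch application
MASTER="yarn-client"                   # use "local" for local mode

# commons-codec-1.4 must come before the Spark assembly jars on the classpath.
SPARK_SUBMIT_CLASSPATH="./commons-codec-1.4.jar:$SPARK_HOME/assembly/lib/*:$APP_JAR"
export SPARK_SUBMIT_CLASSPATH

# Assemble the launch command. It is printed here, not executed,
# so the script has no side effects until you run the command yourself.
SUBMIT_CMD="spark-submit --class $MAIN_CLASS --master $MASTER $APP_JAR"
echo "$SUBMIT_CMD"
```

The --master value passed to spark-submit should presumably match the connection string given to SparkPipeline in your application code. To launch the application, run the printed command in the same shell session so that the exported SPARK_SUBMIT_CLASSPATH is still in effect.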
Page generated August 31, 2015.