Running a Crunch Application with Spark

The blog post How-to: Run a Simple Apache Spark App in CDH 5 provides a tutorial on writing, compiling, and running a Spark application. Taking that article as a starting point, do the following to run a Crunch application with Spark.

  1. Add both the crunch-core and crunch-spark dependencies to your Maven project, along with the other dependencies shown in the blog post.
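    The dependency additions in step 1 can be sketched as the following Maven fragment. The version numbers are placeholders, not recommendations; use the versions that match your CDH distribution.

    ```xml
    <!-- Crunch core API plus the Spark execution backend.
         Versions shown are illustrative only. -->
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-core</artifactId>
      <version>${crunch.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-spark</artifactId>
      <version>${crunch.version}</version>
    </dependency>
    ```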
  2. Use SparkPipeline (org.apache.crunch.impl.spark.SparkPipeline) where you would have used an MRPipeline instance when declaring your Crunch pipeline. The SparkPipeline constructor takes either a String containing the Spark master URL (local for local mode, yarn-client for YARN) or an actual JavaSparkContext instance.
  3. Update the SPARK_SUBMIT_CLASSPATH:
    export SPARK_SUBMIT_CLASSPATH=./commons-codec-1.4.jar:$SPARK_HOME/assembly/lib/*:./myapp-jar-with-dependencies.jar
  4. Start the pipeline by passing your Crunch application's jar-with-dependencies file to the spark-submit script, just as you would for a regular Spark application.
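The launch in step 4 might look like the following. The main class, jar name, and paths are placeholders for your own application's values.

```shell
# Submit the Crunch application to Spark on YARN.
# com.example.MyCrunchApp and the paths are illustrative.
spark-submit \
  --class com.example.MyCrunchApp \
  --master yarn-client \
  myapp-jar-with-dependencies.jar \
  /user/me/input /user/me/output
```

For a quick local test, replace yarn-client with local so Spark runs in-process without a cluster.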