Running Crunch with Spark
The blog post "How-to: Run a Simple Apache Spark App in CDH 5" provides a tutorial on writing, compiling, and running a Spark application. Taking that article as a starting point, do the following to run Crunch with Spark:
- Add both the crunch-core and crunch-spark dependencies to your Maven project, along with the other dependencies shown in the blog post.
- In the declaration of your Crunch pipeline, use a SparkPipeline (org.apache.crunch.impl.spark.SparkPipeline) instance where you would have used an MRPipeline instance. The SparkPipeline constructor requires either a String containing the connection string for the Spark master (local for local mode, yarn-client for YARN) or an existing JavaSparkContext instance.
- Update the
The commons-codec-1.4 dependency must come before the SPARK_HOME dependencies.
- Start the pipeline by passing your Crunch application's jar-with-dependencies file to the spark-submit script, just as you would for a regular Spark application.
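
The SparkPipeline step above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name CrunchSparkSketch, the application name, the helper masterOrDefault, and the argument order (input path, output path, optional Spark master) are all hypothetical. It assumes crunch-core and crunch-spark are on the classpath, and it uses the SparkPipeline constructor that takes a connection string and an application name.

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.spark.SparkPipeline;

// Hypothetical example class: counts identical lines in a text file.
public class CrunchSparkSketch {

  // Hypothetical helper: use the third argument as the Spark master
  // connection string, falling back to "local" (local mode) if absent.
  static String masterOrDefault(String[] args) {
    return args.length > 2 ? args[2] : "local";
  }

  public static void main(String[] args) {
    String inputPath = args[0];
    String outputPath = args[1];

    // SparkPipeline replaces MRPipeline; the rest of the Crunch code
    // is written against the Pipeline interface and stays the same.
    Pipeline pipeline =
        new SparkPipeline(masterOrDefault(args), "crunch-spark-sketch");

    PCollection<String> lines = pipeline.readTextFile(inputPath);
    // count() yields each distinct line paired with its number of occurrences.
    PTable<String, Long> counts = lines.count();
    pipeline.writeTextFile(counts, outputPath);

    // done() runs any remaining work and releases pipeline resources.
    PipelineResult result = pipeline.done();
    System.exit(result.succeeded() ? 0 : 1);
  }
}
```

Because only the pipeline declaration refers to SparkPipeline, switching an existing MRPipeline application over should mostly come down to changing that one line. To reuse a Spark context you have already created, you could instead pass a JavaSparkContext instance together with the application name.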