Building and Running a Crunch Application with Spark

Developing and Running a Spark WordCount Application provides a tutorial on writing, compiling, and running a Spark application. Using the tutorial as a starting point, do the following to build and run a Crunch application with Spark:

  1. Along with the other dependencies shown in the tutorial, add the appropriate version of the crunch-core and crunch-spark dependencies to the Maven project.
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-core</artifactId>
      <version>${crunch.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-spark</artifactId>
      <version>${crunch.version}</version>
      <scope>provided</scope>
    </dependency>
    
    
  2. Use SparkPipeline where you would have used MRPipeline in the declaration of your Crunch pipeline. SparkPipeline takes either a String that contains the connection string for the Spark master (local for local mode, yarn for YARN) or a JavaSparkContext instance.
  3. As you would for a Spark application, use spark-submit to start the pipeline, passing your Crunch application's app-jar-with-dependencies.jar file.
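
As an illustration of step 2, the following is a minimal sketch of a word count pipeline declared with SparkPipeline instead of MRPipeline. It is not the source of the Crunch demo. The class name SparkWordCountSketch, the application name, the tokenize helper, and the hard-coded yarn master are assumptions made for this example; the input and output paths are taken from the command-line arguments.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.types.writable.Writables;

// Hypothetical class name; not part of the Crunch demo.
public class SparkWordCountSketch {

  // Hypothetical helper: splits a line on whitespace and drops empty tokens.
  static List<String> tokenize(String line) {
    List<String> words = new ArrayList<>();
    for (String word : line.trim().split("\\s+")) {
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
    return words;
  }

  public static void main(String[] args) {
    // Assumption: run on YARN. Use "local" for local mode, or pass an
    // existing JavaSparkContext to the SparkPipeline constructor instead.
    String master = "yarn";

    // SparkPipeline replaces MRPipeline; the rest of the pipeline is unchanged.
    Pipeline pipeline =
        new SparkPipeline(master, "crunch-wordcount-sketch", SparkWordCountSketch.class);

    // args[0] is the input path, args[1] is the output path.
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Emit one element per word in each input line.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : tokenize(line)) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // Count occurrences of each distinct word and write the result as text.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);

    // done() runs the pipeline and releases its resources.
    PipelineResult result = pipeline.done();
    System.exit(result.succeeded() ? 0 : 1);
  }
}
```

Only the constructor call differs from an MRPipeline-based application, so an existing Crunch pipeline can usually be switched by changing that one line and adding the crunch-spark dependency.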

For an example, see Crunch demo. After building the example, run it with the following command:

spark-submit --class com.example.WordCount crunch-demo-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://namenode_host:8020/user/hdfs/input hdfs://namenode_host:8020/user/hdfs/output