Building and Running a Crunch Application with Spark

Developing and Running a Spark WordCount Application provides a tutorial on writing, compiling, and running a Spark application. Using the tutorial as a starting point, do the following to build and run a Crunch application with Spark:

Along with the other dependencies shown in the tutorial, add the appropriate version of the crunch-core and crunch-spark dependencies to the Maven project.

<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-core</artifactId>
  <version>${crunch.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-spark</artifactId>
  <version>${crunch.version}</version>
  <scope>provided</scope>
</dependency>

Use SparkPipeline where you would have used MRPipeline in the declaration of your Crunch pipeline. SparkPipeline takes either a String that contains the connection string for the Spark master (local for local mode, yarn for YARN) or a JavaSparkContext instance.
As of CDH 6.0.0, CDH does not include the Crunch JARs by default. When you build your project, create an uber JAR that contains the Crunch libraries but no other CDH dependencies. Because Maven leaves provided-scope dependencies out of an assembled or shaded JAR, declare the Crunch dependencies with the default compile scope in that case. For more information and example configurations, see the Apache Crunch Guide.
As you would for any Spark application, use spark-submit to start the pipeline, passing your Crunch application's app-jar-with-dependencies.jar file.
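
The uber JAR mentioned above could be produced in several ways. The following is a minimal sketch of one of them, assuming the project uses the maven-assembly-plugin with its built-in jar-with-dependencies descriptor. The JAR file name in the steps suggests this setup, but the build configuration itself is not shown in this guide, so treat the fragment as an illustration. The execution id make-assembly is an arbitrary label.

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <!-- Built-in descriptor: unpacks the project's dependencies into one JAR -->
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <!-- Arbitrary id; binds the assembly to the package phase -->
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With a fragment like this in the POM, mvn package writes an additional artifact named artifactId-version-jar-with-dependencies.jar to the target directory, next to the regular JAR.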

For an example, see the Crunch demo. After building the example, run it with the following command:

spark-submit --class com.example.WordCount crunch-demo-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://namenode_host:8020/user/hdfs/input hdfs://namenode_host:8020/user/hdfs/output
