Apache Crunch Guide
The Apache Crunch™ project develops and supports Java APIs that simplify the process of creating data pipelines on top of Apache Hadoop. The Crunch APIs are modeled after FlumeJava, which is the library that Google uses for building data pipelines on top of their own implementation of MapReduce.
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
Cloudera supports Crunch on the same operating systems and Java versions as CDH. For more information, see CDH and Cloudera Manager Supported Operating Systems and Java Requirements.
As of CDH 6.0.0, Apache Crunch is no longer available as an RPM, Debian package, parcel, or tarball. To use Crunch with CDH 6, you must configure your Java or Scala project dependencies to include the Crunch libraries. You can do this with a build tool that supports Maven or Ivy repositories, or by adding the JAR files from the repository to the classpath.
Maven Repository URL: https://repository.cloudera.com/artifactory/cloudera-repos/
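For builds that resolve dependencies with Ivy rather than Maven, the same repository can be declared as a Maven-compatible resolver. The following is a minimal sketch, not a tested configuration: the resolver names (central, cloudera, default-chain) and the project coordinates (com.example, crunch-example) are hypothetical, and the Crunch version is the one used in the Maven examples in this guide.

```xml
<!-- ivysettings.xml: resolve from Maven Central first, then the Cloudera repository -->
<ivysettings>
  <settings defaultResolver="default-chain"/>
  <resolvers>
    <chain name="default-chain">
      <ibiblio name="central" m2compatible="true"/>
      <ibiblio name="cloudera" m2compatible="true"
               root="https://repository.cloudera.com/artifactory/cloudera-repos/"/>
    </chain>
  </resolvers>
</ivysettings>
```

```xml
<!-- ivy.xml: declare the Crunch core library as a dependency -->
<ivy-module version="2.0">
  <!-- Hypothetical project coordinates; replace with your own -->
  <info organisation="com.example" module="crunch-example"/>
  <dependencies>
    <dependency org="org.apache.crunch" name="crunch-core" rev="0.11.0-cdh6.0.0"/>
  </dependencies>
</ivy-module>
```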
Adding Crunch Maven Dependencies
The following examples show how to add Apache Crunch dependencies to your projects:
- Using the Cloudera repository in a Maven project:
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
- Specifying the Crunch dependency in a Maven project with MapReduce 2 (YARN):
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-core</artifactId>
  <version>0.11.0-cdh6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.0.0-cdh6.0.0</version>
  <scope>provided</scope>
</dependency>
- Specifying the Crunch dependency in a Maven project with Spark 2 (YARN):
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-core</artifactId>
  <version>0.11.0-cdh6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-spark</artifactId>
  <version>0.11.0-cdh6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.0.0-cdh6.0.0</version>
</dependency>
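The fragments above are not complete files on their own: in a pom.xml, the repository element belongs inside a repositories element and each dependency inside a dependencies element. The following sketch shows one way the MapReduce 2 variant might fit together. The project coordinates (com.example, crunch-example) and the property names crunch.version and hadoop.version are hypothetical; the versions are the ones listed above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Hypothetical project coordinates; replace with your own -->
  <groupId>com.example</groupId>
  <artifactId>crunch-example</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <properties>
    <!-- Hypothetical property names; keeping versions here avoids repeating them -->
    <crunch.version>0.11.0-cdh6.0.0</crunch.version>
    <hadoop.version>3.0.0-cdh6.0.0</hadoop.version>
  </properties>

  <repositories>
    <!-- Cloudera repository that hosts the CDH builds of Crunch -->
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

  <dependencies>
    <!-- Crunch is not present on CDH 6 clusters, so it uses the default (compile) scope -->
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-core</artifactId>
      <version>${crunch.version}</version>
    </dependency>
    <!-- Hadoop is supplied by the cluster at run time, hence the provided scope -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
```

For the Spark 2 variant, the same skeleton would carry the crunch-spark and spark-core_2.11 dependencies in place of hadoop-client.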
Building Your Project
Because CDH 6 clusters do not contain the Crunch libraries, you must create a so-called fat or uber JAR of your project that also contains Crunch. Make sure that the uber JAR contains only Crunch dependencies, and not any other CDH dependencies.
Alternatively, you can download the Crunch JAR files from the Maven repository and add them to the classpath when you run the job. Cloudera recommends using the uber JAR method.
The following example demonstrates using the Maven Shade plugin to create a JAR that contains only the Crunch dependencies:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.0.0</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <artifactSet>
              <includes>
                <!-- groupId:artifactId pattern; matches every artifact in the org.apache.crunch group -->
                <include>org.apache.crunch:*</include>
              </includes>
            </artifactSet>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>