The Apache Crunch™ project develops and supports Java APIs that simplify the process of creating data pipelines on top of Apache Hadoop. The Crunch APIs are modeled after FlumeJava, which is the library that Google uses for building data pipelines on top of their own implementation of MapReduce.
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch library is a simple Java API for tasks such as joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
The packaging options for installing Crunch are:
- RPM packages
- Debian packages
There are two Crunch packages:
- crunch: provides all the functionality of Crunch, allowing users to create data pipelines over execution engines such as MapReduce and Spark.
- crunch-doc: the documentation package.
Installing and Upgrading Crunch
To install the Crunch packages:
To install or upgrade Crunch on a Red Hat system:
$ sudo yum install crunch
To install or upgrade Crunch on a SLES system:
$ sudo zypper install crunch
To install or upgrade Crunch on an Ubuntu or Debian system:
$ sudo apt-get install crunch
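The three commands above differ only in the package manager, so a provisioning script could pick the right one by checking which tool is on the PATH. The following is a minimal sketch of that idea, not part of the Crunch packages: the function names `crunch_install_cmd` and `detect_pkg_manager` are hypothetical, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: map a package manager name to the matching
# install command from the list above. Returns 1 for anything else.
crunch_install_cmd() {
    case "$1" in
        yum)     echo "sudo yum install crunch" ;;
        zypper)  echo "sudo zypper install crunch" ;;
        apt-get) echo "sudo apt-get install crunch" ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: print the first of the three package managers
# found on the PATH, or return 1 if none is available.
detect_pkg_manager() {
    for pm in yum zypper apt-get; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "$pm"
            return 0
        fi
    done
    return 1
}

# Print (do not run) the install command that applies to this host.
if pm=$(detect_pkg_manager); then
    crunch_install_cmd "$pm"
else
    echo "no supported package manager found" >&2
fi
```

Printing the command instead of executing it keeps the sketch safe to run on any host; a real provisioning script would run the command and check its exit status.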
To install the Crunch documentation (shown here for an Ubuntu or Debian system):
$ sudo apt-get install crunch-doc
The contents of this package are saved under /usr/share/doc/crunch*.
After a package installation, the Crunch jars can be found in /usr/lib/crunch.
If you installed CDH 5 through Cloudera Manager, the CDH 5 parcel includes Crunch and the jars are installed automatically as part of the CDH 5 installation. By default the jars will be found in /opt/cloudera/parcels/CDH/lib/crunch.
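Because the jars end up in one of two directories depending on the installation method, a script that needs them on a classpath can check which directory exists and join the jars it finds there. The following is a minimal sketch under that assumption: the function names `crunch_home` and `crunch_classpath` and the variable `CRUNCH_HOME` are hypothetical, and exporting the result as `HADOOP_CLASSPATH` is one possible use rather than a required step.

```shell
#!/bin/sh
# Hypothetical helper: print the first argument that is an existing
# directory, or return 1 if none of them exists.
crunch_home() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: print a colon-separated list of the *.jar files
# in the given directory (empty output if there are none).
crunch_classpath() {
    dir="$1"
    cp=""
    for jar in "$dir"/*.jar; do
        # Skip the unexpanded pattern when the directory has no jars.
        [ -e "$jar" ] || continue
        cp="${cp:+$cp:}$jar"
    done
    printf '%s\n' "$cp"
}

# Try the package location first, then the default parcel location.
if CRUNCH_HOME=$(crunch_home /usr/lib/crunch /opt/cloudera/parcels/CDH/lib/crunch); then
    HADOOP_CLASSPATH=$(crunch_classpath "$CRUNCH_HOME")
    export HADOOP_CLASSPATH
    echo "Crunch jars found in $CRUNCH_HOME"
else
    echo "Crunch jars not found in either default location" >&2
fi
```

If the parcel directory was changed from its default, the second path passed to `crunch_home` would need to be adjusted accordingly.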