Managing Hadoop API Dependencies in CDH 5

In CDH 3, all of the Hadoop API implementations were confined to a single JAR file (hadoop-core) plus a few of its dependencies. It was relatively straightforward to make sure that classes from these JAR files were available at runtime.

CDH 4 and CDH 5 are more complex: they bundle both MRv1 and MRv2 (YARN). To simplify things, CDH 4 and CDH 5 provide a Maven-based way of managing client-side Hadoop API dependencies that saves you from having to figure out the exact names and locations of all the JAR files needed to provide Hadoop APIs.

In CDH 5, Cloudera recommends that you use a hadoop-client artifact for all clients, instead of managing JAR-file-based dependencies manually.

Flavors of the hadoop-client Artifact

The hadoop-client artifact comes in two flavors: a Maven-based Project Object Model (POM) artifact and a Linux package, also named hadoop-client. The POM artifact lets you manage Hadoop API dependencies at both compile time and run time for Maven- or Ivy-based projects; the Linux package provides a collection of JAR files that you can add to your classpath directly.

Versions of the hadoop-client Artifact

CDH 5 provides two distinct versions of the hadoop-client artifact: one for MRv1 and one for MRv2 (YARN). If you're using the Maven-based POM hadoop-client artifact, the version string distinguishes them: 2.2.0-mr1-cdh5.x.x for MRv1 APIs and 2.2.0-cdh5.x.x for YARN, where you replace each x with the corresponding number of your CDH 5 release. If you're using the Linux package, the location of the JAR files distinguishes them: /usr/lib/hadoop/client-0.20 for MRv1 APIs and /usr/lib/hadoop/client for YARN.
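
The naming rules above can be sketched as a pair of shell functions. This is only an illustration: the function names hadoop_client_version and hadoop_client_dir are hypothetical, and the 2.2.0 base version and both directories are copied from the text above rather than detected from an installation.

```shell
#!/bin/sh
# Hypothetical helper: print the hadoop-client Maven version string for an
# API flavor ("mr1" or "yarn") and a CDH 5 release number.
# The 2.2.0 base version is the one quoted in this document.
hadoop_client_version() {
    flavor=$1
    cdh_release=$2
    case "$flavor" in
        mr1)  printf '2.2.0-mr1-cdh%s\n' "$cdh_release" ;;
        yarn) printf '2.2.0-cdh%s\n' "$cdh_release" ;;
        *)    echo "unknown flavor: $flavor" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print the directory in which the hadoop-client Linux
# package places the JAR files for an API flavor.
hadoop_client_dir() {
    case "$1" in
        mr1)  echo /usr/lib/hadoop/client-0.20 ;;
        yarn) echo /usr/lib/hadoop/client ;;
        *)    echo "unknown flavor: $1" >&2; return 1 ;;
    esac
}

# Example call; 5.x.x is the placeholder used in this document, so pass
# your actual CDH 5 release number instead.
hadoop_client_version yarn 5.x.x
hadoop_client_dir yarn
```

Keeping the flavor as a single argument means that a build script can switch between MRv1 and YARN in one place instead of editing a version string and a path separately.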

Using hadoop-client for Maven-based Java Projects

Make sure you add the following dependency specification to your pom.xml file:

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>VERSION</version>
  </dependency>

where VERSION is either 2.2.0-cdh5.x.x for YARN APIs or 2.2.0-mr1-cdh5.x.x for MRv1 APIs, with each x replaced by the corresponding number of your CDH 5 release.
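
Because the two flavors differ only in the version string, it is easy to select the wrong one. The following is a rough sketch of a check that reads a pom.xml and reports which flavor it declares. The function name hadoop_client_flavor is hypothetical, and the extraction is deliberately naive: it is line-based, it assumes the version element follows the artifactId element inside the same dependency, and it does not resolve Maven properties.

```shell
#!/bin/sh
# Hypothetical helper: report which API flavor a pom.xml selects, based on
# the hadoop-client version string it declares.
hadoop_client_flavor() {
    pom=${1:-pom.xml}
    # Naive, line-based extraction: take the first <version> element found
    # between the hadoop-client <artifactId> line and the next </dependency>.
    version=$(sed -n '/<artifactId>hadoop-client<\/artifactId>/,/<\/dependency>/p' "$pom" \
        | sed -n 's/.*<version>\(.*\)<\/version>.*/\1/p' \
        | head -n 1)
    case "$version" in
        '')      echo "no hadoop-client version found in $pom" >&2; return 1 ;;
        *'${'*)  echo "unresolved property ($version)" >&2; return 1 ;;
        *-mr1-*) echo "MRv1 ($version)" ;;
        *)       echo "YARN ($version)" ;;
    esac
}

# Example call, run only when a pom.xml exists in the current directory.
if [ -f pom.xml ]; then
    hadoop_client_flavor pom.xml
fi
```

For anything beyond a quick sanity check, inspecting the dependencies that Maven actually resolves is more reliable than reading the file as text.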

Using hadoop-client for Ivy-based Java Projects

Make sure you add the following dependency specification to your ivy.xml file:

  <dependency org="org.apache.hadoop" name="hadoop-client" rev="VERSION"/>

where VERSION is either 2.2.0-cdh5.x.x for YARN APIs or 2.2.0-mr1-cdh5.x.x for MRv1 APIs, with each x replaced by the corresponding number of your CDH 5 release.

Using JAR Files Provided in the hadoop-client Package

Make sure you add to your project all of the JAR files provided under /usr/lib/hadoop/client-0.20 (for MRv1 APIs) or /usr/lib/hadoop/client (for YARN).

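For example, you can add this location to the JVM classpath. The asterisk is escaped so that the shell passes it through unchanged and the JVM expands it to every JAR file in the directory: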

$ export CLASSPATH=/usr/lib/hadoop/client-0.20/\*
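
The same idea can be wrapped in a small function so that the directory is chosen in one place and existing classpath entries are preserved. This is a minimal sketch: the function name hadoop_client_classpath is hypothetical, and the default directory is the MRv1 location given above.

```shell
#!/bin/sh
# Hypothetical helper: build a classpath value that uses a JVM wildcard for
# the hadoop-client JAR directory, optionally followed by extra entries.
hadoop_client_classpath() {
    client_dir=${1:-/usr/lib/hadoop/client-0.20}
    extra=${2:-}
    # The asterisk sits inside a quoted format string, so the shell never
    # expands it; the JVM expands it to the JAR files in the directory.
    if [ -n "$extra" ]; then
        printf '%s/*:%s\n' "$client_dir" "$extra"
    else
        printf '%s/*\n' "$client_dir"
    fi
}

# Example: use the YARN client JARs plus the current directory.
CLASSPATH=$(hadoop_client_classpath /usr/lib/hadoop/client .)
export CLASSPATH
```

Quoting the wildcard, as the function does, has the same effect as the backslash in the command above.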