CDH 4.2.0

Cloudera’s 100% Open Source Hadoop Platform

CDH is Cloudera's open source software distribution and consists of Apache Hadoop and additional key open source projects to ensure you get the most out of Hadoop and your data.

It is the only Hadoop solution to offer unified querying options (including batch processing, interactive SQL, text search, and machine learning) and necessary enterprise security features (such as role-based access controls).

Please note: CDH requires manual installation from the command line. For a faster, automated installation, download Cloudera Manager.

What's New in CDH4.2.0

Oracle JDK 7 Support

CDH4.2 works with Oracle JDK 7 (JDK 1.7) with the following restrictions:

  • All CDH components must be running the same major version (that is, all deployed on JDK 6 or all deployed on JDK 7). For example, you cannot run Hadoop on JDK 6 while running Sqoop on JDK 7.
  • Cloudera strongly recommends that applications run against CDH be compiled with JDK 6. Applications compiled with JDK 7 may fail.
  • MRv2 (YARN) is not supported on JDK 7 at present, because of MAPREDUCE-2264. This problem is expected to be fixed in an upcoming release.
  • To make sure everything works correctly, symbolically link the directory where you install the JDK to /usr/java/default on Red Hat and similar systems, or to /usr/lib/jvm/default-java on Ubuntu and Debian systems.
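As a minimal sketch of the symbolic link described above (the JDK install path is hypothetical; substitute the directory you actually unpacked the JDK into, and run as root):

```shell
# Red Hat and similar systems:
ln -s /usr/java/jdk1.6.0_31 /usr/java/default

# Ubuntu and Debian systems:
ln -s /usr/lib/jvm/jdk1.6.0_31 /usr/lib/jvm/default-java
```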

Apache MapReduce

  • Pluggable MapReduce sort. See MAPREDUCE-2454.
  • YARN Fair Share Scheduler improvements:
    • The Fair Share Scheduler now shows up in the ResourceManager web UI.
    • The Fair Share Scheduler contains a number of stability improvements. It now supports ACLs, and a number of bugs that caused the ResourceManager to crash have been fixed.
    • The Fair Share Scheduler now supports hierarchical queues. Each parent queue shares resources assigned to it fairly between its children.
    • For more information, see
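As an illustrative sketch of hierarchical queues, child queues are declared by nesting queue elements in the Fair Scheduler allocations file; the queue names and weights below are hypothetical:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Resources assigned to "engineering" are shared fairly
       between its two child queues. -->
  <queue name="engineering">
    <weight>2.0</weight>
    <queue name="etl"/>
    <queue name="adhoc"/>
  </queue>
  <queue name="marketing">
    <weight>1.0</weight>
  </queue>
</allocations>
```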

Apache Pig

Apache Oozie

  • Support for pluggable dual authentication (Kerberos + other) with an example; see Integrating Hadoop Security with Alternate Authentication.

  • Support/Configuration for SSL (HTTPS); see Configuring Oozie to use SSL (HTTPS).

  • New EL functions for string manipulation: replaceAll and appendAll

  • New EL function, offset(int n, String timeUnit), for coordinator applications which allows users to specify dataset ranges and instances based on a multiple of timeUnit

  • New workflow example demonstrating the Shell action

  • Added a dryrun option for workflows

  • Added a property to disable forkjoin validation for a specific workflow
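The offset EL function above is used in a coordinator's dataset instance specification. A minimal sketch follows; the app, dataset, and path names are hypothetical:

```xml
<coordinator-app name="demo-coord" frequency="${coord:days(1)}"
                 start="2013-01-01T00:00Z" end="2013-02-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2013-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="logs">
      <!-- Select the instance 12 hours before the nominal time -->
      <instance>${coord:offset(-12, 'HOUR')}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/demo-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```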

Apache Flume

  • Based on Flume 1.3.0, and includes all of the fixes that went into Flume 1.3.1, as well as additional bug fixes and features.
  • Added an extensible HTTP source. See FLUME-1199.
  • Added an extensible "spooling directory" source to ingest rotated log files into Flume. See FLUME-1425, FLUME-1633.
  • Added support for HBase security in the HBase sink. See FLUME-1626.
  • Added support for embedding a lightweight Flume agent into client applications. See FLUME-1502.
  • Added a JMS source. See FLUME-924.
  • Added support for a plugins.d directory. See FLUME-1735.
  • Added an interceptor to extract content from events based on regular expressions. See FLUME-1657.
  • Added support for customizing how SequenceFiles are written to HDFS. See FLUME-1100.
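As a sketch of the spooling-directory source above, an agent configuration might look like the following; the agent, channel, sink, and directory names are hypothetical:

```properties
# Agent "a1" ingests rotated log files dropped into a spool directory,
# buffers them in a memory channel, and logs them via the logger sink.
a1.sources = spool1
a1.channels = c1
a1.sinks = k1

a1.sources.spool1.type = spooldir
a1.sources.spool1.spoolDir = /var/log/app/spool
a1.sources.spool1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```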

Apache Hive

  • Based on Hive 0.10.0, which includes nearly 200 bug fixes and enhancements.
  • Supports a new data type: DECIMAL, based on Java's BigDecimal, which represents immutable arbitrary-precision decimal numbers. See Hive Data Types for more information.
  • The JDBC driver incorporates support for the DECIMAL data type.
  • Allows computation and persistence of optimizer statistics on columns in both tables and partitions. See HIVE-1362 for more information. The ANALYZE statement has been extended to compute statistics on columns and make them persistent. The new syntax is:
    ANALYZE TABLE t [PARTITION p] COMPUTE STATISTICS FOR [COLUMNS c,...];
  • Supports external Hive tables whose data are stored in an Azure blob store or Azure Storage Volumes (ASV). See HIVE-3146 for more information.
  • Supports CUBE and ROLLUP with GROUP BY. See HIVE-3433 and HIVE-2397 for more information.

Note that if a partition is specified statistics are gathered only for the partition. To gather statistics for the entire table, skip the partition clause.
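For instance (the table and column names here are hypothetical):

```sql
-- Gather column statistics for a single partition only:
ANALYZE TABLE sales PARTITION (dt='2013-01-01')
  COMPUTE STATISTICS FOR COLUMNS price, quantity;

-- Gather column statistics for the entire table
-- (the partition clause is skipped):
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, quantity;
```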

Apache HCatalog

  • CDH4.2 HCatalog is based on Apache HCatalog 0.4, with a number of improvements and bug fixes picked from the 0.5 release branch.

Apache HBase

CDH4.2 HBase is based on Apache HBase 0.94.2, with some improvements and bug fixes picked from the 0.94.3 release. This is a significant upgrade from CDH4.1.3, which was based on Apache HBase 0.92.1.

Major changes:

  • Snapshots HBASE-6055
  • HLog Compression HBASE-4608: improves write throughput by enabling compression of the Write Ahead Log. This feature is turned off by default.
  • Replication improvements:
    • Start and stop of the replication stream: you can now start and stop replication at the peer level. See Disabling Replication at the Peer Level.
    • Compatibility with HLog compression: HBASE-5778 is resolved, allowing you to use HLog compression in tandem with HBase replication.
  • Atomic append HBASE-4102: a new API that atomically appends to an existing KeyValue. The append is performed on the server side, so clients no longer need to do a read-modify-write cycle on their end.
  • Multiple performance improvements, such as lazy-seek optimizations (HBASE-4465).

Apache HDFS

  • Previously, using short-circuit reads required disabling security and reconfiguring the UNIX users and groups on the system. HDFS-347 allows you to run a secure cluster with short-circuit local reads without modifying the UNIX users and groups. There are also some performance improvements.
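A sketch of enabling secure short-circuit local reads in hdfs-site.xml follows; the property names are the standard HDFS-347 settings, and the socket path shown is an example you should adjust for your deployment:

```xml
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- UNIX domain socket the DataNode and clients use to pass
       file descriptors; the path below is an example. -->
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
```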


Hue

  • The Oozie application has been restyled completely and now supports Ajax refreshes
  • A Cloudera Impala application has been added
  • The Beeswax/Hive editor is more user-friendly
  • FileBrowser has been restyled and now includes bulk and recursive operations (for example, multiple deletes)
  • JobBrowser is compatible with YARN and job logs can be accessed in one click
  • UserAdmin has been restyled and LDAP integration has been improved
  • MySQL (InnoDB) and PostgreSQL are officially supported

Apache Sqoop

  • Upgraded to upstream release 1.4.2
  • Custom schema support for Microsoft SQL Server and PostgreSQL
  • Support for export with use of the pg_dump utility on PostgreSQL
  • Table hints support for Microsoft SQL Server

Apache Sqoop 2

  • Sqoop 2 is a newly added component alongside the existing Sqoop 1; it is a new client-server version of Sqoop

Apache Avro

  • Upgraded to upstream release 1.7.3.
  • Added specification and Java implementation of the Trevni columnar file format.

CDH 4.x Requirements and Supported Versions

Supported Operating Systems

CDH4 provides packages for Red-Hat-compatible, SLES, Ubuntu, and Debian systems as described below.

Operating System

  • Red Hat compatible:
    • Red Hat Enterprise Linux (RHEL): 64-bit, 32-bit
    • Oracle Linux with Unbreakable Enterprise Kernel
  • SLES:
    • SUSE Linux Enterprise Server (SLES): 11 with Service Pack 1 or later
  • Ubuntu:
    • Lucid (10.04) - Long-Term Support (LTS)
    • Precise (12.04) - Long-Term Support (LTS)
  • Debian:
    • Squeeze (6.0.3)

  • For production environments, 64-bit packages are recommended. Except as noted above, CDH4 provides only 64-bit packages.
  • Cloudera has received reports that our RPMs work well on Fedora, but we have not tested this.
  • If you are using an operating system that is not supported by Cloudera's packages, you can also download source tarballs from Downloads.

Supported Databases

Supported JDK versions

Supported Internet Protocol