This is the documentation for CDH 4.7.0.
Documentation for other versions is available at Cloudera Documentation.

What's New in CDH4.2.0

Oracle JDK 7 Support

CDH4.2 works with Oracle JDK 7 (JDK 1.7) with the following restrictions:
  • All CDH components must be running the same major version (that is, all deployed on JDK 6 or all deployed on JDK 7). For example, you cannot run Hadoop on JDK 6 while running Sqoop on JDK 7.
  • Cloudera strongly recommends that applications run against CDH be compiled with JDK 6. Applications compiled with JDK 7 may fail.
  • MRv2 (YARN) is not supported on JDK 7 at present, because of MAPREDUCE-2264. This problem is expected to be fixed in an upcoming release.
  • To make sure everything works correctly, symbolically link the directory where you install the JDK to /usr/java/default on Red Hat and similar systems, or to /usr/lib/jvm/default-java on Ubuntu and Debian systems.

Apache MapReduce

  • Pluggable MapReduce sort. See MAPREDUCE-2454.
  • YARN Fair Share Scheduler improvements:
    • The Fair Share Scheduler now shows up in the ResourceManager web UI.
    • The Fair Share Scheduler contains a number of stability improvements. It now supports ACLs, and a number of bugs that caused the ResourceManager to crash have been fixed.
    • The Fair Share Scheduler now supports hierarchical queues. Each parent queue shares resources assigned to it fairly between its children.
    • For more information, see http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
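The hierarchical-queue behavior described above can be sketched very roughly: a parent queue's allocation is divided among its child queues. This is a deliberately simplified illustration with made-up queue names; the real Fair Scheduler also accounts for weights, minimum shares, and current demand.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of hierarchical fair sharing: a parent queue
// splits its allocation equally among its children. Queue names and
// the container count are illustrative only.
public class FairShareSketch {
    static Map<String, Double> fairShares(double parentShare, String... children) {
        Map<String, Double> shares = new LinkedHashMap<>();
        double each = parentShare / children.length; // equal split in this sketch
        for (String child : children) {
            shares.put(child, each);
        }
        return shares;
    }

    public static void main(String[] args) {
        // A parent queue with 90 containers shared by three children
        Map<String, Double> shares = fairShares(90, "dev", "prod", "adhoc");
        System.out.println(shares); // each child gets 30.0
    }
}
```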

Apache Pig

Apache Oozie

  • Support for pluggable dual authentication (Kerberos + other) with an example; see Integrating Hadoop Security with Alternate Authentication

  • Support/Configuration for SSL (HTTPS); see Configuring Oozie to use SSL (HTTPS)

  • New EL functions for string manipulation: replaceAll and appendAll

  • New EL function, offset(int n, String timeUnit), for coordinator applications which allows users to specify dataset ranges and instances based on a multiple of timeUnit

  • New workflow example demonstrating the Shell action

  • Added a dryrun option for workflows

  • Added a property to disable forkjoin validation for a specific workflow
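The new string-manipulation EL functions are built on Java regular-expression semantics, so the behavior of replaceAll can be previewed with java.lang.String.replaceAll. The sample value below is illustrative, not taken from a real workflow.

```java
// Illustrates the Java regex semantics underlying Oozie's replaceAll
// EL function (java.lang.String.replaceAll). The input string is a
// made-up example of a nominal-time value.
public class ReplaceAllDemo {
    public static void main(String[] args) {
        String input = "2013-02-15";
        String compact = input.replaceAll("-", ""); // strip the dashes
        System.out.println(compact);                // 20130215
    }
}
```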

Apache Flume

  • Based on Flume 1.3.0; includes all of the fixes that went into Flume 1.3.1, as well as additional bug fixes and features.
  • Added an extensible HTTP source. See FLUME-1199.
  • Added an extensible "spooling directory" source to ingest rotated log files into Flume. See FLUME-1425, FLUME-1633.
  • Added support for HBase security in the HBase sink. See FLUME-1626.
  • Added support for embedding a lightweight Flume agent into client applications. See FLUME-1502.
  • Added a JMS source. See FLUME-924.
  • Added support for a plugins.d directory. See FLUME-1735.
  • Added an interceptor to extract content from events based on regular expressions. See FLUME-1657.
  • Added support for customizing how SequenceFiles are written to HDFS. See FLUME-1100.
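As a conceptual sketch of what the regex-based interceptor (FLUME-1657) does, capture groups pull fields out of an event body so they can be attached as headers. The pattern and sample body below are illustrative, not Flume API calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch: extract fields from an event body with regex
// capture groups, the way a regex extractor interceptor would before
// attaching them as event headers. Pattern and body are made up.
public class RegexExtractSketch {
    public static void main(String[] args) {
        String body = "ERROR 2013-01-31 disk full on /data";
        Pattern p = Pattern.compile("^(\\w+) (\\d{4}-\\d{2}-\\d{2})");
        Matcher m = p.matcher(body);
        if (m.find()) {
            System.out.println("level=" + m.group(1)); // level=ERROR
            System.out.println("date=" + m.group(2));  // date=2013-01-31
        }
    }
}
```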

Apache Hive

  • Includes Hive version 0.10.0, including nearly 200 bug fixes and enhancements.
  • Supports a new data type: DECIMAL, based on Java's BigDecimal, which represents immutable, arbitrary-precision decimal numbers. See Hive Data Types for more information.
  • The JDBC driver incorporates support for the DECIMAL data type.
  • Supports external Hive tables whose data are stored in an Azure blob store or Azure Storage Volumes (ASV). See HIVE-3146 for more information.
  • Supports CUBE and ROLLUP with GROUP BY. See HIVE-3433 and HIVE-2397 for more information.
  • Allows computation and persistence of optimizer statistics on columns, at both the table and partition level. The analyze statement has been extended to compute statistics on columns and make them persistent. See HIVE-1362 for more information. The new syntax is:
    analyze table t [partition p] compute statistics for [columns c,...];

Note that if a partition is specified, statistics are gathered only for that partition. To gather statistics for the entire table, omit the partition clause.
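Because the new DECIMAL type is backed by Java's BigDecimal, it gives exact, arbitrary-precision arithmetic rather than binary floating-point rounding. The values below are illustrative.

```java
import java.math.BigDecimal;

// Shows why a BigDecimal-backed DECIMAL type matters: binary
// floating point cannot represent 0.1 and 0.2 exactly, but
// BigDecimal arithmetic is exact. BigDecimal values are also
// immutable: add() returns a new object.
public class DecimalDemo {
    public static void main(String[] args) {
        System.out.println(0.1 + 0.2); // 0.30000000000000004
        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        System.out.println(sum);       // 0.3
    }
}
```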

Apache HCatalog

  • CDH4.2 HCatalog is based on Apache HCatalog 0.4, with a number of improvements and bug fixes picked from the 0.5 release branch.

Apache HBase

CDH4.2 HBase is based on Apache HBase 0.94.2, with some improvements and bug fixes picked from the 0.94.3 release. This is a significant upgrade from CDH4.1.3, which was based on Apache HBase 0.92.1.

Major changes:

  • Snapshots HBASE-6055
  • HLog Compression HBASE-4608: improves write throughput by enabling compression of the Write Ahead Log. This feature is turned off by default.
  • Replication improvements:
    • Support for starting and stopping the replication stream at the peer level. See Disabling Replication at the Peer Level.
    • Compatibility with HLog compression: HBASE-5778 is resolved, allowing you to use HLog compression in tandem with HBase replication.
  • Atomic append HBASE-4102: a new API that appends to an existing KeyValue. The append is performed on the server side, so clients do not need to do a read-modify-write cycle on their end.
  • Multiple performance improvements, such as lazy-seek optimizations (HBASE-4465).
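The atomic-append semantics above can be sketched with a concurrent-map analogy: the combine step happens inside the store rather than in caller code, so no separate read and write are needed in the client. This is a conceptual illustration only, not HBase API code; the key string is made up.

```java
import java.util.concurrent.ConcurrentHashMap;

// Analogy for server-side atomic append: merge() applies the combine
// function inside the map, so the caller never does get()-then-put().
// The "row:family:qualifier" key is illustrative, not an HBase call.
public class AtomicAppendSketch {
    public static void main(String[] args) {
        ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();
        store.put("row1:cf:col", "hello");
        // Atomic append: no read-modify-write in the client.
        store.merge("row1:cf:col", " world", String::concat);
        System.out.println(store.get("row1:cf:col")); // hello world
    }
}
```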

Apache HDFS

  • Previously, using short-circuit reads required disabling security and reconfiguring the UNIX users and groups on the system. HDFS-347 allows you to run a secure cluster with short-circuit local reads without modifying the UNIX users and groups. There are also some performance improvements.

Hue

  • The Oozie application has been restyled completely and now supports Ajax refreshes
  • A Cloudera Impala application has been added
  • The Beeswax/Hive editor is more user-friendly
  • FileBrowser has been restyled and now includes bulk and recursive operations (for example, multiple deletes)
  • JobBrowser is compatible with YARN and job logs can be accessed in one click
  • UserAdmin has been restyled and LDAP integration has been improved
  • MySQL, InnoDB, and PostgreSQL are officially supported

Apache Sqoop

  • Upgraded to upstream release 1.4.2
  • Custom schema support for Microsoft SQL Server and PostgreSQL
  • Support for export with use of the pg_dump utility on PostgreSQL
  • Table hints support for Microsoft SQL Server

Apache Sqoop 2

  • Sqoop 2, a new client-server version of Sqoop, is a newly added component available alongside the existing Sqoop 1

Apache Avro

  • Upgraded to upstream release 1.7.3.
  • Added specification and Java implementation of the Trevni columnar file format.