What's New in CDH4.2.0
Oracle JDK 7 Support
CDH4.2 works with Oracle JDK 7 (JDK 1.7) with the following restrictions:
- All CDH components must run on the same major JDK version (that is, all deployed on JDK 6 or all deployed on JDK 7). For example, you cannot run Hadoop on JDK 6 while running Sqoop on JDK 7.
- Cloudera strongly recommends that applications run against CDH be compiled with JDK 6. Applications compiled with JDK 7 may fail.
- MRv2 (YARN) is not supported on JDK 7 at present, because of MAPREDUCE-2264. This problem is expected to be fixed in an upcoming release.
- To make sure everything works correctly, symbolically link the directory where you install the JDK to /usr/java/default on Red Hat and similar systems, or to /usr/lib/jvm/default-java on Ubuntu and Debian systems.
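The symbolic-link step above could be scripted roughly as follows. This is a minimal sketch, not a Cloudera-provided tool: the helper name `link_jdk`, the `JDK_LINKS` table, and the example install directory are hypothetical, and the demonstration runs inside a scratch directory so that no system path is modified.

```python
import os
import tempfile

# Link locations named in the text, keyed by a hypothetical "family" label.
JDK_LINKS = {
    "redhat": "/usr/java/default",
    "debian": "/usr/lib/jvm/default-java",
}


def link_jdk(jdk_home, link_path):
    """Point link_path at jdk_home, replacing an existing symlink if present."""
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    if os.path.islink(link_path):
        os.remove(link_path)
    os.symlink(jdk_home, link_path)
    # Return the directory the link finally resolves to.
    return os.path.realpath(link_path)


if __name__ == "__main__":
    # Demonstrate under a temporary root instead of the real filesystem root.
    with tempfile.TemporaryDirectory() as root:
        jdk_home = os.path.join(root, "opt", "jdk1.7.0")  # hypothetical install dir
        os.makedirs(jdk_home)
        link = root + JDK_LINKS["redhat"]
        print(link_jdk(jdk_home, link))
```

On a real host the link path would be used without the temporary prefix, and the script would need sufficient privileges to write under /usr.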
- JobTracker High Availability. See Configuring High Availability for the JobTracker (MRv1) for details.
Note: Cloudera Manager does not yet support JobTracker High Availability; do not attempt to use this capability in a Cloudera-managed deployment.
- Pluggable MapReduce sort. See MAPREDUCE-2454.
- YARN Fair Share Scheduler improvements:
- The Fair Share Scheduler now shows up in the ResourceManager web UI.
- The Fair Share Scheduler contains a number of stability improvements. It now supports ACLs, and a number of bugs that caused the ResourceManager to crash have been fixed.
- The Fair Share Scheduler now supports hierarchical queues. Each parent queue shares resources assigned to it fairly between its children.
- For more information, see http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
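The hierarchical-queue behavior described above, where each parent divides its resources fairly among its children, can be illustrated with a simplified model. The sketch below assumes equal weights and ignores minimum shares and actual demand, so it is not the scheduler's real algorithm; the function name and queue names are hypothetical.

```python
def fair_shares(queues, capacity):
    """Return {leaf_queue_path: share} for a nested dict of queues.

    `queues` maps child names to sub-dicts; an empty dict marks a leaf.
    Each parent divides its own share equally among its children.
    """
    shares = {}

    def walk(node, path, amount):
        if not node:  # leaf queue: it keeps whatever its parent handed down
            shares[path] = amount
            return
        per_child = amount / len(node)
        for name, child in node.items():
            walk(child, f"{path}.{name}", per_child)

    walk(queues, "root", capacity)
    return shares


# Hypothetical hierarchy: root -> (prod -> (etl, reports), dev)
example = {"prod": {"etl": {}, "reports": {}}, "dev": {}}
print(fair_shares(example, 100.0))
```

Here `prod` and `dev` each receive half of the root capacity, and `prod` then splits its half between `etl` and `reports`.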
- The DATETIME type is now supported (PIG-1314). For more information, see http://archive.cloudera.com/cdh4/cdh/4/pig-0.10.0-cdh4.2.0/basic.html#data-types.
- The RANK function (as in SQL 99) is now supported (PIG-2353). For more information, see http://archive.cloudera.com/cdh4/cdh/4/pig-0.10.0-cdh4.2.0/basic.html#rank.
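To make the SQL-style ranking semantics concrete, here is a small model in Python rather than Pig Latin. It assumes the usual SQL behavior in which tied rows share a rank and, unless dense ranking is requested, the following rank skips ahead; the function name and the `dense` flag are illustrative and are not Pig syntax.

```python
def rank(rows, key, dense=False):
    """Return (rank, row) pairs with rows sorted ascending by key(row)."""
    ordered = sorted(rows, key=key)
    ranked = []
    prev = object()  # sentinel that never equals a real key
    current = 0
    for position, row in enumerate(ordered, start=1):
        k = key(row)
        if k != prev:
            # Dense ranking counts distinct keys; standard ranking uses the position.
            current = current + 1 if dense else position
            prev = k
        ranked.append((current, row))
    return ranked


scores = [("a", 10), ("b", 20), ("c", 20), ("d", 30)]
print([r for r, _ in rank(scores, key=lambda row: row[1])])
print([r for r, _ in rank(scores, key=lambda row: row[1], dense=True)])
```

The two tied rows share rank 2 in both modes; only the rank assigned to the row after the tie differs.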
- Support for pluggable dual authentication (Kerberos + other), with an example; see Integrating Hadoop Security with Alternate Authentication.
- Support and configuration for SSL (HTTPS); see Configuring Oozie to use SSL (HTTPS).
- New EL functions for string manipulation: replaceAll and appendAll.
- New EL function, offset(int n, String timeUnit), for coordinator applications; it lets users specify dataset ranges and instances as a multiple of timeUnit.
- New workflow example demonstrating the Shell action.
- Added a dryrun option for workflows.
- Added a property to disable forkjoin validation for a specific workflow.
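As a rough illustration of what the two string-manipulation EL functions do, the Python model below mimics their behavior. The argument order (source, pattern or suffix, then replacement or delimiter) is an assumption here, and Python's regular-expression syntax differs from Java's in details such as replacement back-references, so treat this as a sketch rather than a specification.

```python
import re


def replace_all(src, regex, replacement):
    """Replace every match of regex in src with replacement."""
    return re.sub(regex, replacement, src)


def append_all(src, append, delimiter):
    """Split src on delimiter, add append to each piece, and rejoin."""
    return delimiter.join(part + append for part in src.split(delimiter))


print(replace_all("2013-02-15", "-", "/"))
print(append_all("/a/b/,/c/b/,/c/d/", "ADD", ","))
```

In a workflow, the equivalent calls would appear inside an EL expression instead of Python code.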
- Includes Flume version 1.3.0, along with all of the fixes that went into Flume 1.3.1 and additional bug fixes and features.
- Added an extensible HTTP source. See FLUME-1199.
- Added an extensible "spooling directory" source to ingest rotated log files into Flume. See FLUME-1425, FLUME-1633.
- Added support for HBase security in the HBase sink. See FLUME-1626.
- Added support for embedding a lightweight Flume agent into client applications. See FLUME-1502.
- Added a JMS source. See FLUME-924.
- Added support for a plugins.d directory. See FLUME-1735.
- Added an interceptor to extract content from events based on regular expressions. See FLUME-1657.
- Added support for customizing how SequenceFiles are written to HDFS. See FLUME-1100.
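The regular-expression interceptor mentioned above extracts content from events. The following sketch shows the general idea in Python: capture groups are pulled out of an event body and returned under chosen names. The function name, the sample log line, and the header names are hypothetical, and this is not Flume's Java interceptor API.

```python
import re


def extract_headers(body, pattern, names):
    """Return a dict mapping each name to the corresponding capture group."""
    match = re.search(pattern, body)
    if match is None:
        return {}  # nothing extracted when the pattern does not match
    return dict(zip(names, match.groups()))


event_body = "2013-02-15 12:00:01 ERROR disk full"
headers = extract_headers(
    event_body,
    r"^(\d{4}-\d{2}-\d{2}) \S+ (\w+)",  # date, skipped time field, log level
    ["date", "level"],
)
print(headers)
```

Values extracted this way could then be used downstream, for example to route events by log level.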
- Includes Hive version 0.10.0, including nearly 200 bug fixes and enhancements.
- Supports a new data type, DECIMAL, based on Java's BigDecimal, which represents immutable arbitrary-precision decimal numbers. See Hive Data Types for more information.
- The JDBC driver incorporates support for the DECIMAL data type.
- Allows computation and persistence of optimizer statistics on columns in both tables and partitions. See HIVE-1362 for more information.
- Supports external Hive tables whose data are stored in an Azure blob store or Azure Storage Volumes (ASV). See HIVE-3146 for more information.
- Supports CUBE and ROLLUP with GROUP BY. See HIVE-3433 and HIVE-2397 for more information.
- Provides the capability to compute optimizer statistics on columns at both the table and partition level, and make them persistent. The analyze statement has been extended to compute statistics on columns and make them persistent. The new syntax is:
analyze table t [partition p] compute statistics for [columns c,...];
Note that if a partition is specified, statistics are gathered only for that partition. To gather statistics for the entire table, omit the partition clause.
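For readers unfamiliar with the CUBE and ROLLUP support listed above, the sketch below enumerates the grouping sets each one implies under standard SQL semantics: ROLLUP aggregates over successively shorter prefixes of the column list, while CUBE aggregates over every subset. The function and column names are illustrative only, and the code models the semantics rather than Hive's implementation.

```python
from itertools import combinations


def rollup_sets(columns):
    """Prefixes of the column list, longest first, ending with the grand total."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube_sets(columns):
    """Every subset of the column list, largest first."""
    return [
        combo
        for n in range(len(columns), -1, -1)
        for combo in combinations(columns, n)
    ]


cols = ["year", "month"]  # hypothetical GROUP BY columns
print(rollup_sets(cols))
print(cube_sets(cols))
```

The empty tuple in each result stands for the grand-total row, which groups by no columns at all.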
- CDH4.2 HCatalog is based on Apache HCatalog 0.4, with a number of improvements and bug fixes picked from the 0.5 release branch.
CDH4.2 HBase is based on Apache HBase 0.94.2, with some improvements and bug fixes picked from the 0.94.3 release. This is a significant upgrade from CDH4.1.3, which was based on Apache HBase 0.92.1.
- Snapshots (HBASE-6055).
- HLog compression (HBASE-4608): improves write throughput by compressing the Write Ahead Log. This feature is turned off by default.
- Replication improvements:
- Atomic append (HBASE-4102): a new API that appends to an existing KeyValue. The append is performed on the server side, so clients do not need to read, modify, and write back the value themselves.
- Multiple performance improvements, such as lazy-seek optimizations (HBASE-4465).
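The benefit of a server-side atomic append can be shown with a toy model. In the sketch below, a hypothetical `Cell` class stands in for a stored value; its `append` method performs the read and the write under one lock, so concurrent writers cannot overwrite each other. None of these names correspond to the HBase client API.

```python
import threading


class Cell:
    """A single stored value guarded by a 'server-side' lock."""

    def __init__(self):
        self._value = b""
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def append(self, suffix):
        # The read and the write happen under one lock, so no append is lost.
        with self._lock:
            self._value += suffix
            return self._value


def run_appends(cell, writers, per_writer):
    """Start `writers` threads that each append one byte `per_writer` times."""

    def work():
        for _ in range(per_writer):
            cell.append(b"x")

    threads = [threading.Thread(target=work) for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(cell.get())


print(run_appends(Cell(), writers=4, per_writer=250))
```

If each client instead fetched the value, extended it locally, and wrote it back, two clients could read the same old value and one update would be lost; moving the operation to the server avoids that race and saves a round trip.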
- Previously, using short-circuit reads required disabling security and reconfiguring the UNIX users and groups on the system. HDFS-347 allows you to run a secure cluster with short-circuit local reads without modifying the UNIX users and groups. There are also some performance improvements.
- The Oozie application has been restyled completely and now supports Ajax refreshes
- A Cloudera Impala application has been added
- The Beeswax/Hive editor is more user-friendly
- FileBrowser has been restyled and now includes bulk and recursive operations (for example, multiple deletes)
- JobBrowser is compatible with YARN and job logs can be accessed in one click
- UserAdmin has been restyled and LDAP integration has been improved
- MySQL, InnoDB, and PostgreSQL are officially supported
- Upgraded to upstream release 1.4.2
- Custom schema support for Microsoft SQL Server and PostgreSQL
- Support for export using the pg_dump utility on PostgreSQL
- Table hints support for Microsoft SQL Server
Apache Sqoop 2
- A newly added component, provided in addition to the current Sqoop 1; it is a new client-server version of Sqoop
- Upgraded to upstream release 1.7.3.
- Added specification and Java implementation of the Trevni columnar file format.