What's New in CDH 5 Beta 2

Apache Crunch

The Apache Crunch™ project develops and supports Java APIs that simplify the process of creating data pipelines on top of Apache Hadoop. The Crunch APIs are modeled after FlumeJava (PDF), which is the library that Google uses for building data pipelines on top of their own implementation of MapReduce. For more information and installation instructions, see Crunch Installation.

Apache DataFu

Upgraded from version 0.4 to 1.1.0 (this upgrade is not backward compatible).
New features include the UDFs SHA, SimpleRandomSample, COALESCE, ReservoirSample, EmptyBagToNullFields, and many others.
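As a rough illustration of the technique behind a reservoir-style sampler such as ReservoirSample, the following Python sketch implements the classic "Algorithm R". It is not DataFu's implementation; the function name `reservoir_sample` and its parameters are hypothetical and chosen only for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by picking
            # a random index in [0, i] and replacing that slot if it is
            # inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The sketch needs only one pass and memory proportional to k, which is why this approach suits inputs whose size is not known in advance.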

Apache Flume

FLUME-2294 - Added a new sink to write Kite datasets.
FLUME-2056 - Spooling Directory Source can now pass just the name of the file, rather than its full path, in the event headers.
FLUME-2155 - File Channel is indexed during replay to improve replay performance for faster startup.
FLUME-2217 - Syslog Sources can optionally preserve all syslog headers in the message body.
FLUME-2052 - Spooling Directory Source can now replace or ignore malformed characters in input files.
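To show what replacing versus ignoring malformed characters (FLUME-2052) means in practice, here is a small Python sketch. It uses Python's own decoder error handlers as an analogy and is not Flume code; the sample bytes are made up for the example.

```python
# A line of UTF-8 text with one invalid byte (0xFF) in the middle.
raw = b"temp=21\xffC"

# "replace": substitute U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

# "ignore": drop the bad byte silently.
ignored = raw.decode("utf-8", errors="ignore")

# Strict decoding, Python's default, raises an error instead.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Replacing keeps a visible marker where data was damaged, while ignoring yields cleaner text at the cost of hiding the problem.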

Apache Hadoop

HDFS

New Features/Improvements:

As of CDH 5 Beta 2, you can upgrade HDFS with high availability (HA) enabled, if you are using Quorum-based storage. (Quorum-based storage is the only method available in CDH 5; NFS shared storage is not supported.) For upgrade instructions, see Upgrading from CDH 4 to CDH 5.
HDFS-4949 - CDH 5 Beta 2 supports Centralized Cache Management in HDFS.
As of CDH 5 Beta 2, you can configure an NFSv3 gateway that allows any NFSv3-compatible client to mount HDFS as a file system on the client's local file system. For more information and instructions, see Configuring an NFSv3 Gateway.
HDFS-5709 - Improve upgrade with existing files and directories named .snapshot.

Major Bug Fixes:

HDFS-5449 - Fix WebHDFS compatibility break.
HDFS-5671 - Fix socket leak in DFSInputStream#getBlockReader.
HDFS-5353 - Short circuit reads fail when dfs.encrypt.data.transfer is enabled.
HDFS-5438 - Flaws in block report processing can cause data loss.

Changed Behavior:

As of CDH 5 Beta 2, the dfs.web.authentication.kerberos.principal property must be defined in hdfs-site.xml for the NameNode to start up on a secure cluster. This is documented in the CDH 5 Security Guide. For clusters managed by Cloudera Manager, you do not need to define this property explicitly.
HDFS-5037 - Active NameNode should trigger its own edit log rolls.
Clients will now retry for a configurable period when encountering a NameNode in Safe Mode.
The default behavior of the mkdir command has changed. As of CDH 5 Beta 2, if the parent directory does not exist, you must specify the -p switch explicitly; otherwise the command fails.
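The effect of the mkdir change can be sketched with a local-filesystem analogy in Python: `os.mkdir` fails when the parent is missing, while `os.makedirs` creates the intermediate directories, much as the -p switch does. This is an analogy only and does not touch HDFS; on a cluster the corresponding command would be of the form `hdfs dfs -mkdir -p <path>`. The directory names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "parent", "child")

    # Without the equivalent of -p: fails because "parent" does not exist.
    try:
        os.mkdir(nested)
        failed_without_parents = False
    except FileNotFoundError:
        failed_without_parents = True

    # With the equivalent of -p: intermediate directories are created.
    os.makedirs(nested)
    created_with_parents = os.path.isdir(nested)
```

Scripts that relied on the old behavior to create nested paths in one step are the ones most likely to need the switch added.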

MapReduce (MRv1 and YARN)

Fair Scheduler (in YARN and MRv1) now supports advanced configuration to automatically place applications in queues.
MapReduce now supports running multiple reducers in uber mode and in local job runner.
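The automatic queue placement mentioned for the Fair Scheduler can be pictured as an ordered chain of rules in which the first rule that yields a queue wins. The Python sketch below is a simplified model under that assumption, not scheduler code; every function name, dictionary key, and queue name in it is hypothetical.

```python
from typing import Callable, List, Optional, Set

# A rule inspects the application and the set of existing queues, and
# returns a queue name, or None to defer to the next rule.
Rule = Callable[[dict, Set[str]], Optional[str]]


def specified(app: dict, existing: Set[str]) -> Optional[str]:
    # Use the queue the submitter asked for, if any.
    return app.get("requested_queue")


def user(app: dict, existing: Set[str]) -> Optional[str]:
    # Use a queue named after the submitting user, if it already exists.
    name = "root." + app["user"]
    return name if name in existing else None


def default(app: dict, existing: Set[str]) -> Optional[str]:
    # Catch-all rule: always returns a queue.
    return "root.default"


def place(app: dict, rules: List[Rule], existing: Set[str]) -> str:
    """Return the queue chosen by the first rule that matches."""
    for rule in rules:
        queue = rule(app, existing)
        if queue is not None:
            return queue
    raise ValueError("no placement rule matched")
```

Ending the chain with a catch-all rule, as in this model, guarantees that every application lands in some queue.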

Apache HBase

Online Schema Change is now a supported feature.
Online Region Merge is now a supported feature.
Namespaces: CDH 5 Beta 2 includes the namespaces feature, which enables different sets of tables to be administered by different administrative users. All upgraded tables will live in the default "hbase" namespace. Administrators may create new namespaces and create tables. Users with rights to the namespace may administer permissions on the tables within the namespace.
There have been several improvements to HBase’s mean time to recovery (MTTR) in the face of Master or RegionServer failures.
- Distributed log splitting has matured and is always activated. The option to use the old, slower splitting mechanism no longer exists.
- Failure detection time has been improved. New notifications are now sent when RegionServers or Masters fail, which triggers corrective action quickly.
- The Meta table has a dedicated write-ahead log, which enables faster region recovery if the RegionServer serving meta goes down.
The Region Balancer has been significantly updated to take more load attributes into account.
Added TableSnapshotInputFormat and TableSnapshotScanner to perform scans over HBase table snapshots from the client side, bypassing the HBase servers. The former configures a MapReduce job, while the latter does a single client-side scan over snapshot files. Both can also be used with offline HBase, with in-place or exported snapshot files.
The KeyValue API has been deprecated for applications in favor of the Cell interface. Users upgrading to HBase 0.96 may still use KeyValue, but future upgrades may remove the class or parts of its functionality. Users are encouraged to update their applications to use the new Cell interface.
Currently Experimental features:
- Distributed log replay: This mechanism allows for faster recovery from RegionServer failures but has one special case where it will violate ACID guarantees. Cloudera does not currently recommend activating this feature.
- Bucket cache: This is an off-heap caching mechanism that uses extra RAM and block devices (such as flash drives) to greatly increase the read caching capabilities provided by the BlockCache. Cloudera does not currently recommend activating this feature.
- Favored nodes: This feature enables HBase to better control where its data is written to in HDFS in order to better preserve performance after a failure. This is disabled currently because it doesn’t interact well with the HBase Balancer or HDFS Balancer. Cloudera does not currently recommend activating this feature.

See this blog post for more details.

Apache Hive

New Features:

Improved JDBC specification coverage:
- Improvements to getDatabaseMajorVersion(), getDatabaseMinorVersion() APIs (HIVE-3181)
- Added JDBC support for new datatypes: Char (HIVE-5683), Decimal (HIVE-5355) and Varchar (HIVE-5209)
- You can now specify the database for a session in the HiveServer2 connection URL (HIVE-4256)
Encrypted communication between the Hive Server and Clients. This includes SSL encryption for non-Kerberos connections to HiveServer2 (HIVE-5351).
A native Parquet SerDe is now available as part of the CDH 5 Beta 2 package. Users can directly create a Parquet format table without any external package dependency.
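For the connection URL change noted above (HIVE-4256), the session database is named in the path portion of the HiveServer2 JDBC URL. The Python sketch below only assembles such a URL string; the helper name `hiveserver2_url`, the host name, and the database name are hypothetical, and the `ssl=true` suffix is an assumption about how SSL would be requested for your deployment.

```python
def hiveserver2_url(host: str, port: int = 10000,
                    database: str = "default", ssl: bool = False) -> str:
    """Build a HiveServer2 JDBC URL that selects a database for the session."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if ssl:
        # Session settings follow the database name, separated by semicolons.
        url += ";ssl=true"
    return url


# Example: connect directly to the "sales" database on a hypothetical host.
example_url = hiveserver2_url("hs2.example.com", database="sales")
```

Selecting the database in the URL avoids issuing a separate use statement after connecting.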

Changed Behavior:

HIVE-4256 - With Sentry enabled, the use <database> command is now executed as part of the connection to HiveServer2. Hence, a user with no privileges to access a database will not be allowed to connect to HiveServer2.

Hue

Hue has been upgraded to version 3.5.0.
The Impala and Hive Editors are now one-page apps: the Editor, Progress, Table list, and Results are all on the same page.
Result graphing for the Hive and Impala Editors.
Editor and Dashboard for Oozie SLA, crontab and credentials.
The Sqoop2 app supports autocomplete of database and table names/fields.
DBQuery App: MySQL and PostgreSQL Query Editors.
New Search feature: Graphical facets
Integrate external Web applications in any language. See this blog post for more details.
Create Hive tables and load quoted CSV data. Tutorial available here.
Submit any Oozie jobs directly from HDFS. Tutorial available here.
New SAML backend enables single sign-on (SSO) with Hue.

Apache Oozie

Oozie now supports cron-style scheduling capability.
Oozie now supports High Availability with security.

Apache Pig

AvroStorage rewritten for better performance, and moved from piggybank to core Pig
ASSERT, IN, and CASE operators added
ParquetStorage added for integration with Parquet

Cloudera Search

The Cloudera CDK has been renamed and updated to Kite version 0.11.0.

Apache Spark (incubating)

Spark is a fast, general engine for large-scale data processing. For installation and configuration instructions, see Spark Installation.

Apache Sqoop

Sqoop 2 has been upgraded from version 1.99.2 to 1.99.3.