What's New In CDH 5 Beta Releases

What's New in CDH 5 Beta 1

Oracle JDK 7 Support

  • CDH 5 supports Oracle JDK 1.7 and supports users running applications compiled with JDK 1.7. For CDH 5 Beta 1 the certified version is JDK 1.7.0_25. Cloudera has tested this version across all components.
  • CDH 5 does not support JDK 1.6; you must install JDK 1.7, as instructed here.

Apache Flume

New Features:
  • FLUME-2190 - Includes a new Twitter Source that feeds off the Twitter firehose
  • FLUME-2109 - HTTP Source now supports HTTPS
  • Flume now auto-detects Cloudera Search dependencies.

Apache Hadoop

HDFS

New Features:
  • HDFS-4953: Enable HDFS local reads via mmap.
  • HDFS-2802: Support for RW/RO snapshots in HDFS.
  • HDFS-4750: Support NFSv3 interface to HDFS. See: hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsNfsGateway.apt.vm
  • HDFS-4817: Make HDFS advisory caching configurable on a per-file basis.
  • HDFS-3601: Add BlockPlacementPolicyWithNodeGroup to support block placement with 4-layer network topology.
  • HDFS-5122: Support failover and retry in WebHdfsFileSystem for NN HA.
  • HDFS-4772 / HDFS-5043: Add number of children (of a directory) in HdfsFileStatus.
  • HDFS-4434: Provide a mapping from INodeId to INode. See: /.reserved/.inodes/<INODE_NUMBER>
  • HDFS-2576: Enhances the DistributedFileSystem's Create API so that clients can specify favored DataNodes for a file's blocks.
Changed Features:
  • HDFS-4659: Support setting execution bit for regular files.
    • Impact: In CDH 5, files copied out of HDFS with copyToLocal may now have the executable bit set if it was set when they were created or copied into HDFS.
  • HDFS-4594: WebHDFS open sets Content-Length header to what is specified by length parameter rather than how much data is actually returned.
    • Impact: In CDH 5, Content-Length header will contain the number of bytes actually returned, rather than the request length.
Changed Behavior:
  • HDFS-4645: Move from randomly generated block ID to sequentially generated block ID.
  • HDFS-4451: The HDFS balancer command returned exit code 1 on success instead of 0; it now returns 0 on success.
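
The HDFS-4594 change listed under Changed Features can be illustrated with a small sketch. It builds a WebHDFS OPEN request URL and computes the Content-Length a client might expect when the requested length runs past the end of the file. The host name, port 50070, and file path are placeholders, and both helper functions are hypothetical names introduced only for this illustration; the clipping rule is an assumption based on the impact statement above, not a copy of the server logic.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(host, port, path, offset=0, length=None):
    """Build a WebHDFS OPEN URL; offset and length are optional query parameters."""
    params = {"op": "OPEN"}
    if offset:
        params["offset"] = offset
    if length is not None:
        params["length"] = length
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


def expected_content_length(file_size, offset=0, length=None):
    """Bytes actually returned: the requested range is clipped at end of file."""
    remaining = max(file_size - offset, 0)
    return remaining if length is None else min(length, remaining)


# Placeholder host, port, and path (assumptions for illustration only).
url = webhdfs_open_url("namenode.example.com", 50070,
                       "/user/alice/data.txt", offset=100, length=1000)
print(url)

# A 150-byte file read from offset 100 with length=1000 has only 50 bytes left,
# so under this sketch the header would report 50 rather than the requested 1000.
print(expected_content_length(150, offset=100, length=1000))
```

Clients that previously trusted Content-Length to equal the requested length may need to read until the stream ends instead.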

MapReduce v2 (YARN)

New Features:
  • ResourceManager High Availability: YARN now allows you to use multiple ResourceManagers so that there is no single point of failure. In-flight jobs are recovered without re-running completed tasks.
  • Monitoring and enforcing memory and CPU-based resource utilization using cgroups.
  • Continuous Scheduling: This feature decouples scheduling from the node heartbeats for improved performance in large clusters.
Changed Feature:
  • ResourceManager Restart: Persistent implementations of the RMStateStore (filesystem-based and ZooKeeper-based) allow recovery of in-flight jobs.

Apache HBase

Summary of New Features

  • Support for Hadoop 2.0
  • Improved MTTR (meta first recovery, distributed log replay)
  • Improved compatibility and upgradeability (ProtoBuf serialization format)
  • Namespaces added for administrative domains
  • Snapshots (ported to 0.94 / CDH4.2)
  • Online region merge mechanisms added
  • Major security and functional improvements made for the REST proxy server

Administrative Features

ProtoBuf: All serialization that goes across the wire between servers, and all data written to and read from HBase file formats, has been converted to extensible Protobuf encodings. This breaks compatibility with previous versions but should make future extensions less likely to break compatibility in these areas. This feature is enabled by default.
  • HBASE-5305: Improve cross-version compatibility and upgradeability.
  • HBASE-7898: Serializing cells over RPC.
Namespaces: Namespaces is a new feature that groups tables into different administrative domains. An administrator can be given rights to act only upon a particular namespace. This feature is enabled by default and requires file system layout changes that must be completed during upgrade.
MTTR Improvements: Mean time to recovery has greatly improved.
  • HBASE-7590: “Costless” notifications from master to rs/clients.
  • HBASE-7213 / HBASE-8631: New .meta suffix to separate HLog file / Recover Meta before other regions in case of server crash.
  • HBASE-7006: Distributed log replay (Caveat).
  • HBASE-9116: Adds a view/edit tool for favored node mappings for regions (incomplete, likely a dot version).
Metrics: There are several new metrics and a new naming convention for metrics in HBase. This also includes metrics for each region.
  • HBASE-3614: Per region metrics.
  • HBASE-4050: Rationalize metrics; Update HBase metrics framework to metrics2.
Miscellaneous:
  • HBASE-7403: HBase online region merge.
  • Shell improvements; tables list to be more well-rounded.
  • HBASE-5953: Expose the current state of the balancerSwitch.
  • HBASE-5934: Add the ability for Performance Evaluation to set table compression.
  • HBASE-6135: New Web UI.
  • HBASE-8148: Allow IPC to bind to a specific address (also 0.94.7)
  • HBASE-5498: Secure Bulk Load (also 0.94.5)

Backup and Disaster Recovery Features

Replication: Several critical bug fixes.
  • HBASE-9373: Replication has been hardened.
  • HBASE-9158: Serious bug in cyclic replication.
  • HBASE-8737: Changes to the replication RPC to use cell blocks.
Snapshots: HBase table snapshots were backported to 0.94.x. There are some incompatibilities between the implementation released in CDH 4 and that in CDH 5.
  • HBASE-7290: Online snapshots (backported to 0.94.x).
  • HBASE-8352: Rename snapshots folder from .snapshot to .hbase-snapshots (Incompatible change).
Copy table:
  • HBASE-8609: Add startRow-stopRow options to the CopyTable.
Import:

HBase Proxies

The REST server now supports Hadoop authentication and authorization mechanisms. The Avro gateway has been removed. The Thrift2 proxy has made progress but is not complete; it is included as a preview feature only.

REST:
Thrift:
  • HBASE-5879: Enable JMX metrics collection for the Thrift proxy.

Thrift2: Ongoing efforts to match Thrift and REST functionality. (Incomplete, only a preview feature)

Avro:

Stability Features

There have been several bug fixes, test fixes and configuration default changes that greatly increase our confidence in the stability of the 0.96.0 release. The main improvement comes from the use of a systematic fault-injection framework.

Performance Features

Several features have been added to improve throughput and performance characteristics of HBase and its clients.
Throughput:
Predictable Performance:
Miscellaneous:
  • HBASE-6870: Improvement to HTable coprocessorExec scan performance.

Developer Features

These features aid application developers or introduce major changes that will enable future minor-version improvements.

  • HBASE-9121: HTrace updates.
  • HBASE-8375: Durability setting per table.
    • HBASE-7801: Deferred sync for WAL logs (0.94.7 and later)
  • HBASE-7897: Tags supported in cell interface (for future security features).
  • HBASE-5937: Refactor HLog into interface (allows for new HLogs in 0.96.x).
  • HBASE-4336: Modularization of POM / Multiple jars (many follow-ons, HBASE-7898).
  • HBASE-8224: Publish -hadoop1 and -hadoop2 versioned jars to Maven (CDH published jars are assumed -hadoop2).
  • HBASE-9164: Move towards Cell interface in client instead of KeyValue.
  • HBASE-7898: Serializing cells over RPC.
  • HBASE-7725: Add ability to create custom compaction request.

Hue

New Features:
  • With the Sqoop 2 application, data from databases can be easily exported or imported into HDFS in a scalable manner. The Job Wizard hides the complexity of creating Sqoop jobs and the dashboard offers live progress and log access.
  • ZooKeeper App: Navigate and browse the Znode hierarchy and content of a ZooKeeper cluster. Znodes can be added, deleted and edited. Multiple clusters are supported and various statistics are available for them.
  • The Hue Shell application has been removed and replaced by the Pig Editor, HBase Browser and the Sqoop 1 apps.
  • Python 2.6 is required.
  • Beeswax daemon has been replaced by HiveServer2.
  • CDH 5 Hue works only with HiveServer2 from CDH 5. Impersonation is not supported.
Hue also includes the following changed features (Updated to upstream version 3.0.0):
  • [HUE-897] - [core] Redesign of the overall layout
  • [HUE-1521] - [core] Improve JobTracker High Availability
  • [HUE-1493] - [beeswax] Replace the Beeswax server with HiveServer2
  • [HUE-1474] - [core] Upgrade Django backend version from 1.2 to 1.4
  • [HUE-1506] - [search] Impersonation support added
  • [HUE-1475] - [core] Switch back from the Spawning web server
  • [HUE-917] - Support SAML based authentication to enable single sign-on (SSO)
From master:
  • [HUE-950] - [core] Improvements to the document model
  • [HUE-1595] - Integrate Metastore data into Hive and Impala Query UIs
  • [HUE-1275] - [metastore] Show Metastore table details
  • [HUE-1622] - [core] Mini tour added to Hue home page

Apache Hive and HCatalog

New Features (Updated to upstream version 0.11.0):

  • [HIVE-446] - Implement TRUNCATE for table data
  • [HIVE-896] - Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive
  • [HIVE-2693] - Add DECIMAL data type
  • [HIVE-3834] - Support ALTER VIEW AS SELECT in Hive

Performance improvements (from 0.12):

  • [HIVE-3764] - Support metastore version consistency check
  • [HIVE-305] - Port Hadoop streaming process's counters/status reporters to Hive Transforms
  • [HIVE-1402] - Add parallel ORDER BY to Hive
  • [HIVE-2206] - Add a new optimizer for query correlation discovery and optimization
  • [HIVE-2517] - Support GROUP BY on struct type
  • [HIVE-2655] - Ability to define functions in HQL
  • [HIVE-4911] - Enable QOP configuration for HiveServer2 Thrift transport

Cloudera Impala

Cloudera Impala 1.2.0 is now available as part of CDH 5. For more details on Impala, refer to the Impala documentation.

Llama

Llama is a system that mediates resource management between Cloudera Impala and Hadoop YARN. Llama enables Impala to reserve, use, and release resource allocations in a Hadoop cluster. Llama is only required if resource management is enabled in Impala.

See Managing the Impala Llama ApplicationMaster for more information.

Apache Mahout

New Features (Updated to Mahout 0.8):

  • Numerous performance improvements to Vector and Matrix implementations, APIs and their iterators (see also MAHOUT-1192, MAHOUT-1202)
  • Numerous performance improvements to the recommender implementations (see also MAHOUT-1272, MAHOUT-1035, MAHOUT-1042, MAHOUT-1151, MAHOUT-1166, MAHOUT-1167, MAHOUT-1169, MAHOUT-1205, MAHOUT-1264)
  • MAHOUT-1088: Support for biased item-based recommender.
  • MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases.
  • MAHOUT-1106: Support for SVD++
  • MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles as well as an upgrade of the supported Lucene version to Lucene 4.3.
  • MAHOUT-1154 and related: New streaming k-means implementation that offers online (and fast) clustering.
  • MAHOUT-833: Make conversion to SequenceFiles Map-Reduce. 'seqdirectory' can now be run as a MapReduce job.
  • MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values).
  • MAHOUT-884: Matrix concatenate utility; presently only concatenates two matrices.

Apache Oozie

New Features:

  • Updated to Oozie 4.0.0.
  • High Availability: Multiple Oozie servers can now be utilized to provide an HA Oozie service as well as provide horizontal scalability. See upstream documentation for more details.
  • HCatalog Integration: HCatalog table partitions can now be used as data dependencies in coordinators. See upstream documentation for more details.
  • SLA Monitoring: Oozie can now actively monitor SLA-sensitive jobs and send out notifications for SLA meets and misses. SLA information is also now available through a new SLA tab in the Oozie Web UI, JMS messages, and a REST API. See upstream documentation.
  • JMS Notifications: Oozie can now publish notifications to a JMS Provider about job status changes and SLA events. See upstream documentation.
  • The FileSystem action can now use glob patterns for file paths when doing move, delete, chmod, and chgrp.

Cloudera Search

Cloudera Search 1.0.0 is now available as part of CDH 5. For more details on Search see the Search documentation.

The Cloudera Development Kit (CDK) is a set of libraries and tools that can be used with Search and other CDH components to build jobs/systems on top of the Hadoop ecosystem. See the CDK Documentation and Release Notes for more details.

Apache Sentry (incubating)

CDH 5 Beta 1 includes the first upstream release of Apache Sentry, sentry-1.2.0-incubating.

Apache Sqoop

CDH 5 Sqoop 1 has been rebased on Apache Sqoop 1.4.4.

What's New in CDH 5 Beta 2

This is a beta release which previews new features, changes, and fixed issues. See also Issues Fixed in CDH 5 Beta 2.

New Features and Changes in CDH 5 Beta 2

CDH 5 Beta 2 introduces the following new features and changes, organized by component.

Apache Crunch

The Apache Crunch™ project develops and supports Java APIs that simplify the process of creating data pipelines on top of Apache Hadoop. The Crunch APIs are modeled after FlumeJava (PDF), which is the library that Google uses for building data pipelines on top of their own implementation of MapReduce. For more information and installation instructions, see the Crunch installation page (cdh_ig_crunch_installation.html).

Apache DataFu

  • Upgraded from version 0.4 to 1.1.0 (this upgrade is not backward compatible).
  • New features include the UDFs SHA, SimpleRandomSample, COALESCE, ReservoirSample, EmptyBagToNullFields, and many others.

Apache Flume

  • FLUME-2294 - Added a new sink to write Kite datasets.
  • FLUME-2056 - Spooling Directory Source can now pass only the name of the file in the event headers.
  • FLUME-2155 - File Channel is indexed during replay to improve replay performance for faster startup.
  • FLUME-2217 - Syslog Sources can optionally preserve all syslog headers in the message body.
  • FLUME-2052 - Spooling Directory Source can now replace or ignore malformed characters in input files.

Apache Hadoop

HDFS

New Features/Improvements:

  • As of CDH 5 Beta 2, you can upgrade HDFS with high availability (HA) enabled, if you are using Quorum-based storage. (Quorum-based storage is the only method available in CDH 5; NFS shared storage is not supported.) For upgrade instructions, see Upgrading from CDH 4 to CDH 5.
  • HDFS-4949 - CDH 5 Beta 2 supports HDFS caching. For more information, see Configuring Centralized Cache Management in HDFS.
  • As of CDH 5 Beta 2, you can configure an NFSv3 gateway that allows any NFSv3-compatible client to mount HDFS as a file system on the client's local file system. For more information and instructions, see Configuring an NFSv3 Gateway Using the Command Line.
  • HDFS-5709 - Improve upgrade with existing files and directories named .snapshot.
Major Bug Fixes:
  • HDFS-5449 - Fix WebHDFS compatibility break.
  • HDFS-5671 - Fix socket leak in DFSInputStream#getBlockReader.
  • HDFS-5353 - Short circuit reads fail when dfs.encrypt.data.transfer is enabled.
  • HDFS-5438 - Flaws in block report processing can cause data loss.
Changed Behavior:
  • As of CDH 5 Beta 2, the dfs.web.authentication.kerberos.principal property must be defined in hdfs-site.xml for the NameNode to start up on a secure cluster. This has been documented in the CDH 5 Security Guide. For clusters managed by Cloudera Manager, you do not need to explicitly define this property.
  • HDFS-5037 - Active NameNode should trigger its own edit log rolls.
  • Clients will now retry for a configurable period when encountering a NameNode in Safe Mode.
  • The default behavior of the mkdir command has changed. As of CDH 5 Beta 2, if the parent folder does not exist, you must explicitly specify the -p switch; otherwise the command fails.
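
The mkdir change above mainly affects scripts that create nested directories. As a rough sketch, a wrapper script might assemble the command line as follows; the helper name hdfs_mkdir_command and the example path are hypothetical, and the sketch only builds the argument list rather than running it, so it works on a machine without Hadoop installed.

```python
import shlex


def hdfs_mkdir_command(path, create_parents=True):
    """Return the argv for 'hadoop fs -mkdir', adding -p when parents may be missing."""
    argv = ["hadoop", "fs", "-mkdir"]
    if create_parents:
        # Without -p, the command fails in CDH 5 Beta 2 if the parent folder does not exist.
        argv.append("-p")
    argv.append(path)
    return argv


# Example path is a placeholder for illustration.
command = hdfs_mkdir_command("/user/alice/reports/2014")
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be passed to subprocess.run on a host where the hadoop client is available.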
MapReduce (MRv1 and YARN)
  • Fair Scheduler (in YARN and MRv1) now supports advanced configuration to automatically place applications in queues.
  • MapReduce now supports running multiple reducers in uber mode and in local job runner.

Apache HBase

  • Online Schema Change is now a supported feature.
  • Online Region Merge is now a supported feature.
  • Namespaces: CDH 5 Beta 2 includes the namespaces feature which enables different sets of tables to be administered by different administrative users. All upgraded tables will live in the default "hbase" namespace. Administrators may create new namespaces and create tables; users with rights to the namespace may administer permissions on the tables within the namespace.
  • There have been several improvements to HBase’s mean time to recovery (MTTR) in the face of Master or RegionServer failures.

    • Distributed log splitting has matured, and is always activated. The option to use the old slower splitting mechanism no longer exists.
    • Failure detection time has been improved. New notifications are now sent when RegionServers or Masters fail, which triggers corrective action quickly.
    • The Meta table has a dedicated write ahead log which enables faster region recovery if the RegionServer serving meta goes down.
  • The Region Balancer has been significantly updated to take more load attributes into account.
  • Added TableSnapshotInputFormat and TableSnapshotScanner to perform scans over HBase table snapshots from the client side, bypassing the HBase servers. The former configures a MapReduce job, while the latter does a single client-side scan over snapshot files. Can also be used with offline HBase with in-place or exported snapshot files.
  • The KeyValue API has been deprecated for applications in favor of the Cell interface. Users upgrading to HBase 0.96 may still use KeyValue, but future upgrades may remove the class or parts of its functionality. Users are encouraged to update their applications to use the new Cell interface.
  • Currently Experimental features:
    • Distributed log replay: This mechanism allows for faster recovery from RegionServer failures but has one special case where it will violate ACID guarantees. Cloudera does not currently recommend activating this feature.
    • Bucket cache: This is an offheap caching mechanism that uses extra RAM and block devices (such as flash drives) to greatly increase the read caching capabilities provided by the BlockCache. Cloudera does not currently recommend activating this feature.
    • Favored nodes: This feature enables HBase to better control where its data is written to in HDFS in order to better preserve performance after a failure. This is disabled currently because it doesn’t interact well with the HBase Balancer or HDFS Balancer. Cloudera does not currently recommend activating this feature.
See this blog post for more details.

Apache Hive

New Features:

  • Improved JDBC specification coverage:
    • Improvements to getDatabaseMajorVersion(), getDatabaseMinorVersion() APIs (HIVE-3181)
    • Added JDBC support for new datatypes: Char (HIVE-5683), Decimal (HIVE-5355) and Varchar (HIVE-5209)
    • You can now specify the database for a session in the HiveServer2 connection URL (HIVE-4256)
  • Encrypted communication between the Hive Server and Clients. This includes TLS/SSL encryption for non-Kerberos connections to HiveServer2 (HIVE-5351).
  • A native Parquet SerDe is now available as part of the CDH 5 Beta 2 package. Users can directly create a Parquet format table without any external package dependency.
Changed Behavior:
  • HIVE-4256 - With Sentry enabled, the use <database> command is now executed as part of the connection to HiveServer2. Hence, a user with no privileges to access a database will not be allowed to connect to HiveServer2.
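
Because the database can now be named in the HiveServer2 connection URL (HIVE-4256), it may help to see what such a URL looks like. The sketch below only formats the string; the helper name, the host, and the database name are hypothetical, and port 10000 is assumed here as the usual HiveServer2 port.

```python
def hiveserver2_jdbc_url(host, port=10000, database="default"):
    """Format a HiveServer2 JDBC URL that selects a database for the session."""
    return f"jdbc:hive2://{host}:{port}/{database}"


# Placeholder host and database name (assumptions for illustration only).
print(hiveserver2_jdbc_url("hs2.example.com", database="sales"))
```

With Sentry enabled, a user who has no privileges on the named database would be refused at connection time rather than at the first query.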

Hue

  • Hue has been upgraded to version 3.5.0.
  • Impala and Hive Editor are now one-page apps. The Editor, Progress, Table list and Results are all on the same page.
  • Result graphing for the Hive and Impala Editors.
  • Editor and Dashboard for Oozie SLA, crontab and credentials.
  • The Sqoop2 app supports autocomplete of database and table names/fields.
  • DBQuery App: MySQL and PostgreSQL Query Editors.
  • New Search feature: Graphical facets
  • Integrate external Web applications in any language. See this blog post for more details.
  • Create Hive tables and load quoted CSV data. Tutorial available here.
  • Submit any Oozie jobs directly from HDFS. Tutorial available here
  • New SAML backend enables single sign-on (SSO) with Hue.

Apache Oozie

  • Oozie now supports cron-style scheduling capability.
  • Oozie now supports High Availability with security.

Apache Pig

  • AvroStorage rewritten for better performance, and moved from piggybank to core Pig
  • ASSERT, IN, and CASE operators added
  • ParquetStorage added for integration with Parquet

Cloudera Search

Apache Spark (incubating)

Spark is a fast, general engine for large-scale data processing. For installation and configuration instructions, see Setting Up Apache Spark Using the Command Line.

Apache Sqoop

Sqoop 2 has been upgraded from version 1.99.2 to 1.99.3.