This is the documentation for CDH 5.1.x. Documentation for other versions is available at Cloudera Documentation.

What's New in CDH 5 Beta 1

Oracle JDK 7 Support

  • CDH 5 supports Oracle JDK 1.7 and supports users running applications compiled with JDK 1.7. For CDH 5 Beta 1 the certified version is JDK 1.7.0_25. Cloudera has tested this version across all components.
  • CDH 5 does not support JDK 1.6; you must install JDK 1.7, as instructed here.

Apache Flume

New Features:
  • FLUME-2190 - Includes a new Twitter Source that feeds off the Twitter firehose
  • FLUME-2109 - HTTP Source now supports HTTPS
  • Flume now auto-detects Cloudera Search dependencies.

Apache Hadoop

HDFS

New Features:
  • HDFS-4953: Enable HDFS local reads via mmap.
  • HDFS-2802: Support for RW/RO snapshots in HDFS.
  • HDFS-4750: Support NFSv3 interface to HDFS. See: hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsNfsGateway.apt.vm
  • HDFS-4817: Make HDFS advisory caching configurable on a per-file basis.
  • HDFS-3601: Add BlockPlacementPolicyWithNodeGroup to support block placement with 4-layer network topology.
  • HDFS-5122: Support failover and retry in WebHdfsFileSystem for NN HA.
  • HDFS-4772 / HDFS-5043: Add number of children (of a directory) in HdfsFileStatus.
  • HDFS-4434: Provide a mapping from INodeId to INode. See: /.reserved/.inodes/<INODE_NUMBER>
  • HDFS-2576: Enhances the DistributedFileSystem's Create API so that clients can specify favored DataNodes for a file's blocks.
Changed Features:
  • HDFS-4659: Support setting execution bit for regular files.
    • Impact: In CDH 5, files copied out of HDFS with copyToLocal may now have the executable bit set if it was set when they were created or copied into HDFS.
  • HDFS-4594: WebHDFS open sets Content-Length header to what is specified by length parameter rather than how much data is actually returned.
    • Impact: In CDH 5, Content-Length header will contain the number of bytes actually returned, rather than the request length.
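
The HDFS-4594 impact can be pictured with a small calculation. The sketch below is illustrative only: the function name and parameters are hypothetical, and the clamping rule is an assumption drawn from the impact statement above (the header reflects the bytes actually returned), not from the WebHDFS implementation.

```python
def expected_content_length(file_size, offset=0, length=None):
    """Bytes a ranged read can actually return (illustrative helper, not a WebHDFS API)."""
    if file_size < 0 or offset < 0:
        raise ValueError("file_size and offset must be non-negative")
    if length is not None and length < 0:
        raise ValueError("length must be non-negative")
    # Bytes left between the requested offset and the end of the file.
    remaining = max(file_size - offset, 0)
    # Without a length parameter, everything from the offset onward is returned.
    if length is None:
        return remaining
    # Otherwise the response cannot exceed what is left in the file.
    return min(length, remaining)
```

For example, a request for 50 bytes starting at offset 90 of a 100-byte file can return only 10 bytes, so a client should expect a Content-Length of 10 rather than 50.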
Changed Behavior:
  • HDFS-4645: Move from randomly generated block ID to sequentially generated block ID.
  • HDFS-4451: The HDFS balancer command now returns exit code 0 on success; previously it returned 1.
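
Scripts that wrap the balancer may need to revisit how they interpret its exit status. The following is a minimal sketch, assuming that HDFS-4451 means a successful run now exits with 0 (the JIRA title describes the earlier behavior). The helper name is hypothetical, and the balancer invocation in the comment is only an example of what a caller might pass.

```python
import subprocess
import sys


def run_ok(command):
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command)
    return completed.returncode == 0


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; on a cluster node the
    # caller might pass something like ["hdfs", "balancer"] instead.
    succeeded = run_ok([sys.executable, "-c", "raise SystemExit(0)"])
    print("success" if succeeded else "failure")
```

Wrappers written for earlier releases that treated exit code 1 as success would need the opposite check.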

MapReduce v2 (YARN)

New Features:
  • ResourceManager High Availability: YARN now allows you to use multiple ResourceManagers so that there is no single point of failure. In-flight jobs are recovered without re-running completed tasks.
  • Monitoring and enforcing memory and CPU-based resource utilization using cgroups.
  • Continuous Scheduling: This feature decouples scheduling from the node heartbeats for improved performance in large clusters.
Changed Feature:
  • ResourceManager Restart: Persistent implementations of the RMStateStore (filesystem-based and ZooKeeper-based) allow recovery of in-flight jobs.
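
The idea behind a persistent state store can be illustrated with a toy example. This is not YARN's RMStateStore API: the class, the JSON file layout, and the state names ("RUNNING", "FINISHED") are all hypothetical, chosen only to show how persisted state lets a restarted process find in-flight work.

```python
import json
import os
import tempfile


class FileStateStore:
    """Toy file-backed store; a stand-in for the idea, not YARN's RMStateStore."""

    def __init__(self, path):
        self.path = path

    def load_all(self):
        # An absent file simply means nothing has been recorded yet.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as handle:
            return json.load(handle)

    def save(self, app_id, state):
        apps = self.load_all()
        apps[app_id] = state
        # Write to a temporary file and rename it into place, so a crash
        # never leaves a half-written store behind.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(apps, handle)
        os.replace(tmp_path, self.path)


def recover_in_flight(store):
    """Return the IDs of applications that had not finished before a restart."""
    apps = store.load_all()
    return sorted(app_id for app_id, state in apps.items() if state != "FINISHED")


if __name__ == "__main__":
    store_path = os.path.join(tempfile.mkdtemp(), "rm-state.json")
    FileStateStore(store_path).save("app_1", "RUNNING")
    # A new instance over the same file models a restarted process.
    print(recover_in_flight(FileStateStore(store_path)))
```

A ZooKeeper-based store would follow the same pattern with a different backend.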

Apache HBase

Summary of New Features

  • Support for Hadoop 2.0
  • Improved MTTR (meta first recovery, distributed log replay)
  • Improved compatibility and upgradeability (ProtoBuf serialization format)
  • Namespaces added for administrative domains
  • Snapshots (ported to 0.94 / CDH4.2)
  • Online region merge mechanisms added
  • Major security and functional improvements made for the REST proxy server

Administrative Features

ProtoBuf: All serialization that goes across the wire between servers, and all data written to and read from HBase file formats, has been converted to extensible Protobuf encodings. This breaks compatibility with previous versions but should make future extensions less likely to break compatibility in these areas. This feature is enabled by default.
  • HBASE-5305: Improve cross-version compatibility and upgradeability.
  • HBASE-7898: Serializing cells over RPC.
Namespaces: Namespaces is a new feature that groups tables into different administrative domains. An admin can be given rights to act only upon a particular namespace. This feature is enabled by default and requires file system layout changes that must be completed during upgrade.
MTTR Improvements: Mean time to recovery has greatly improved.
  • HBASE-7590: “Costless” notifications from the master to RegionServers and clients.
  • HBASE-7213 / HBASE-8631: New .meta suffix to separate HLog file / Recover Meta before other regions in case of server crash.
  • HBASE-7006: Distributed log replay (Caveat).
  • HBASE-9116: Adds a view/edit tool for favored node mappings for regions (incomplete, likely a dot version).
Metrics: There are several new metrics and a new naming convention for metrics in HBase. This also includes metrics for each region.
  • HBASE-3614: Per region metrics.
  • HBASE-4050: Rationalize metrics; Update HBase metrics framework to metrics2.
Miscellaneous:
  • HBASE-7403: HBase online region merge.
  • Shell improvements; tables list to be more well-rounded.
  • HBASE-5953: Expose the current state of the balancerSwitch.
  • HBASE-5934: Add the ability for Performance Evaluation to set table compression.
  • HBASE-6135: New Web UI.
  • HBASE-8148: Allow IPC to bind to a specific address (also 0.94.7)
  • HBASE-5498: Secure Bulk Load (also 0.94.5)

Backup and Disaster Recovery Features

Replication: Several critical bug fixes.
  • HBASE-9373: Replication has been hardened.
  • HBASE-9158: Serious bug in cyclic replication.
  • HBASE-8737: Changes to the replication RPC to use cell blocks.
Snapshots: HBase table snapshots were backported to 0.94.x. There are some incompatibilities between the implementation released in CDH 4 and that in CDH 5.
  • HBASE-7290: Online snapshots (backported to 0.94.x).
  • HBASE-8352: Rename snapshots folder from .snapshot to .hbase-snapshots (Incompatible change).
Copy table:
  • HBASE-8609: Add startRow-stopRow options to the CopyTable.
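
As a rough illustration of the new row-range options, the sketch below assembles a CopyTable command line. The helper is hypothetical, and the option spellings (--startrow, --stoprow, --new.name) are assumed from the upstream CopyTable usage text; check them against the tool's own help output before relying on them.

```python
def copytable_command(table, start_row=None, stop_row=None, new_name=None):
    """Build an argv list for the CopyTable job (illustrative helper only)."""
    argv = ["hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable"]
    # Option names are assumptions taken from upstream usage text.
    if start_row is not None:
        argv.append("--startrow=" + start_row)
    if stop_row is not None:
        argv.append("--stoprow=" + stop_row)
    if new_name is not None:
        argv.append("--new.name=" + new_name)
    # The source table name comes last.
    argv.append(table)
    return argv


if __name__ == "__main__":
    print(" ".join(copytable_command("events", "row100", "row200", "events_copy")))
```

The table and row names in the usage line are placeholders.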
Import:

HBase Proxies

The REST server now supports Hadoop authentication and authorization mechanisms. The Avro gateway has been removed. The Thrift2 proxy has made progress but is not complete; it is included as a preview feature.

REST:
Thrift:
  • HBASE-5879: Enable JMX metrics collection for the Thrift proxy.

Thrift2: Ongoing efforts to match Thrift and REST functionality. (Incomplete, only a preview feature)

Avro:

Stability Features

There have been several bug fixes, test fixes and configuration default changes that greatly increase our confidence in the stability of the 0.96.0 release. The main improvement comes from the use of a systematic fault-injection framework.

Performance Features

Several features have been added to improve throughput and performance characteristics of HBase and its clients.
  Warning: Currently, the 0.95.2/CDH 5 Beta 1 release suffers performance degradation compared with CDH 4 when more than 40 nodes are used.

Throughput:
Predictable Performance:
Miscellaneous:
  • HBASE-6870: Improvement to HTable coprocessorExec scan performance.

Developer Features

These features are to aid application developers or for major changes that will enable future minor version improvements.

  • HBASE-9121: HTrace updates.
  • HBASE-8375: Durability setting per table.
    • HBASE-7801: Deferred sync for WAL logs (0.94.7 and later)
  • HBASE-7897: Tags supported in cell interface (for future security features).
  • HBASE-5937: Refactor HLog into interface (allows for new HLogs in 0.96.x).
  • HBASE-4336: Modularization of POM / Multiple jars (many follow-ons, HBASE-7898).
  • HBASE-8224: Publish -hadoop1 and -hadoop2 versioned jars to Maven (CDH published jars are assumed -hadoop2).
  • HBASE-9164: Move towards Cell interface in client instead of KeyValue.
  • HBASE-7898: Serializing cells over RPC.
  • HBASE-7725: Add ability to create custom compaction request.

Hue

New Features:
  • With the Sqoop 2 application, data from databases can be easily exported or imported into HDFS in a scalable manner. The Job Wizard hides the complexity of creating Sqoop jobs and the dashboard offers live progress and log access.
  • ZooKeeper App: Navigate and browse the znode hierarchy and content of a ZooKeeper cluster. Znodes can be added, deleted and edited. Multiple clusters are supported and various statistics are available for them.
  • The Hue Shell application has been removed and replaced by the Pig Editor, HBase Browser and the Sqoop 1 apps.
  • Python 2.6 is required.
  • Beeswax daemon has been replaced by HiveServer2.
  • CDH 5 Hue works only with HiveServer2 from CDH 5. Impersonation is not supported.
Hue also includes the following changed features (Updated to upstream version 3.0.0):
  • [HUE-897] - [core] Redesign of the overall layout
  • [HUE-1521] - [core] Improve JobTracker High Availability
  • [HUE-1493] - [beeswax] Replace the Beeswax server with HiveServer2
  • [HUE-1474] - [core] Upgrade Django backend version from 1.2 to 1.4
  • [HUE-1506] - [search] Impersonation support added
  • [HUE-1475] - [core] Switch back from the Spawning web server
  • [HUE-917] - Support SAML based authentication to enable single sign-on (SSO)
From master:
  • [HUE-950] - [core] Improvements to the document model
  • [HUE-1595] - Integrate Metastore data into Hive and Impala Query UIs
  • [HUE-1275] - [metastore] Show Metastore table details
  • [HUE-1622] - [core] Mini tour added to Hue home page

Apache Hive and HCatalog

New Features (Updated to upstream version 0.11.0):

  • [HIVE-446] - Implement TRUNCATE for table data
  • [HIVE-896] - Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive
  • [HIVE-2693] - Add DECIMAL data type
  • [HIVE-3834] - Support ALTER VIEW AS SELECT in Hive
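
For readers new to the windowing functions added by HIVE-896, the following sketch models the usual SQL meaning of LAG and LEAD over a single, already-ordered partition. It is plain Python for illustration, not Hive code, and the function names and defaults are assumptions.

```python
def lag(values, offset=1, default=None):
    """For each row, the value `offset` rows earlier, or `default` if none exists."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """For each row, the value `offset` rows later, or `default` if none exists."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


if __name__ == "__main__":
    daily_totals = [10, 12, 9]
    previous = lag(daily_totals)
    # Day-over-day change, skipping the first row, which has no predecessor.
    print([cur - prev for cur, prev in zip(daily_totals, previous) if prev is not None])
```

A typical use is comparing each row with its neighbor, as in the day-over-day calculation above.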

Performance improvements (from 0.12):

  • [HIVE-3764] - Support metastore version consistency check
  • [HIVE-305] - Port Hadoop streaming process's counters/status reporters to Hive Transforms
  • [HIVE-1402] - Add parallel ORDER BY to Hive
  • [HIVE-2206] - Add a new optimizer for query correlation discovery and optimization
  • [HIVE-2517] - Support GROUP BY on struct type
  • [HIVE-2655] - Ability to define functions in HQL
  • [HIVE-4911] - Enable QOP configuration for HiveServer2 Thrift transport

Cloudera Impala

Cloudera Impala 1.2.0 is now available as part of CDH 5. For more details on Impala, refer to the Impala Documentation.

The Release Notes for Cloudera Impala are available here.

Llama

Llama is a system that mediates resource management between Cloudera Impala and Hadoop YARN. Llama enables Impala to reserve, use, and release resource allocations in a Hadoop cluster. Llama is only required if resource management is enabled in Impala.

See Llama Installation and Administering Impala for more details on installing and deploying Llama.

Apache Mahout

New Features (Updated to Mahout 0.8):

  • Numerous performance improvements to Vector and Matrix implementations, APIs and their iterators (see also MAHOUT-1192, MAHOUT-1202)
  • Numerous performance improvements to the recommender implementations (see also MAHOUT-1272, MAHOUT-1035, MAHOUT-1042, MAHOUT-1151, MAHOUT-1166, MAHOUT-1167, MAHOUT-1169, MAHOUT-1205, MAHOUT-1264)
  • MAHOUT-1088: Support for biased item-based recommender.
  • MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases.
  • MAHOUT-1106: Support for SVD++
  • MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles as well as an upgrade of the supported Lucene version to Lucene 4.3.
  • MAHOUT-1154 and related: New streaming k-means implementation that offers online (and fast) clustering.
  • MAHOUT-833: Make conversion to SequenceFiles Map-Reduce. 'seqdirectory' can now be run as a MapReduce job.
  • MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values).
  • MAHOUT-884: Matrix concatenate utility; presently only concatenates two matrices.

Apache Oozie

New Features:

  • Updated to Oozie 4.0.0.
  • High Availability: Multiple Oozie servers can now be utilized to provide an HA Oozie service as well as provide horizontal scalability. See upstream documentation for more details.
  • HCatalog Integration: HCatalog table partitions can now be used as data dependencies in coordinators. See upstream documentation for more details.
  • SLA Monitoring: Oozie can now actively monitor SLA-sensitive jobs and send out notifications for SLA meets and misses. SLA information is also now available through a new SLA tab in the Oozie Web UI, JMS messages, and a REST API. See upstream documentation.
  • JMS Notifications: Oozie can now publish notifications to a JMS Provider about job status changes and SLA events. See upstream documentation.
  • The FileSystem action can now use glob patterns for file paths when doing move, delete, chmod, and chgrp.
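
To give a feel for glob selection, the sketch below filters a list of paths with Python's fnmatch module. This only approximates the idea: Oozie's FileSystem action uses its own glob handling, whose rules may differ (for instance, fnmatch lets "*" match "/" as well). The helper name and sample paths are hypothetical.

```python
import fnmatch


def select_paths(paths, pattern):
    """Return the paths matching a shell-style glob pattern (case-sensitive)."""
    return [path for path in paths if fnmatch.fnmatchcase(path, pattern)]


if __name__ == "__main__":
    candidates = [
        "/data/2013-01/part-0",
        "/data/2013-02/part-0",
        "/logs/2013-01/part-0",
    ]
    # Select the same file name under every month directory below /data.
    print(select_paths(candidates, "/data/2013-*/part-0"))
```

A single pattern can stand in for a list of explicit paths in a move, delete, chmod, or chgrp operation.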

Cloudera Search

Cloudera Search 1.0.0 is now available as part of CDH 5. For more details on Search see the Search documentation.

The Cloudera Development Kit (CDK) is a set of libraries and tools that can be used with Search and other CDH components to build jobs/systems on top of the Hadoop ecosystem. See the CDK Documentation and Release Notes for more details.

  Note: An existing dependency, Apache Tika, has been upgraded to version 1.4.

Apache Sentry (incubating)

CDH 5 Beta 1 includes the first upstream release of Apache Sentry, sentry-1.2.0-incubating.

Apache Sqoop

CDH 5 Sqoop 1 has been rebased on Apache Sqoop 1.4.4.

Page generated September 3, 2015.