What's New in CDH 5 Beta 1
Oracle JDK 7 Support
- CDH 5 supports Oracle JDK 1.7 and supports users running applications compiled with JDK 1.7. For CDH 5 Beta 1 the certified version is JDK 1.7.0_25. Cloudera has tested this version across all components.
- CDH 5 does not support JDK 1.6; you must install JDK 1.7, as instructed here.
Summary of New Features
- Support for Hadoop 2.0
- Improved MTTR (meta first recovery, distributed log replay)
- Improved compatibility and upgradeability (ProtoBuf serialization format)
- Namespaces added for administrative domains
- Snapshots (ported to 0.94 / CDH4.2)
- Online region merge mechanisms added
- Major security and functional improvements made for the REST proxy server
- HBASE-8015: Added support for namespaces.
- HBASE-7590: “Costless” notifications from master to rs/clients.
- HBASE-7213 / HBASE-8631: New .meta suffix to separate HLog file / Recover Meta before other regions in case of server crash.
- HBASE-7006: Distributed log replay (Caveat).
- HBASE-9116: Adds a view/edit tool for favored node mappings for regions (incomplete, likely a dot version).
- HBASE-7403: HBase online region merge.
- Shell improvements; the tables list is now more well-rounded.
- HBASE-5953: Expose the current state of the balancerSwitch.
- HBASE-5934: Add the ability for Performance Evaluation to set table compression.
- HBASE-6135: New Web UI.
- HBASE-8148: Allow IPC to bind to a specific address (also 0.94.7)
- HBASE-5498: Secure Bulk Load (also 0.94.5)
Backup and Disaster Recovery Features
- HBASE-8609: Add startRow-stopRow options to the CopyTable.
- HBASE-7702: Add filtering to import jobs.
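As an illustration of what a startRow/stopRow restriction selects, here is a minimal Python sketch. It assumes CopyTable follows the usual HBase scan convention (start row inclusive, stop row exclusive, keys compared as raw bytes). The function name `rows_in_range` and the sample keys are hypothetical and not part of HBase.

```python
from typing import Iterable, List, Optional


def rows_in_range(row_keys: Iterable[bytes],
                  start_row: Optional[bytes] = None,
                  stop_row: Optional[bytes] = None) -> List[bytes]:
    """Return the keys in [start_row, stop_row), sorted in byte order.

    Assumption: start is inclusive and stop is exclusive, as in an HBase scan.
    None means "unbounded" on that side.
    """
    selected = []
    for key in sorted(row_keys):
        if start_row is not None and key < start_row:
            continue  # before the inclusive lower bound
        if stop_row is not None and key >= stop_row:
            continue  # at or past the exclusive upper bound
        selected.append(key)
    return selected


# Hypothetical row keys; note that b"row-100" sorts before b"row-2".
keys = [b"row-001", b"row-010", b"row-020", b"row-100", b"row-2"]
print(rows_in_range(keys, start_row=b"row-010", stop_row=b"row-100"))
```

Because keys are compared byte by byte, zero-padding numeric parts of a row key keeps a range copy aligned with numeric order.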
The REST server now supports Hadoop authentication and authorization mechanisms. The Avro gateway has been removed. The Thrift2 proxy has made progress but is not complete; it is included as a preview feature only.
- HBASE-9347: Support for specifying filter in REST server requests.
- HBASE-7803: Support caching on scan.
- HBASE-7757: Add Web UI for Thrift and REST servers.
- HBASE-5050: SPNEGO-based authentication.
- HBASE-8661: Support REST over HTTPS.
- HBASE-8662: Support for impersonation.
- HBASE-7986: [REST] Make HTablePool size configurable.
- HBASE-5879: Enable JMX metrics collection for the Thrift proxy.
Thrift2: Ongoing efforts to match Thrift and REST functionality. (Incomplete, only a preview feature)
- HBASE-5948: Avro gateway removed.
There have been several bug fixes, test fixes and configuration default changes that greatly increase our confidence in the stability of the 0.96.0 release. The main improvement comes from the use of a systematic fault-injection framework.
Currently, the 0.95.2/CDH 5 Beta 1 release suffers performance degradation compared to CDH 4 when more than 40 nodes are used.
- HBASE-5959: Added a Stochastic LoadBalancer
- HBASE-7842: Exploring compactor.
- HBASE-7236: Add per-table/per-cf configuration via metadata
- HBASE-8163: MemStoreChunkPool: Improvement for Java GC
- HBASE-4391 / HBASE-6567: Mlock / Memory locking improvements (less disk swap).
- HBASE-7404: Bucket cache (untested)
- HBASE-6870: Improvement to HTable coprocessorExec scan performance.
These features are to aid application developers or for major changes that will enable future minor version improvements.
- HBASE-9121: HTrace updates.
- HBASE-8375: Durability setting per table.
- HBASE-7801: Deferred sync for WAL logs (0.94.7 and later)
- HBASE-7897: Tags supported in cell interface (for future security features).
- HBASE-5937: Refactor HLog into interface (allows for new HLogs in 0.96.x).
- HBASE-4336: Modularization of POM / Multiple jars (many follow-ons, HBASE-7898).
- HBASE-8224: Publish -hadoop1 and -hadoop2 versioned jars to Maven (CDH published jars are assumed -hadoop2).
- HBASE-9164: Move towards Cell interface in client instead of KeyValue.
- HBASE-7898: Serializing cells over RPC.
- HBASE-7725: Add ability to create custom compaction request.
- HDFS-4953: Enable HDFS local reads via mmap.
- HDFS-2802: Support for RW/RO snapshots in HDFS.
- HDFS-4750: Support NFSv3 interface to HDFS. See: hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsNfsGateway.apt.vm
- HDFS-4817: Make HDFS advisory caching configurable on a per-file basis.
- HDFS-3601: Add BlockPlacementPolicyWithNodeGroup to support block placement with 4-layer network topology.
- HDFS-5122: Support failover and retry in WebHdfsFileSystem for NN HA.
- HDFS-4772 / HDFS-5043: Add number of children (of a directory) in HdfsFileStatus.
- HDFS-4434: Provide a mapping from INodeId to INode. See: /.reserved/.inodes/<INODE_NUMBER>
- HDFS-2576: Enhances the DistributedFileSystem's Create API so that clients can specify favored DataNodes for a file's blocks.
- HDFS-4659: Support setting execution bit for regular files.
- Impact: In CDH 5, files copied out of HDFS with copyToLocal may now have the executable bit set if it was set when they were created or copied into HDFS.
- HDFS-4594: WebHDFS open sets Content-Length header to what is specified by length parameter rather than how much data is actually returned.
- Impact: In CDH 5, the Content-Length header contains the number of bytes actually returned, rather than the requested length.
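The Content-Length change for HDFS-4594 can be sketched as simple arithmetic. The sketch below assumes the WebHDFS `op=OPEN` request with its optional `offset` and `length` parameters. The helper names, host, port, path, and file size are hypothetical, and the function models the behavior described above rather than reproducing WebHDFS code. An offset past the end of the file is treated as "zero bytes remain" here; a real server may reject such a request instead.

```python
from typing import Optional
from urllib.parse import urlencode


def open_url(host: str, port: int, path: str,
             offset: Optional[int] = None,
             length: Optional[int] = None) -> str:
    """Build a WebHDFS OPEN URL; host, port, and path are placeholders."""
    params = {"op": "OPEN"}
    if offset is not None:
        params["offset"] = offset
    if length is not None:
        params["length"] = length
    return f"http://{host}:{port}/webhdfs/v1{path}?{urlencode(params)}"


def expected_content_length(file_size: int, offset: int = 0,
                            length: Optional[int] = None) -> int:
    """Bytes actually returned: never more than what remains after offset."""
    remaining = max(file_size - offset, 0)
    return remaining if length is None else min(length, remaining)


# Hypothetical 1000-byte file, reading up to 500 bytes from offset 900.
print(open_url("namenode.example.com", 50070, "/user/alice/data.txt", 900, 500))
print(expected_content_length(1000, offset=900, length=500))
```

In this example only 100 bytes remain after the offset, so a client should expect a Content-Length of 100 rather than the 500 it asked for.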
- With the Sqoop 2 application, data from databases can be easily exported or imported into HDFS in a scalable manner. The Job Wizard hides the complexity of creating Sqoop jobs and the dashboard offers live progress and log access.
- ZooKeeper App: Navigate and browse the Znode hierarchy and content of a ZooKeeper cluster. Znodes can be added, deleted and edited. Multiple clusters are supported and various statistics are available for them.
- The Hue Shell application has been removed and replaced by the Pig Editor, HBase Browser and the Sqoop apps.
- Python 2.6 is required.
- Beeswax daemon has been replaced by HiveServer2.
- CDH 5 Hue works only with HiveServer2 from CDH 5. Impersonation is not supported.
- [HUE-897] - [core] Redesign of the overall layout
- [HUE-1521] - [core] Improve JobTracker High Availability
- [HUE-1493] - [beeswax] Replace the Beeswax server with HiveServer2
- [HUE-1474] - [core] Upgrade Django backend version from 1.2 to 1.4
- [HUE-1506] - [search] Impersonation support added
- [HUE-1475] - [core] Switch back from the Spawning web server
- [HUE-917] - Support SAML based authentication to enable single sign-on (SSO)
Apache Hive and HCatalog
New Features (Updated to upstream version 0.11.0):
- [HIVE-446] - Implement TRUNCATE for table data
- [HIVE-896] - Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive
- [HIVE-2693] - Add DECIMAL data type
- [HIVE-3834] - Support ALTER VIEW AS SELECT in Hive
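For readers new to the LEAD/LAG windowing functions listed above, the following Python sketch shows what they compute over a single, already ordered partition. It illustrates the semantics only and is not Hive's implementation. The function names `lag` and `lead` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def lag(values: Sequence[T], offset: int = 1,
        default: Optional[T] = None) -> List[Optional[T]]:
    """For each row, the value `offset` rows earlier, or `default` if none."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values: Sequence[T], offset: int = 1,
         default: Optional[T] = None) -> List[Optional[T]]:
    """For each row, the value `offset` rows later, or `default` if none."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


# Hypothetical daily totals, already sorted by date within one partition.
daily_totals = [10, 12, 9, 15]
print(lag(daily_totals))   # previous day's total for each row
print(lead(daily_totals))  # next day's total for each row
```

A typical use is computing a day-over-day change by subtracting the lagged value from the current one, without a self-join.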
Performance improvements (from 0.12):
- [HIVE-3764] - Support metastore version consistency check
- [HIVE-305] - Port Hadoop streaming process's counters/status reporters to Hive Transforms
- [HIVE-1402] - Add parallel ORDER BY to Hive
- [HIVE-2206] - Add a new optimizer for query correlation discovery and optimization
- [HIVE-2517] - Support GROUP BY on struct type
- [HIVE-2655] - Ability to define functions in HQL
- [HIVE-4911] - Enable QOP configuration for HiveServer2 Thrift transport
Llama is a system that mediates resource management between Cloudera Impala and Hadoop YARN. Llama enables Impala to reserve, use, and release resource allocations in a Hadoop cluster. Llama is only required if resource management is enabled in Impala.
New Features (Updated to Mahout 0.8):
- Numerous performance improvements to Vector and Matrix implementations, APIs and their iterators (see also MAHOUT-1192, MAHOUT-1202)
- Numerous performance improvements to the recommender implementations (see also MAHOUT-1272, MAHOUT-1035, MAHOUT-1042, MAHOUT-1151, MAHOUT-1166, MAHOUT-1167, MAHOUT-1169, MAHOUT-1205, MAHOUT-1264)
- MAHOUT-1088: Support for biased item-based recommender.
- MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases.
- MAHOUT-1106: Support for SVD++
- MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles as well as an upgrade of the supported Lucene version to Lucene 4.3.
- MAHOUT-1154 and related: New streaming k-means implementation that offers online (and fast) clustering.
- MAHOUT-833: Make conversion to SequenceFiles Map-Reduce. 'seqdirectory' can now be run as a MapReduce job.
- MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values).
- MAHOUT-884: Matrix concatenate utility; presently only concatenates two matrices.
Apache MapReduce 2.0 (YARN)
- ResourceManager High Availability: YARN now allows you to use multiple ResourceManagers so that there is no single point of failure. In-flight jobs are recovered without re-running completed tasks.
- Monitoring and enforcing memory and CPU-based resource utilization using cgroups.
- Continuous Scheduling: This feature decouples scheduling from the node heartbeats for improved performance in large clusters.
- ResourceManager Restart: Persistent implementations of the RMStateStore (filesystem-based and ZooKeeper-based) allow recovery of in-flight jobs.
- Updated to Oozie 4.0.0.
- High Availability: Multiple Oozie servers can now be utilized to provide an HA Oozie service as well as provide horizontal scalability. See upstream documentation for more details.
- HCatalog Integration: HCatalog table partitions can now be used as data dependencies in coordinators. See upstream documentation for more details.
- SLA Monitoring: Oozie can now actively monitor SLA-sensitive jobs and send out notifications for SLA meets and misses. SLA information is also now available through a new SLA tab in the Oozie Web UI, JMS messages, and a REST API. See upstream documentation.
- JMS Notifications: Oozie can now publish notifications to a JMS Provider about job status changes and SLA events. See upstream documentation.
- The FileSystem action can now use glob patterns for file paths when doing move, delete, chmod, and chgrp.
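To show what a glob pattern in a file path selects, here is a rough Python sketch. It uses Python's `fnmatch` module as a stand-in, which only approximates Hadoop's glob syntax (for example, `{a,b}` alternation is not handled), and it matches on the final path component only. The function name `expand_glob` and the directory listing are hypothetical.

```python
import posixpath
from fnmatch import fnmatchcase
from typing import Iterable, List


def expand_glob(paths: Iterable[str], pattern: str) -> List[str]:
    """Return paths in the pattern's directory whose name matches the glob.

    Only the last path component may contain wildcards in this sketch.
    """
    pattern_dir, pattern_name = posixpath.split(pattern)
    matches = []
    for path in paths:
        path_dir, name = posixpath.split(path)
        if path_dir == pattern_dir and fnmatchcase(name, pattern_name):
            matches.append(path)
    return sorted(matches)


# Hypothetical directory listing.
listing = [
    "/user/etl/staging/part-00000",
    "/user/etl/staging/part-00001",
    "/user/etl/staging/_SUCCESS",
    "/user/etl/archive/part-00000",
]
print(expand_glob(listing, "/user/etl/staging/part-*"))
```

With a pattern like this, one delete or move element can cover every part file in a directory while leaving marker files such as `_SUCCESS` untouched.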
Cloudera Search 1.0.0 is now available as part of CDH 5. For more details on Search see the Search documentation.
The Cloudera Development Kit (CDK) is a set of libraries and tools that can be used with Search and other CDH components to build jobs/systems on top of the Hadoop ecosystem. See the CDK Documentation and Release Notes for more details.
Apache Sentry (incubating)
CDH 5 Beta 1 includes the first upstream release of Apache Sentry, sentry-1.2.0-incubating.
CDH 5 Sqoop has been rebased on Apache Sqoop 1.4.4.