What's New In CDH 5 Beta Releases
What's New in CDH 5 Beta 1
Oracle JDK 7 Support
- CDH 5 supports Oracle JDK 1.7 and supports users running applications compiled with JDK 1.7. For CDH 5 Beta 1 the certified version is JDK 1.7.0_25. Cloudera has tested this version across all components.
- CDH 5 does not support JDK 1.6; you must install JDK 1.7, as instructed here.
- HDFS-4953: Enable HDFS local reads via mmap.
- HDFS-2802: Support for RW/RO snapshots in HDFS.
- HDFS-4750: Support NFSv3 interface to HDFS. See: hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsNfsGateway.apt.vm
- HDFS-4817: Make HDFS advisory caching configurable on a per-file basis.
- HDFS-3601: Add BlockPlacementPolicyWithNodeGroup to support block placement with 4-layer network topology.
- HDFS-5122: Support failover and retry in WebHdfsFileSystem for NN HA.
- HDFS-4772 / HDFS-5043: Add number of children (of a directory) in HdfsFileStatus.
- HDFS-4434: Provide a mapping from INodeId to INode. See: /.reserved/.inodes/<INODE_NUMBER>
- HDFS-2576: Enhances the DistributedFileSystem's Create API so that clients can specify favored DataNodes for a file's blocks.
- HDFS-4659: Support setting execution bit for regular files.
- Impact: In CDH 5, files copied out of HDFS with copyToLocal may now have the executable bit set if it was set when they were created or copied into HDFS.
- HDFS-4594: WebHDFS open sets Content-Length header to what is specified by length parameter rather than how much data is actually returned.
- Impact: In CDH 5, the Content-Length header contains the number of bytes actually returned, rather than the requested length.
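The Content-Length change can be illustrated with a small calculation. This is a sketch, not WebHDFS code: the helper name `expected_content_length` and its parameters are hypothetical, and it assumes a ranged read returns at most the bytes remaining in the file after the requested offset.

```python
def expected_content_length(file_size, offset=0, length=None):
    """Return the number of bytes a ranged read would actually deliver.

    Hypothetical helper for illustration only. WebHDFS OPEN accepts
    `offset` and `length` query parameters; this function models the
    CDH 5 behavior described above, where Content-Length reflects the
    bytes returned rather than the requested length.
    """
    if offset < 0 or offset > file_size:
        raise ValueError("offset must lie within the file")
    if length is not None and length < 0:
        raise ValueError("length must be non-negative")
    remaining = file_size - offset
    if length is None:
        return remaining
    # A request may ask for more bytes than remain in the file.
    return min(length, remaining)


# A 100-byte file read from offset 90 with length=50 yields 10 bytes.
short_read = expected_content_length(100, offset=90, length=50)
full_read = expected_content_length(100, offset=0, length=50)
```

A client that previously relied on Content-Length matching the requested length would, under this model, see the smaller value for reads that run past the end of the file.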
MapReduce v2 (YARN)
- ResourceManager High Availability: YARN now allows you to use multiple ResourceManagers so that there is no single point of failure. In-flight jobs are recovered without re-running completed tasks.
- Monitoring and enforcing memory and CPU-based resource utilization using cgroups.
- Continuous Scheduling: This feature decouples scheduling from the node heartbeats for improved performance in large clusters.
- ResourceManager Restart: Persistent implementations of the RMStateStore (filesystem-based and ZooKeeper-based) allow recovery of in-flight jobs.
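As a rough illustration of how ResourceManager HA and restart might be switched on, the sketch below renders a `yarn-site.xml` fragment. The property names are taken from upstream Apache Hadoop 2.x documentation and are an assumption here; they may differ in CDH 5 Beta 1, and the ResourceManager IDs `rm1` and `rm2` are placeholders. The helper `render_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def render_site_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed property names (upstream Hadoop 2.x); verify against your release.
RM_HA_PROPS = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",  # placeholder IDs
    "yarn.resourcemanager.recovery.enabled": "true",
    # ZooKeeper-based RMStateStore; a filesystem-based store also exists.
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
}

xml_text = render_site_xml(RM_HA_PROPS)
print(xml_text)
```

A real deployment needs further settings (for example, per-ResourceManager addresses and the ZooKeeper quorum), which are omitted here.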
Summary of New Features
- Support for Hadoop 2.0
- Improved MTTR (meta first recovery, distributed log replay)
- Improved compatibility and upgradeability (ProtoBuf serialization format)
- Namespaces added for administrative domains
- Snapshots (ported to 0.94 / CDH4.2)
- Online region merge mechanisms added
- Major security and functional improvements made for the REST proxy server
- HBASE-8015: Added support for namespaces.
- HBASE-7590: “Costless” notifications from master to rs/clients.
- HBASE-7213 / HBASE-8631: New .meta suffix to separate HLog file / Recover Meta before other regions in case of server crash.
- HBASE-7006: Distributed log replay (Caveat).
- HBASE-9116: Adds a view/edit tool for favored node mappings for regions (incomplete, likely a dot version).
- HBASE-7403: HBase online region merge.
- Shell improvements; tables list to be more well-rounded.
- HBASE-5953: Expose the current state of the balancerSwitch.
- HBASE-5934: Add the ability for Performance Evaluation to set table compression.
- HBASE-6135: New Web UI.
- HBASE-8148: Allow IPC to bind to a specific address (also 0.94.7)
- HBASE-5498: Secure Bulk Load (also 0.94.5)
Backup and Disaster Recovery Features
- HBASE-8609: Add startRow-stopRow options to the CopyTable.
- HBASE-7702: Add filtering to import jobs.
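The row-range options for CopyTable could be used along these lines. This sketch only assembles a command line; it does not run HBase. The option spellings (`--startrow`, `--stoprow`, `--new.name`) follow the upstream CopyTable usage text as best recalled and should be checked against the tool's own help output. The helper `copytable_command` and the table names are hypothetical.

```python
import shlex


def copytable_command(table, start_row=None, stop_row=None, new_name=None):
    """Build an argv list for a CopyTable run limited to a row range."""
    args = ["hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable"]
    if start_row is not None:
        args.append("--startrow=" + start_row)
    if stop_row is not None:
        args.append("--stoprow=" + stop_row)
    if new_name is not None:
        args.append("--new.name=" + new_name)
    args.append(table)  # the source table name comes last
    return args


argv = copytable_command("orders", start_row="row100", stop_row="row200",
                         new_name="orders_backup")
# Quote each argument so the printed line is safe to paste into a shell.
print(" ".join(shlex.quote(a) for a in argv))
```

Copying only a row range keeps a backup job proportional to the slice of data that changed, rather than to the whole table.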
The REST server now supports Hadoop authentication and authorization mechanisms. The Avro gateway has been removed. The Thrift2 proxy has made progress but is not complete; it is included as a preview feature only.
- HBASE-9347: Support for specifying filter in REST server requests.
- HBASE-7803: Support caching on scan.
- HBASE-7757: Add Web UI for Thrift and REST servers.
- HBASE-5050: SPNEGO-based authentication.
- HBASE-8661: Support REST over HTTPS.
- HBASE-8662: Support for impersonation.
- HBASE-7986: [REST] Make HTablePool size configurable.
- HBASE-5879: Enable JMX metrics collection for the Thrift proxy.
Thrift2: Ongoing efforts to match Thrift and REST functionality. (Incomplete, only a preview feature)
- HBASE-5948: Avro gateway removed.
There have been several bug fixes, test fixes and configuration default changes that greatly increase our confidence in the stability of the 0.96.0 release. The main improvement comes from the use of a systematic fault-injection framework.
- HBASE-5959: Added a Stochastic LoadBalancer
- HBASE-7842: Exploring compactor.
- HBASE-7236: Add per-table/per-cf configuration via metadata
- HBASE-8163: MemStoreChunkPool: Improvement for Java GC
- HBASE-4391 / HBASE-6567: Mlock / Memory locking improvements (less disk swap).
- HBASE-7404: Bucket cache (untested)
- HBASE-6870: Improvement to HTable coprocessorExec scan performance.
These features either aid application developers or introduce major changes that will enable future minor-version improvements.
- HBASE-9121: HTrace updates.
- HBASE-8375: Durability setting per table.
- HBASE-7801: Deferred sync for WAL logs (0.94.7 and later)
- HBASE-7897: Tags supported in cell interface (for future security features).
- HBASE-5937: Refactor HLog into interface (allows for new HLogs in 0.96.x).
- HBASE-4336: Modularization of POM / Multiple jars (many follow-ons, HBASE-7898).
- HBASE-8224: Publish -hadoop1 and -hadoop2 versioned jars to Maven (CDH published jars are assumed -hadoop2).
- HBASE-9164: Move towards Cell interface in client instead of KeyValue.
- HBASE-7898: Serializing cells over RPC.
- HBASE-7725: Add ability to create custom compaction request.
- With the Sqoop 2 application, data from databases can be easily exported or imported into HDFS in a scalable manner. The Job Wizard hides the complexity of creating Sqoop jobs and the dashboard offers live progress and log access.
- Zookeeper App: Navigate and browse the Znode hierarchy and content of a Zookeeper cluster. Znodes can be added, deleted and edited. Multi-clusters are supported and various statistics are available for them.
- The Hue Shell application has been removed and replaced by the Pig Editor, HBase Browser and the Sqoop 1 apps.
- Python 2.6 is required.
- Beeswax daemon has been replaced by HiveServer2.
- CDH 5 Hue will only work with HiveServer2 from CDH 5. No support for impersonation.
- [HUE-897] - [core] Redesign of the overall layout
- [HUE-1521] - [core] Improve JobTracker High Availability
- [HUE-1493] - [beeswax] Replace the Beeswax server with HiveServer2
- [HUE-1474] - [core] Upgrade Django backend version from 1.2 to 1.4
- [HUE-1506] - [search] Impersonation support added
- [HUE-1475] - [core] Switch back from the Spawning web server
- [HUE-917] - Support SAML based authentication to enable single sign-on (SSO)
Apache Hive and HCatalog
New Features (Updated to upstream version 0.11.0):
- [HIVE-446] - Implement TRUNCATE for table data
- [HIVE-896] - Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive
- [HIVE-2693] - Add DECIMAL data type
- [HIVE-3834] - Support ALTER VIEW AS SELECT in Hive
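For readers unfamiliar with the LEAD and LAG windowing functions, the sketch below reproduces their basic semantics in plain Python for a single, already-ordered partition. It is an illustration only, not Hive code; the function names `lag` and `lead` and their `offset` and `default` parameters are chosen here for the example.

```python
def lag(values, offset=1, default=None):
    """For each row, return the value `offset` rows earlier, else `default`."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """For each row, return the value `offset` rows later, else `default`."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]


daily_totals = [10, 20, 30]          # one partition, ordered by date
previous_day = lag(daily_totals)     # [None, 10, 20]
next_day = lead(daily_totals)        # [20, 30, None]
```

Pairing each row with its predecessor this way is what makes day-over-day comparisons possible without a self-join.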
Performance improvements (from 0.12):
- [HIVE-3764] - Support metastore version consistency check
- [HIVE-305] - Port Hadoop streaming process's counters/status reporters to Hive Transforms
- [HIVE-1402] - Add parallel ORDER BY to Hive
- [HIVE-2206] - Add a new optimizer for query correlation discovery and optimization
- [HIVE-2517] - Support GROUP BY on struct type
- [HIVE-2655] - Ability to define functions in HQL
- [HIVE-4911] - Enable QOP configuration for HiveServer2 Thrift transport
Cloudera Impala 1.2.0 is now available as part of CDH 5. For more details on Impala, refer to the Impala Documentation.
Llama is a system that mediates resource management between Cloudera Impala and Hadoop YARN. Llama enables Impala to reserve, use, and release resource allocations in a Hadoop cluster. Llama is only required if resource management is enabled in Impala.
See Managing the Impala Llama ApplicationMaster for more information.
New Features (Updated to Mahout 0.8):
- Numerous performance improvements to Vector and Matrix implementations, APIs and their iterators (see also MAHOUT-1192, MAHOUT-1202)
- Numerous performance improvements to the recommender implementations (see also MAHOUT-1272, MAHOUT-1035, MAHOUT-1042, MAHOUT-1151, MAHOUT-1166, MAHOUT-1167, MAHOUT-1169, MAHOUT-1205, MAHOUT-1264)
- MAHOUT-1088: Support for biased item-based recommender.
- MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases.
- MAHOUT-1106: Support for SVD++
- MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles as well as an upgrade of the supported Lucene version to Lucene 4.3.
- MAHOUT-1154 and related: New streaming k-means implementation that offers online (and fast) clustering.
- MAHOUT-833: Make conversion to SequenceFiles Map-Reduce. 'seqdirectory' can now be run as a MapReduce job.
- MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values).
- MAHOUT-884: Matrix concatenate utility; presently only concatenates two matrices.
- Updated to Oozie 4.0.0.
- High Availability: Multiple Oozie servers can now be utilized to provide an HA Oozie service as well as provide horizontal scalability. See upstream documentation for more details.
- HCatalog Integration: HCatalog table partitions can now be used as data dependencies in coordinators. See upstream documentation for more details.
- SLA Monitoring: Oozie can now actively monitor SLA-sensitive jobs and send out notifications for SLA meets and misses. SLA information is also now available through a new SLA tab in the Oozie Web UI, JMS messages, and a REST API. See upstream documentation.
- JMS Notifications: Oozie can now publish notifications to a JMS Provider about job status changes and SLA events. See upstream documentation.
- The FileSystem action can now use glob patterns for file paths when doing move, delete, chmod, and chgrp.
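To give a feel for glob-based path selection, the sketch below filters a list of paths with Python's `fnmatch` module. This is only an analogy: Hadoop's glob syntax is not identical to `fnmatch` (for example, `fnmatch` lets `*` match `/` and has no `{a,b}` alternation), and the paths and the helper `select_paths` are made up for the example.

```python
from fnmatch import fnmatchcase


def select_paths(paths, pattern):
    """Return the paths that match a glob pattern (case-sensitive)."""
    return [p for p in paths if fnmatchcase(p, pattern)]


# Hypothetical directory listing.
listing = [
    "/data/in/part-00000",
    "/data/in/part-00001",
    "/data/in/_SUCCESS",
    "/data/out/part-00000",
]

# Select only the part files under /data/in, e.g. as the target of a delete.
matched = select_paths(listing, "/data/in/part-*")
print(matched)
```

With glob support, a single FileSystem action element can cover all matching files instead of one element per file.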
Cloudera Search 1.0.0 is now available as part of CDH 5. For more details on Search see the Search documentation.
The Cloudera Development Kit (CDK) is a set of libraries and tools that can be used with Search and other CDH components to build jobs/systems on top of the Hadoop ecosystem. See the CDK Documentation and Release Notes for more details.
Apache Sentry (incubating)
CDH 5 Beta 1 includes the first upstream release of Apache Sentry, sentry-1.2.0-incubating.
CDH 5 Sqoop 1 has been rebased on Apache Sqoop 1.4.4.
What's New in CDH 5 Beta 2
New Features and Changes in CDH 5 Beta 2
The Apache Crunch™ project develops and supports Java APIs that simplify the process of creating data pipelines on top of Apache Hadoop. The Crunch APIs are modeled after FlumeJava (PDF), which is the library that Google uses for building data pipelines on top of their own implementation of MapReduce. For more information and installation instructions, see Crunch Installation.
- Upgraded from version 0.4 to 1.1.0 (this upgrade is not backward compatible).
- New features include the UDFs SHA, SimpleRandomSample, COALESCE, ReservoirSample, EmptyBagToNullFields, and many others.
- FLUME-2294 - Added a new sink to write Kite datasets.
- FLUME-2056 - Spooling Directory Source can now pass only the name of the file in the event headers.
- FLUME-2155 - File Channel is indexed during replay to improve replay performance for faster startup.
- FLUME-2217 - Syslog Sources can optionally preserve all syslog headers in the message body.
- FLUME-2052 - Spooling Directory Source can now replace or ignore malformed characters in input files.
- As of CDH 5 Beta 2, you can upgrade HDFS with high availability (HA) enabled, if you are using Quorum-based storage. (Quorum-based storage is the only method available in CDH 5; NFS shared storage is not supported.) For upgrade instructions, see Upgrading from CDH 4 to CDH 5.
- HDFS-4949 - CDH 5 Beta 2 supports HDFS caching. For more information, see Configuring Centralized Cache Management in HDFS.
- As of CDH 5 Beta 2, you can configure an NFSv3 gateway that allows any NFSv3-compatible client to mount HDFS as a file system on the client's local file system. For more information and instructions, see Configuring an NFSv3 Gateway Using the Command Line.
- HDFS-5709 - Improve upgrade with existing files and directories named .snapshot.
- As of CDH 5 Beta 2, in order for the NameNode to start up on a secure cluster, you should have the dfs.web.authentication.kerberos.principal property defined in hdfs-site.xml. This has been documented in the CDH 5 Security Guide. For clusters managed by Cloudera Manager, you do not need to explicitly define this property.
- HDFS-5037 - Active NameNode should trigger its own edit log rolls.
- Clients will now retry for a configurable period when encountering a NameNode in Safe Mode.
- The default behavior of the mkdir command has changed. As of CDH 5 Beta 2, if the parent folder does not exist, you must specify the -p switch explicitly; otherwise the command fails.
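The new mkdir behavior mirrors the difference between creating one directory and creating a directory together with its missing parents. The sketch below shows that difference on the local filesystem with Python's standard library; it is an analogy, not HDFS code, and the helper `mkdir` with its `parents` flag is hypothetical.

```python
import os
import tempfile


def mkdir(path, parents=False):
    """Create a directory; create missing parents only when asked to."""
    if parents:
        os.makedirs(path, exist_ok=True)   # analogous to mkdir -p
    else:
        os.mkdir(path)                     # fails if the parent is missing


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "a", "b")  # parent "a" does not exist yet

    try:
        mkdir(target)
        failed_without_p = False
    except FileNotFoundError:
        failed_without_p = True

    mkdir(target, parents=True)
    created_with_p = os.path.isdir(target)
```

Scripts that relied on the old behavior of implicitly creating parent folders need the -p switch added to keep working.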
MapReduce (MRv1 and YARN)
- Fair Scheduler (in YARN and MRv1) now supports advanced configuration to automatically place applications in queues.
- MapReduce now supports running multiple reducers in uber mode and in local job runner.
- Online Schema Change is now a supported feature.
- Online Region Merge is now a supported feature.
- Namespaces: CDH 5 Beta 2 includes the namespaces feature, which enables different sets of tables to be administered by different administrative users. All upgraded tables will live in the default "hbase" namespace. Administrators may create new namespaces and create tables; users with rights to the namespace may administer permissions on the tables within the namespace.
There have been several improvements to HBase’s mean time to recovery (MTTR) in the face of Master or RegionServer failures.
- Distributed log splitting has matured, and is always activated. The option to use the old slower splitting mechanism no longer exists.
- Failure detection time has been improved. New notifications are now sent when RegionServers or Masters fail, which triggers corrective action quickly.
- The Meta table has a dedicated write-ahead log, which enables faster region recovery if the RegionServer serving meta goes down.
- The Region Balancer has been significantly updated to take more load attributes into account.
- Added TableSnapshotInputFormat and TableSnapshotScanner to perform scans over HBase table snapshots from the client side, bypassing the HBase servers. The former configures a MapReduce job, while the latter does a single client-side scan over snapshot files. Can also be used with offline HBase with in-place or exported snapshot files.
- The KeyValue API has been deprecated for applications in favor of the Cell interface. Users upgrading to HBase 0.96 may still use KeyValue, but future upgrades may remove the class or parts of its functionality. Users are encouraged to update their applications to use the new Cell interface.
- Currently Experimental features:
- Distributed log replay: This mechanism allows for faster recovery from RegionServer failures but has one special case where it will violate ACID guarantees. Cloudera does not currently recommend activating this feature.
- Bucket cache: This is an off-heap caching mechanism that uses extra RAM and block devices (such as flash drives) to greatly increase the read caching capabilities provided by the BlockCache. Cloudera does not currently recommend activating this feature.
- Favored nodes: This feature enables HBase to better control where its data is written to in HDFS in order to better preserve performance after a failure. This is disabled currently because it doesn’t interact well with the HBase Balancer or HDFS Balancer. Cloudera does not currently recommend activating this feature.
- Improved JDBC specification coverage:
- Encrypted communication between the Hive Server and Clients. This includes TLS/SSL encryption for non-Kerberos connections to HiveServer2 (HIVE-5351).
- A native Parquet SerDe is now available as part of the CDH 5 Beta 2 package. Users can directly create a Parquet format table without any external package dependency.
- HIVE-4256 - With Sentry enabled, the use <database> command is now executed as part of the connection to HiveServer2. Hence, a user with no privileges to access a database will not be allowed to connect to HiveServer2.
- Hue has been upgraded to version 3.5.0.
- Impala and Hive Editor are now one-page apps. The Editor, Progress, Table list and Results are all on the same page.
- Result graphing for the Hive and Impala Editors.
- Editor and Dashboard for Oozie SLA, crontab and credentials.
- The Sqoop2 app supports autocomplete of database and table names/fields.
- DBQuery App: MySQL and PostgreSQL Query Editors.
- New Search feature: Graphical facets
- Integrate external Web applications in any language. See this blog post for more details.
- Create Hive tables and load quoted CSV data. Tutorial available here.
- Submit any Oozie jobs directly from HDFS. Tutorial available here.
- New SAML backend enables single sign-on (SSO) with Hue.
- Oozie now supports cron-style scheduling capability.
- Oozie now supports High Availability with security.
- AvroStorage rewritten for better performance, and moved from piggybank to core Pig
- ASSERT, IN, and CASE operators added
- ParquetStorage added for integration with Parquet
Apache Spark (incubating)
Spark is a fast, general engine for large-scale data processing. For installation and configuration instructions, see Spark Installation.
Sqoop 2 has been upgraded from version 1.99.2 to 1.99.3.