What's New In CDH 5.4.x

Continue reading:

What's New in CDH 5.4.0
What's New in CDH 5.4.1
What's New in CDH 5.4.2
What's New in CDH 5.4.3
What's New in CDH 5.4.4
What's New in CDH 5.4.5
What's New in CDH 5.4.7
What's New in CDH 5.4.8
What's New in CDH 5.4.9
What's New in CDH 5.4.10
What's New in CDH 5.4.11

What's New in CDH 5.4.0

The following topics describe new features introduced in CDH 5.4.0. See also Issues Fixed in CDH 5.4.0.

Operating System Support
Security
Apache Crunch
Apache Flume
Apache Hadoop
Apache HBase
Apache Hive
Hue
Kite
Apache Oozie
Apache Parquet
Cloudera Search
Apache Spark
Apache Sqoop

For the latest Impala features, see New Features in Impala 2.2.x / CDH 5.4.x.

Operating System Support

CDH 5.4.0 adds support for RHEL and CentOS 6.6. See CDH and Cloudera Manager Supported Operating Systems.

Security

The following summarizes new security capabilities in CDH 5.4.0:

Secure Hue impersonation support for the Hue HBase application.
Redaction of sensitive data from logs, centrally managed by Cloudera Manager, which prevents the WHERE clause in queries from leaking sensitive data into logs and management UIs.
Cloudera Manager support for custom Kerberos principals.
Kerberos support for Sqoop 2.
Kerberos and TLS/SSL support for Flume Thrift source and sink.
Navigator SAML support (requires Cloudera Manager).
Navigator Key Trustee can now be installed and monitored by Cloudera Manager.
Search can be configured to use SSL.
Search supports protecting Solr and Lily HBase Indexer metadata using ZooKeeper ACLs in a Kerberos-enabled environment.
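
To make the log redaction item above more concrete, the following is a minimal Python sketch of rule-based redaction. It is not how Cloudera Manager implements the feature. The rule pattern, the replacement text, and the sample query are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical redaction rules: (compiled search pattern, replacement text).
# This one masks a 16-digit card-like number; real rules are site-specific.
RULES = [
    (re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"), "XXXX-XXXX-XXXX-XXXX"),
]

def redact(line, rules=RULES):
    """Apply every rule in order and return the redacted line."""
    for pattern, replacement in rules:
        line = pattern.sub(replacement, line)
    return line

# Hypothetical query whose WHERE clause carries a sensitive literal.
query = "SELECT name FROM customers WHERE card = '4111-1111-1111-1111'"
print(redact(query))
```

The sketch treats redaction as a pure text transformation applied before a line reaches a log or UI, which is the behavior the item describes.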

Apache Crunch

New HBase-related features:

HBaseTypes.cells() was added to support serializing HBase Cell objects.
All of the HFileUtils methods now support PCollection<C extends Cell>, which includes both PCollection<KeyValue> and PCollection<Cell>, on their method signatures.
HFileTarget, HBaseTarget, and HBaseSourceTarget all support any subclass of Cell as an output type. HFileSource and HBaseSourceTarget still return KeyValue as the input type for backward compatibility with existing Crunch pipelines.

Developers can use Cell-based APIs in the same way as KeyValue-based APIs if they are not ready to update their code, but will probably have to change code inside DoFns because HBase 0.99 and later APIs deprecated or removed a number of methods from the HBase 0.96 API.

Apache Flume

CDH 5.4.0 adds SSL and Kerberos support for the Thrift source and sink, and implements DatasetSink 2.0.

Apache Hadoop

HDFS

CDH 5.4.0 implements HDFS 2.6.0.
CDH 5.4.0 HDFS provides hot-swap capability for DataNode disk drives. You can add or replace HDFS data volumes without shutting down the DataNode host (HDFS-1362); see Performing Disk Hot Swap for DataNodes.
CDH 5.4.0 introduces cluster-wide redaction of sensitive data in logs and SQL queries. See Sensitive Data Redaction.
CDH 5.4.0 adds support for Heterogeneous Storage Policies.
HDFS 2.6.0+ supports the option to configure AES encryption for block data transfer, using the property dfs.encrypt.data.transfer.algorithm. AES offers improved cryptographic strength and performance over the prior options of 3DES and RC4.

MapReduce

CDH 5.4.0 implements MAPREDUCE-5785, which simplifies MapReduce job configuration. Instead of having to set both the heap size (mapreduce.map.java.opts or mapreduce.reduce.java.opts) and the container size (mapreduce.map.memory.mb or mapreduce.reduce.memory.mb), you can now choose to set only one of them; the other is inferred from mapreduce.job.heap.memory-mb.ratio. If you do not specify either of them, the container size defaults to 1 GB and the heap size is inferred.

For jobs that do not set the heap size, the JVM size increases from 200 MB to a default 820 MB. This is adequate for most jobs, but streaming tasks might need more memory because the Java process causes total usage to exceed the container size. This typically occurs only for those tasks relying on aggressive garbage collection to keep the heap under 200 MB.
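
The inference described above can be sketched in a few lines of Python. This is an illustration, not the Hadoop implementation: the function name is hypothetical, the ratio of 0.8 is an assumed value for mapreduce.job.heap.memory-mb.ratio that is consistent with the 1 GB and 820 MB figures in the text, and rounding up is likewise an assumption.

```python
import math

DEFAULT_CONTAINER_MB = 1024  # 1 GB default container size, per the text above
ASSUMED_RATIO = 0.8          # assumed value of mapreduce.job.heap.memory-mb.ratio

def infer_memory(container_mb=None, heap_mb=None, ratio=ASSUMED_RATIO):
    """Return (container_mb, heap_mb), inferring whichever value is missing."""
    if container_mb is None and heap_mb is None:
        # Neither value set: fall back to the default container size.
        container_mb = DEFAULT_CONTAINER_MB
    if heap_mb is None:
        # Only the container size is known: the heap is a fraction of it.
        heap_mb = math.ceil(container_mb * ratio)
    elif container_mb is None:
        # Only the heap size is known: size the container to leave headroom.
        container_mb = math.ceil(heap_mb / ratio)
    return container_mb, heap_mb

print(infer_memory())              # neither value set
print(infer_memory(heap_mb=1600))  # only the heap size set
```

With neither value set, the sketch yields a 1024 MB container and an 820 MB heap, matching the defaults stated above.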

YARN

YARN-2990 improves application launch time by 6 seconds when using FairScheduler (with the default Cloudera Manager settings shown in YARN (MR2 Included) Properties in CDH 5.4.0).

Apache HBase

CDH 5.4.0 implements HBase 1.0.

MultiWAL Support for HBase

CDH 5.4.0 introduces MultiWAL support for HBase region servers, allowing you to increase throughput when a region writes the write-ahead log (WAL). See Configuring HBase MultiWAL Support.

doAs Impersonation for HBase

CDH 5.4.0 introduces doAs impersonation for the HBase Thrift server. doAs impersonation allows a client to authenticate to HBase as any user, and re-authenticate at any time, instead of as a static user only.

Read Replicas for HBase

CDH 5.4.0 introduces read replicas, along with a new timeline consistency model. This feature allows you to balance consistency and availability on a per-read basis, and provides a measure of high availability for reads if a RegionServer becomes unavailable. See HBase Read Replicas.

Storing Medium Objects (MOBs) in HBase

CDH 5.4.0 HBase MOB allows you to store objects up to 10 MB (medium objects, or MOBs) directly in HBase while maintaining read and write performance. See Storing Medium Objects (MOBs) in HBase.

Apache Hive

CDH 5.4.0 implements Hive 1.1.0. New capabilities include:

A test-only version of Hive on Spark with the following limitations:
- Parquet does not currently support vectorization; it simply ignores the setting of hive.vectorized.execution.enabled.
- Hive on Spark does not yet support dynamic partition pruning.
- Hive on Spark does not yet support HBase. If you want to interact with HBase, Cloudera recommends that you use Hive on MapReduce.
To deploy and test Hive on Spark in a test environment, use Cloudera Manager (see Configuring Hive on Spark).
Important: Hive on Spark is included in CDH 5.4 and higher but is currently neither supported nor recommended for production use. To try this feature, use it in a test environment until Cloudera resolves the existing issues and limitations and makes it ready for production use.
Support for changing JAR files without scheduled maintenance.
To implement this capability, proceed as follows:
1. Set hive.reloadable.aux.jars.path in /etc/hive/conf/hive-site.xml to the directory that contains the JAR files.
2. Execute the reload; statement on HiveServer2 clients such as Beeline and the Hive JDBC.
Beeline support for retrieving and printing query logs.
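
For the reloadable JAR steps above, step 1 amounts to adding one property to hive-site.xml. The following Python sketch only renders that property element so its shape is clear; the helper name and the directory /opt/hive/aux-jars are hypothetical, and the sketch does not edit any real configuration file.

```python
import xml.etree.ElementTree as ET

def hive_property(name, value):
    """Build a Hadoop-style <property> element with <name> and <value> children."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop

# Hypothetical directory that holds the auxiliary JAR files.
prop = hive_property("hive.reloadable.aux.jars.path", "/opt/hive/aux-jars")
snippet = ET.tostring(prop, encoding="unicode")
print(snippet)
```

Once the property is in place, step 2 (running the reload; statement from a HiveServer2 client) picks up changes to the JAR files in that directory.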

Some features in the upstream release are not yet supported for production use in CDH; these include:

HIVE-7935 - Support dynamic service discovery for HiveServer2
HIVE-6455 - Scalable dynamic partitioning and bucketing optimization
HIVE-5317 - Implement insert, update, and delete in Hive with full ACID support
HIVE-7068 - Integrate AccumuloStorageHandler
HIVE-7090 - Support session-level temporary tables in Hive
HIVE-7341 - Support for Table replication across HCatalog instances
HIVE-4752 - Add support for HiveServer2 to use Thrift over HTTP

Hue

CDH 5.4.0 adds the following:

New Oozie editor
Performance improvements
New Search facets
HBase impersonation

Kite

Kite in CDH has been rebased on the 1.0 release upstream. This breaks backward compatibility with existing APIs. The APIs are documented at http://kitesdk.org/docs/1.0.0/apidocs/index.html.

Notable changes are:

Dataset writers that implement flush and sync now extend interfaces (Flushable and Syncable). Writers that do not support these operations no longer have misleading flush and sync methods.
DatasetReaderException, DatasetWriterException, and DatasetRepositoryException have been removed and replaced with more specific exceptions, such as IncompatibleSchemaException. Exception classes now indicate what went wrong instead of what threw the exception.
The partition API is no longer exposed; use the view API instead.
kite-data-hcatalog is now kite-data-hive.

Apache Oozie

Added a Spark action, which lets you run Spark applications from Oozie workflows. See the Oozie documentation for more details.
The Hive2 action now collects and reports Hadoop Job IDs for MapReduce jobs launched by Hive Server 2.
The launcher job now uses YARN uber mode for all but the Shell action; this reduces the overhead (time and resources) of running these Oozie actions.

Apache Parquet

Parquet memory manager now changes the row group size if the current size is expected to cause out-of-memory (OOM) errors because too many files are open. This causes a WARN message to be printed in the logs. A new setting, parquet.memory.pool.ratio, controls the percentage of the JVM's heap memory Parquet attempts to use.
To improve job startup time, footers are no longer read by default for MapReduce jobs (PARQUET-139).
Note:
To revert to the old behavior (ParquetFileReader reads in all the files to obtain the footers), set parquet.task.side.metadata to false in the job configuration.
The Parquet Avro object model can now read lists and maps written by Hive, Avro, and Thrift (similar capabilities were added to Hive in CDH 5.3). This compatibility fix does not change behavior. The extra record layer wrapping the list elements when Avro reads lists written by Hive can now be removed; to do this, set the expected Avro schema or set parquet.avro.add-list-element-records to false.
Avro's map representation now writes null values correctly.
The Parquet Thrift object model can now read data written by other object models (such as Hive, Impala, or Parquet-Avro), given a Thrift class for the data; compile a Thrift definition into an object, and supply it when creating the job.
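
Returning to the memory manager item at the top of this list, the idea of shrinking row groups to fit a memory pool can be sketched as follows. This is a simplified illustration based only on the description above, not the Parquet implementation: the function and parameter names are hypothetical, and proportional scaling is an assumption about how the adjustment is shared between open files.

```python
def scaled_row_group_sizes(requested_sizes, max_heap, pool_ratio):
    """Scale requested row group sizes down so their sum fits the memory pool.

    pool_ratio plays the role of parquet.memory.pool.ratio: the fraction of
    the JVM heap that the writers may use in total. Returns the (possibly
    reduced) sizes and a flag that says whether a warning would be logged.
    """
    pool = int(max_heap * pool_ratio)
    total = sum(requested_sizes)
    if total <= pool:
        # Everything fits: keep the requested sizes and log nothing.
        return list(requested_sizes), False
    # Too many open files for the pool: shrink every row group proportionally.
    scale = pool / total
    return [int(size * scale) for size in requested_sizes], True

# Two open files that each request 128 units against a 100-unit pool.
sizes, warned = scaled_row_group_sizes([128, 128], max_heap=200, pool_ratio=0.5)
print(sizes, warned)
```

The units are arbitrary here; the point is that the total stays within the pool once the sizes are reduced.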

Cloudera Search

Solr metadata stored in ZooKeeper can now be protected by ZooKeeper ACLs. In a Kerberos-enabled environment, Solr metadata stored in ZooKeeper is owned by the solr user and cannot be modified by other users.
Note:
- The Solr principal name can be configured in Cloudera Manager. The default name is solr, although other names can be specified.
- Collection configuration information stored under the /solr/configs znode is not affected by this change. As a result, collection configuration behavior is unchanged.
Administrators who modify Solr ZooKeeper metadata through operations like solrctl init or solrctl cluster --put-solrxml must now supply solrctl with a JAAS configuration using the --jaas configuration parameter. The JAAS configuration must specify the principal, typically solr, that the solr process uses. See solrctl Reference for more information.

End users, who typically do not need to modify Solr metadata, are unaffected by this change.
Lily HBase Indexer metadata stored in ZooKeeper can now be protected by ZooKeeper ACLs. In a Kerberos-enabled environment, Lily HBase Indexer metadata stored in ZooKeeper is owned by the solr user and cannot be modified by other users.

End users, who typically do not manage the Lily HBase Indexer, are unaffected by this change.
The Lily HBase Indexer supports restricting access using Sentry. For more information, see Configuring Lily HBase Indexer Security.
Services included with Search for CDH 5.4.0, including Solr, Key-Value Store Indexer, and Flume, now support SSL.
The Spark Indexer and the Lily HBase Batch Indexer support delegation tokens for mapper-only jobs. For more information, see Spark Indexing and HBaseMapReduceIndexerTool.
Search for CDH 5.4.0 implements SOLR-5746, which improves solr.xml file parsing. Error checking for duplicated options or unknown option names was added. These checks can help identify mistakes made during manual edits of the solr.xml file. User-modified solr.xml files may cause errors on startup due to these parsing improvements.
By default, CloudSolrServer now uses multiple threads to add documents.

Note: Due to multithreading, if document addition is interrupted by an exception, some documents, in addition to the one being added when the failure occurred, may be added.

To get the old, single-threaded behavior, set parallel updates to false on the CloudSolrServer instance.

Related JIRA: SOLR-4816.
Updates are routed directly to the correct shard leader, eliminating document routing at the server. This allows for near-linear indexing throughput scalability. Document routing requires that the SolrJ client know each document's unique identifier. The unique identifiers allow the client to route the update directly to the correct shard. For additional information, see Shards and Indexing Data in SolrCloud.

Related JIRA: SOLR-4816.
The loadSolr morphline command supports nested documents. For more information, see Morphlines Reference Guide.
Navigator can be used to audit Cloudera Search activity. For more information on the Solr operations that can be audited, see Exploring Audit Data.
Search for CDH 5.4 supports logging queries before they are executed. This allows you to identify queries that could increase resource consumption, and helps you improve schemas or filters to meet your performance requirements. To enable this feature, set the SolrCore and SolrCore.Request log level to DEBUG.

Related JIRA: SOLR-6919
UniqFieldsUpdateProcessorFactory, which Solr Server implements, has been improved to support all of the FieldMutatingUpdateProcessorFactory selector options. The <lst named="fields"> init param option is deprecated. Replace this option with <arr name="fieldName">.

If the <lst named="fields"> init param option is used, Solr logs a warning.
Related JIRA: SOLR-4249.
Configuration information was previously available using FieldMutatingUpdateProcessorFactory (oneOrMany or getBooleanArg). Those methods are now deprecated. The methods have been moved to NamedList and renamed to removeConfigArgs and removeBooleanArg, respectively.

If the oneOrMany or getBooleanArg methods of FieldMutatingUpdateProcessorFactory are used, Solr logs a warning.
Related JIRA: SOLR-5264.
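
For the direct shard-leader routing item earlier in this list, the following Python sketch shows why the client needs each document's unique identifier: the identifier alone decides the target shard, so documents can be batched per shard before anything is sent. This is not SolrJ code. The function names and shard names are hypothetical, and zlib.crc32 with a modulo is a stand-in for Solr's actual hashing and shard hash ranges.

```python
import zlib
from collections import defaultdict

def shard_for(doc_id, shards):
    """Pick a shard from the document's unique identifier (stand-in hash)."""
    index = zlib.crc32(doc_id.encode("utf-8")) % len(shards)
    return shards[index]

def group_by_shard(docs, shards, id_field="id"):
    """Group documents by target shard so each batch can go to its leader."""
    batches = defaultdict(list)
    for doc in docs:
        batches[shard_for(doc[id_field], shards)].append(doc)
    return dict(batches)

# Hypothetical shard names and documents.
shards = ["shard1", "shard2", "shard3"]
docs = [{"id": "doc-%d" % n, "title": "Title %d" % n} for n in range(10)]
batches = group_by_shard(docs, shards)
for shard, batch in sorted(batches.items()):
    print(shard, [doc["id"] for doc in batch])
```

Because the shard is a pure function of the identifier, the same document always lands in the same batch, and no server-side forwarding step is needed in this model.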

Apache Spark

CDH 5.4.0 Spark is rebased on Apache Spark 1.3.0 and provides the following new capabilities:

Spark Streaming WAL (write-ahead log) on HDFS, preventing any data loss on driver failure
Kafka connector for Spark Streaming to avoid the need for the HDFS WAL
Spark Streaming recovery is supported for production use
Spark external shuffle service
Improvements in automatically setting CDH classpaths for Avro, Parquet, Flume, and Hive
Improvements in the collection of task metrics

The following is not supported in a production environment because of its immaturity:

Spark SQL (which now includes DataFrames)

Apache Sqoop

Sqoop 2:
- Implements Sqoop 2 version 1.99.5.
- Sqoop 2 supports Kerberos.
- Sqoop 2 supports PostgreSQL as the repository database.

What's New in CDH 5.4.1

This is a maintenance release that fixes the following issue; for details of other important fixes, see Issues Fixed in CDH 5.4.1.

Upgrades to CDH 5.4.1 from Releases Earlier than 5.4.0 May Fail

Problem: Because of a change in the implementation of the NameNode metadata upgrade mechanism, upgrading to CDH 5.4.1 from a version lower than 5.4.0 can take an inordinately long time. In a cluster with NameNode high availability (HA) configured and a large number of edit logs, the upgrade can fail, with errors indicating a timeout in the pre-upgrade step on JournalNodes.

What to do:

To avoid the problem: Do not upgrade to CDH 5.4.1; upgrade to CDH 5.4.2 instead.

If you experience the problem: If you have already started an upgrade and seen it fail, contact Cloudera Support. This problem involves no risk of data loss, and manual recovery is possible.

If you have already completed an upgrade to CDH 5.4.1, or are installing a new cluster: In this case you are not affected and can continue to run CDH 5.4.1.

Cloudera Search

Beginning with CDH 5.4.1, Search for CDH supports configurable transaction log replication levels for replication logs stored in HDFS. For more information, see the Transaction Log Replication section in Replication.

Apache Spark

Spark supports submitting Python applications in cluster mode.

What's New in CDH 5.4.2

This is a maintenance release that fixes the following issue; for details of other important fixes, see Issues Fixed in CDH 5.4.2.

Upgrades to CDH 5.4.1 from Releases Earlier than 5.4.0 May Fail

What to do:

To avoid the problem: Do not upgrade to CDH 5.4.1; upgrade to CDH 5.4.2 instead.

If you experience the problem: If you have already started an upgrade and seen it fail, contact Cloudera Support. This problem involves no risk of data loss, and manual recovery is possible.

If you have already completed an upgrade to CDH 5.4.1, or are installing a new cluster: In this case you are not affected and can continue to run CDH 5.4.1.

What's New in CDH 5.4.3

This is a maintenance release that fixes the following issue; for details of other important fixes, see Issues Fixed in CDH 5.4.3.

NameNode Incorrectly Reports Missing Blocks During Rolling Upgrade

Problem: During a rolling upgrade to any of the releases listed below, the NameNode may report missing blocks after rolling back multiple DataNodes. This is caused by a race condition with block reporting between the DataNode and the NameNode. No permanent data loss occurs, but data can be unavailable for up to six hours before the problem corrects itself.

Releases affected: CDH 5.0.6, 5.1.5, 5.2.5, 5.3.3, 5.4.1, 5.4.2

What to do:

To avoid the problem: Cloudera advises skipping the affected releases and installing a release containing the fix. For example, do not upgrade to CDH 5.4.2; upgrade to CDH 5.4.3 instead.

The releases containing the fix are: CDH 5.3.4, 5.4.3

If you have already completed an upgrade to an affected release, or are installing a new cluster: You can continue to run the release, or upgrade to a release that is not affected.

What's New in CDH 5.4.4

This is a maintenance release that fixes some important issues. For details, see Issues Fixed in CDH 5.4.4.

What's New in CDH 5.4.5

This is a maintenance release that fixes some important issues. For details, see Issues Fixed in CDH 5.4.5.

What's New in CDH 5.4.7

This is a maintenance release that fixes some important issues. For details, see Issues Fixed in CDH 5.4.7.

What's New in CDH 5.4.8

This is a maintenance release that fixes some important issues. For details, see Issues Fixed in CDH 5.4.8.

What's New in CDH 5.4.9

This is a maintenance release that fixes some important issues. For details, see Issues Fixed in CDH 5.4.9.

What's New in CDH 5.4.10

This is a maintenance release that fixes some important issues. For details, see Issues Fixed in CDH 5.4.10.

What's New in CDH 5.4.11

This is a maintenance release that fixes some important issues. For details, see Issues Fixed in CDH 5.4.11.
