What's New in CDH 5.15.x

The following sections describe new features introduced in 5.15.0.

Apache Flume

You can use Cloudera Manager to configure Flume to communicate with Kafka sources, sinks, and channels over TLS. The Kafka Service property in Cloudera Manager allows you to select a dependent Kafka service for the Flume service. Cloudera Manager then creates jaas.conf and flume.keytab files and adds Kafka security properties to the Flume configuration file.

For more information about how to use Cloudera Manager to configure Flume and Kafka communication, see Configuring Flume Security with Kafka.

Apache Hadoop

HDFS Immutable Snapshots

You can now enable immutable snapshots for HDFS. When immutable snapshots are enabled, HDFS captures the sizes of files that are open (being written to) at the time the snapshot is taken. If immutable snapshots are not enabled, the sizes of those files in the snapshot continue to grow with any appends to the files. Immutable snapshots therefore provide a more accurate point-in-time view of HDFS.

This feature is useful when you use snapshots with replication, because it prevents failures caused by file sizes growing on the source.
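The difference between the two modes can be pictured with a small toy model. This is only a sketch of the behavior described above, not HDFS code: the LiveFile and Snapshot classes, and the file path used in the usage note, are hypothetical names invented for illustration.

```python
class LiveFile:
    """A file that may still be open for appends (hypothetical model)."""

    def __init__(self, length: int) -> None:
        self.length = length

    def append(self, num_bytes: int) -> None:
        self.length += num_bytes


class Snapshot:
    """Toy snapshot over a dict of path -> LiveFile (hypothetical model)."""

    def __init__(self, files: dict, immutable: bool) -> None:
        self._files = files
        # With immutable snapshots, record each file's length at snapshot time.
        # Without them, the snapshot keeps reporting the live, growing length.
        self._frozen = (
            {path: f.length for path, f in files.items()} if immutable else None
        )

    def size_of(self, path: str) -> int:
        if self._frozen is not None:
            return self._frozen[path]
        return self._files[path].length
```

In this model, if a 100-byte open file receives a 50-byte append after both kinds of snapshot are taken, Snapshot(files, immutable=True).size_of(path) still reports 100, while the non-immutable snapshot reports 150.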

To enable immutable snapshots, open the Cloudera Manager Admin Console, select <HDFS service> > Configuration and search for the Enable Immutable Snapshots property.

You must enable immutable snapshots to use snapshot diff-based replication for Cloudera Manager's Backup and Disaster Recovery. For more information, see What's New in Cloudera Manager 5.15.0.

Hue

End-user Data Catalog improvements:

  • Simpler top table search
  • Unification and caching of all SQL metadata (Hive, Navigator, NavOpt)

Apache Impala

  • Added the TABLESAMPLE clause to the COMPUTE STATS statement. See COMPUTE STATS Statement for information about the new clause.
  • Extended COMPUTE STATS to support a list of columns. See COMPUTE STATS Statement for the new syntax.
  • Added the new COMPUTE_STATS_MIN_SAMPLE_SIZE query option. The query option specifies the minimum number of bytes that will be scanned in COMPUTE STATS TABLESAMPLE, regardless of the user-supplied sampling percent. See COMPUTE_STATS_MIN_SAMPLE_SIZE Query Option.
  • Added a TBLPROPERTY for controlling stats extrapolation on a per-table basis: impala.enable.stats.extrapolation=true/false. See Table and Column Statistics for information about stats extrapolation.
  • Added the new built-in regexp_escape function. The function escapes the following characters in its input so that they are interpreted literally rather than as regular expression special characters: .\+*?[^]$(){}=!<>|:-

    See Impala String Functions for information about the regexp_escape function.

  • Enhanced the existing ltrim and rtrim string functions to accept an argument that specifies a set of characters to be trimmed from the input string. See STRING Data Type for information about the functions.
  • Implemented the murmur_hash function. See Impala Mathematical Functions for information about the new function.
  • Introduced support for the Kudu DECIMAL type, which is available in Kudu 1.7.0.
  • Impala now maps signed integer logical types in Parquet to supported Impala column types as follows:
    • INT_8 -> TINYINT
    • INT_16 -> SMALLINT
    • INT_32 -> INT
    • INT_64 -> BIGINT
  • Parquet dictionary filtering now works on nested data.
  • Impala's Parquet scanner now uses the existing column chunk level null_count statistic to skip an entire row group when the statistic indicates that all values in the predicated column are NULL, because that row group would return no result rows.
  • The Oracle-style hint placement for INSERT statements is now supported. See Optimizer Hints in Impala for information on hints in Impala SQL.

  • Insert plan hints for CREATE TABLE AS SELECT are now supported. See Optimizer Hints in Impala for information on hints in Impala SQL.
  • Improved concurrency of DDL and DML operations during catalog updates.
  • The statestore update logic was improved to reduce issues, such as too many queries being admitted by different coordinators, or queries being queued for longer than necessary and blocking subsequent updates to different topics.
  • The size limit for statestore updates was increased, and both the copying of metadata and the memory footprint were reduced. Catalog objects are now passed and (de)compressed between the FE and BE one at a time.
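Three of the items above (escaping regular expression special characters, trimming a set of characters, and skipping all-NULL row groups) can be sketched in Python. These are illustrative models of the behavior described in the list, not Impala's implementation: every function name below is hypothetical, and the trim sketch assumes the extra argument is treated as a set of individual characters, as the text states.

```python
import re

# The special characters listed above for the escape function.
SPECIAL_CHARS = set(r".\+*?[^]$(){}=!<>|:-")


def regexp_escape_sketch(source: str) -> str:
    """Prefix each listed special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in source)


def ltrim_sketch(source: str, chars_to_trim: str = " ") -> str:
    """Remove leading characters that appear anywhere in chars_to_trim."""
    return source.lstrip(chars_to_trim)


def rtrim_sketch(source: str, chars_to_trim: str = " ") -> str:
    """Remove trailing characters that appear anywhere in chars_to_trim."""
    return source.rstrip(chars_to_trim)


def can_skip_row_group(num_values: int, null_count) -> bool:
    """Decide whether a row group can be skipped for a predicated column.

    null_count is None when the statistic is missing; in that case the
    row group must be scanned. Otherwise it is skipped only when every
    value in the column chunk is NULL.
    """
    return null_count is not None and null_count == num_values


def matches_literally(text: str, candidate: str) -> bool:
    """Use the escaped text as a pattern that matches only itself."""
    return re.fullmatch(regexp_escape_sketch(text), candidate) is not None
```

For example, matches_literally("a.b", "a.b") is true while matches_literally("a.b", "axb") is false, because the escaped dot no longer acts as a wildcard. Similarly, ltrim_sketch("xxyhello", "xy") returns "hello", and can_skip_row_group(1000, 1000) is true while can_skip_row_group(1000, 999) is false.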

Apache Kudu

Starting with Apache Kudu 1.5.0 / CDH 5.13.x, Kudu has been fully integrated into CDH. Kudu now ships as part of the CDH parcel and packages. The documentation for Kudu has also been incorporated into the Cloudera Enterprise documentation.

For a complete list of new features and changes introduced in Kudu (in CDH 5.15), see What's New in Apache Kudu.

Apache Spark

More flexibility to interpret TIMESTAMP values written by Impala. When the spark.sql.parquet.int96TimestampConversion configuration setting is true, Spark reads TIMESTAMP values from Parquet files written by Impala without applying any adjustment from UTC to the local time zone of the server. This behavior provides better interoperability for Parquet data written by Impala, which does not apply any time zone adjustment to TIMESTAMP values when reading or writing them.
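The effect of the setting can be modeled with plain Python datetime arithmetic. This is a sketch of the behavior described above, not Spark code: the interpret_timestamp function and its parameters are hypothetical, and a fixed UTC offset stands in for the server's local time zone.

```python
from datetime import datetime, timedelta, timezone


def interpret_timestamp(
    stored: datetime,
    local_tz: timezone,
    int96_timestamp_conversion: bool,
) -> datetime:
    """Model how a stored wall-clock TIMESTAMP is presented to the reader.

    stored is the naive value found in the Parquet file written by Impala.
    """
    if int96_timestamp_conversion:
        # Setting enabled: return the value exactly as Impala wrote it.
        return stored
    # Setting disabled: treat the value as UTC and shift it to local time.
    as_utc = stored.replace(tzinfo=timezone.utc)
    return as_utc.astimezone(local_tz).replace(tzinfo=None)
```

With a server eight hours behind UTC, a stored value of 12:00 is presented as 04:00 when the setting is false and as 12:00 when it is true, which matches what Impala itself would return.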