What's New In CDH 5.2.x

Continue reading:

What's New in CDH 5.2.0
What's New in CDH 5.2.1
What's New in CDH 5.2.3
What's New in CDH 5.2.4
What's New in CDH 5.2.5
What's New in CDH 5.2.6

What's New in CDH 5.2.0

CDH 5.2.0 is a minor release which includes new features and bug fixes.

Go to Issues Fixed In CDH 5.2.0 or keep reading for New Features and Changes.

New Features and Changes

New features and changes are grouped by area.

Operating System Support
Apache Avro
Apache Hadoop
Apache Crunch
Apache Flume
Hue
Apache HBase
Apache Hive
Impala
Kite
Apache Mahout
Apache Oozie
Apache Parquet (incubating)
Cloudera Search
Apache Sentry (incubating)
Apache Spark
Apache Sqoop

Operating System Support

CDH 5.2.0 adds support for Ubuntu Trusty (version 14.04). See CDH and Cloudera Manager Supported Operating Systems.

Apache Avro

CDH 5.2 implements Avro version 1.7.6, with backports from 1.7.7. Important changes include:

AVRO-1398: Increase default sync interval from 16k to 64k. There is a very small chance this could cause an incompatibility in some cases, but you can control the interval by setting avro.mapred.sync.interval in the MapReduce job configuration. For example, set it to 16000 to get the old behavior.
AVRO-1355: Record schema should reject duplicate field names. This change rejects schemas with duplicate field names. This could affect some applications, but if schemas have duplicate field names then they are unlikely to work properly in any case. The workaround is to make sure a record's field names are unique within the record.
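Because AVRO-1355 makes duplicate field names a hard error, it can help to check schemas before upgrading. The following is a minimal sketch, not part of Avro: the function name and the example schema are hypothetical, and it inspects only the top-level record (nested records are not traversed).

```python
import json
from collections import Counter


def duplicate_field_names(schema_json):
    """Return the field names that occur more than once in a record schema."""
    schema = json.loads(schema_json)
    # Only top-level record schemas are inspected in this sketch.
    if not isinstance(schema, dict) or schema.get("type") != "record":
        return []
    counts = Counter(field["name"] for field in schema.get("fields", []))
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical schema in which "id" is declared twice.
EXAMPLE_SCHEMA = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
    {"name": "id", "type": "string"}
  ]
}
"""

if __name__ == "__main__":
    print(duplicate_field_names(EXAMPLE_SCHEMA))
```

A schema that returns an empty list from this check has unique field names at the top level, which is the workaround described above.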

Apache Hadoop

HDFS

CDH 5.2 provides the following new capabilities:

HDFS Data at Rest Encryption
Note: Cloudera provides the following two solutions for data at rest encryption:
- Navigator Encrypt - Is production ready and available for Cloudera customers licensed for Cloudera Navigator. Navigator Encrypt operates at the Linux volume level, so it can encrypt cluster data inside and outside HDFS. Talk to your Cloudera account team for more information about this capability.
- HDFS Encryption - Included in CDH 5.2.0 and operates at the HDFS folder level, enabling encryption to be applied only to HDFS folders where needed. This feature has several known limitations. Therefore, Cloudera does not currently support this feature in CDH 5.2 and it is not recommended for production use. To try the feature, upgrade to the latest version of CDH 5.
  HDFS now implements transparent, end-to-end encryption of data read from and written to HDFS by creating encryption zones. An encryption zone is a directory in HDFS with all of its contents, that is, every file and subdirectory in it, encrypted. You can use either the KMS or the Key Trustee service to store, manage, and access encryption zone keys. For more information, see HDFS Transparent Encryption.
Extended attributes: HDFS XAttrs allow extended attributes to be stored per file (https://issues.apache.org/jira/browse/HDFS-2006).
Authentication improvements when using an HTTP proxy server.
A new Hadoop Metrics sink that allows writing directly to Graphite.
Specification for Hadoop Compatible Filesystem effort.
OfflineImageViewer to browse an fsimage via the WebHDFS API.
Supportability improvements and bug fixes to the NFS gateway.
Modernized web UIs (HTML5 and JavaScript) for HDFS daemons.

MapReduce

CDH 5.2 provides an optimized implementation of the mapper side of the MapReduce shuffle. The optimized implementation may require tuning different from the original implementation, and so it is considered experimental and is not enabled by default.

You can select this new implementation on a per-job basis by setting the job configuration value mapreduce.job.map.output.collector.class to org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator, or use Cloudera Manager to enable it.

Some jobs which use custom writable types or comparators may not be able to take advantage of the optimized implementation.
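As a rough illustration of the per-job setting, the sketch below assembles a hadoop jar command line that passes the property as a generic -D option. The property name and class come from the text above; the function name, JAR name, driver class, and job arguments are hypothetical, and the -D option is only honored if the driver parses generic options (for example, through ToolRunner).

```python
# Property name and collector class as given in the release notes.
COLLECTOR_KEY = "mapreduce.job.map.output.collector.class"
NATIVE_COLLECTOR = (
    "org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator"
)


def native_collector_args(jar, main_class, job_args):
    """Build a `hadoop jar` argument list that enables the collector for one job."""
    return [
        "hadoop", "jar", jar, main_class,
        # Generic options go after the main class and before job arguments.
        "-D", f"{COLLECTOR_KEY}={NATIVE_COLLECTOR}",
        *job_args,
    ]


if __name__ == "__main__":
    # Hypothetical JAR, driver class, and paths.
    print(" ".join(native_collector_args(
        "my-job.jar", "com.example.MyDriver", ["/input", "/output"])))
```

Keeping the setting on the command line rather than in cluster-wide configuration makes it easy to compare the experimental and original implementations job by job.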

YARN

CDH 5.2 provides the following new capabilities and improvements:

New features and improvements in the Fair Scheduler:
- New features:
  - Fair Scheduler now allows setting the fairsharePreemptionThreshold per queue (leaf and non-leaf). This threshold is a decimal value between 0 and 1; if a queue's usage is under (preemption-threshold * fairshare) for a configured duration, resources from other queues are preempted to satisfy this queue's request. Set this value in fair-scheduler.xml. The default value is 0.5.
  - Fair Scheduler now allows setting the fairsharePreemptionTimeout per queue (leaf and non-leaf). For a starved queue, this timeout determines when to trigger preemption from other queues. Set this value in fair-scheduler.xml.
  - Fair Scheduler now shows the Steady Fair Share in the Web UI. The Steady Fair Share is the share of the cluster resources a particular queue or pool would get if all existing queues had running applications.
- Improvements:
  - Fair Scheduler uses Instantaneous Fair Share (fairshare that considers only active queues) for scheduling decisions to improve the time to achieve steady state (fairshare).
  - The default for maxAMShare is now 0.5, meaning that only half the cluster's resources can be taken up by Application Masters. You can change this value in fair-scheduler.xml.
YARN's REST APIs support submitting and killing applications.
YARN's timeline store is integrated with Kerberos.
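The Fair Scheduler quantities described above can be made concrete with some simple arithmetic. This is a simplified sketch, not the scheduler's implementation: it assumes all queues have equal weight, treats resources as a single number, ignores the preemption timeout, and uses hypothetical function and queue names.

```python
def steady_fair_share(cluster_resources, queues):
    """Share each queue would get if every existing queue had running applications."""
    return cluster_resources / len(queues)


def instantaneous_fair_share(cluster_resources, queues, active):
    """Share each active queue gets when only active queues are considered."""
    active_queues = [q for q in queues if q in active]
    if not active_queues:
        return 0.0
    return cluster_resources / len(active_queues)


def is_below_preemption_threshold(usage, fair_share, threshold=0.5):
    """True if usage is under (threshold * fair share), the preemption condition."""
    return usage < threshold * fair_share


if __name__ == "__main__":
    queues = ["a", "b", "c", "d"]          # hypothetical queue names
    steady = steady_fair_share(100, queues)
    instant = instantaneous_fair_share(100, queues, active={"a", "b"})
    print(steady, instant, is_below_preemption_threshold(20, instant))
```

With four equally weighted queues of which two are active, the instantaneous share is twice the steady share, which is why scheduling on the instantaneous value reaches a fair allocation sooner.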

Apache Crunch

CDH 5.2 provides the following new capabilities:

Improvements in Scrunch, including:
- New join API that matches the one in Crunch
- New aggregation API, including support for Algebird-based aggregations
- Built-in serialization support for all tuple types as well as case classes.
A new module, crunch-hive, for reading and writing Optimized Row Columnar (ORC) Files with Crunch.

Apache Flume

CDH 5.2 provides the following new capabilities:

Kafka Integration: Flume can now accept data from Kafka via the KafkaSource (FLUME-2250) and push to Kafka using the KafkaSink (FLUME-2251).
Kite Sink can now write to Hive and HBase datasets (FLUME-2463).
Flume agents can now be configured via ZooKeeper (experimental, FLUME-1491).
Embedded Agents now support Interceptors (FLUME-2426).
syslog Sources now support configuring which fields should be kept (FLUME-2438).
File Channel replay is now much faster (FLUME-2450).
New regular-expression search-and-replace interceptor (FLUME-2431).
Backup checkpoints can be optionally compressed (FLUME-2401).

Hue

CDH 5.2 provides the following new capabilities:

New application for editing Sentry roles and privileges on databases and tables
Search App:
- Heatmap, Tree, and Leaflet widgets
- Micro-analysis of fields
- Exclusion facets
Oozie Dashboard: bulk actions, faster display
File Browser: drag-and-drop upload, history, ACL editing
Hive and Impala: LDAP pass-through, query expiration, TLS/SSL (Hive), new graphs
Job Browser: YARN kill application button

Apache HBase

CDH 5.2 implements HBase 0.98.6, which represents a minor upgrade to HBase. This upgrade introduces new features and moves some features which were previously marked as experimental to fully supported status.

Apache Hive

CDH 5.2 introduces the following important changes in Hive.

CDH 5.2 implements Hive 0.13, providing the following new capabilities:
- Sub-queries in the WHERE clause
- Common table expressions (CTE)
- Parquet supports timestamp
- HiveServer2 can be configured with a hiverc file that is automatically run when users connect
- Permanent UDFs
- HiveServer2 session and operation timeouts
- Beeline accepts a -i option to initialize with a SQL file
- New join syntax (implicit joins)
As of CDH 5.2.0, you can create Avro-backed tables simply by using STORED AS AVRO in a DDL statement. The AvroSerDe takes care of creating the appropriate Avro schema from the Hive table schema, making it much easier to use Avro with Hive.
Hive supports additional datatypes, as follows:
- Hive can read char and varchar datatypes written by Hive, and char and varchar datatypes written by Impala.
- Impala can read char and varchar datatypes written by Hive and Impala.
These new types have been enabled by expanding the supported DDL, so they are backward compatible. You can add varchar(n) columns by creating new tables with that type, or changing a string column in existing tables to varchar.
Note:
char(n) columns are not stored in a fixed-length representation, and do not improve performance (as they do in some other databases). Cloudera recommends that in most cases you use text or varchar instead.
DESCRIBE DATABASE returns additional fields: owner_name and owner_type. The command will continue to behave as expected if you identify the field you're interested in by its (string) name, but could produce unexpected results if you use a numeric index to identify the field(s).
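The DESCRIBE DATABASE change is easiest to see with a small example of name-based versus positional access. The column layouts and row values below are assumptions for illustration (only owner_name and owner_type are named in the text above), and the helper function is hypothetical.

```python
# Assumed column layouts before and after the change (illustrative only).
OLD_COLUMNS = ["db_name", "comment", "location", "parameters"]
NEW_COLUMNS = ["db_name", "comment", "location",
               "owner_name", "owner_type", "parameters"]


def field_by_name(columns, row, name):
    """Look up a value by column name, independent of column position."""
    return row[columns.index(name)]


# Hypothetical result rows for the same database.
old_row = ["sales", "Sales data", "hdfs://nn/warehouse/sales.db", ""]
new_row = ["sales", "Sales data", "hdfs://nn/warehouse/sales.db",
           "alice", "USER", ""]

if __name__ == "__main__":
    # Positional access silently changes meaning once new fields are inserted.
    print(old_row[3], "|", new_row[3])
    # Name-based access keeps returning the intended field.
    print(field_by_name(OLD_COLUMNS, old_row, "location"))
    print(field_by_name(NEW_COLUMNS, new_row, "location"))
```

Under these assumed layouts, index 3 refers to a different field before and after the upgrade, while lookups by name are unaffected.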

Impala

Impala in CDH 5.2.0 includes major new features such as spill-to-disk for memory-intensive queries, subquery enhancements, analytic functions, and new CHAR and VARCHAR data types. For the full feature list and more details, see What's New in Apache Impala.

Kite

Kite is an open source set of libraries, references, tutorials, and code samples for building data-oriented systems and applications. For more information about Kite, see the Kite SDK Development Guide.

Kite has been rebased to version 0.15.0 in CDH 5.2.0, from the base version 0.10.0 in CDH 5.1. kite-morphlines modules are backward-compatible, but this change breaks backward-compatibility for the kite-data API.

Kite Data

The Kite data API has had substantial updates since the version included in CDH 5.1.

Changes from 0.15.0

The Kite version in CDH 5.2 is based on 0.15.0, but includes some newer changes. Specifically, it includes support for dataset namespaces, which can be used to set the database in the Hive Metastore.

The introduction of namespaces changed the file system repository layout; now there is an additional namespace directory for datasets stored in HDFS (repository/namespace/dataset/). There are no compatibility problems when you use Dataset URIs, but all datasets created with the DatasetRepository API will be located in a namespace directory. This new directory level is not expected in Kite 0.15.0 or 0.16.0 and will prevent the dataset from being loaded. The work-around is to switch to using Dataset URIs (see below) that include the namespace component. Existing datasets will work without modification.

Except as noted above, Kite 0.15.0 in CDH 5.2 is fully backward-compatible. It can load datasets written with any previous Kite version.

Dataset URIs

Datasets are identified with a single URI, rather than a repository URI and dataset name. The dataset URI contains all the information Kite needs to determine which implementation (Hive, HBase, or HDFS) to use for the dataset, and includes both the dataset's name and a namespace.

The Kite API has been updated so that developers call methods in the Datasets utility class as they would use DatasetRepository methods. The Datasets methods are recommended, and the DatasetRepository API is deprecated.
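To show how a single URI can carry the implementation, namespace, and dataset name, here is a small sketch that builds dataset URI strings. The URI patterns (dataset:hive:namespace/name and dataset:hdfs:/path/namespace/name) are assumptions based on the Kite documentation and should be checked against the Kite version in use; the helper names and example values are hypothetical.

```python
def hive_dataset_uri(namespace, name):
    """Build a URI for a Hive-backed dataset (assumed pattern)."""
    return f"dataset:hive:{namespace}/{name}"


def hdfs_dataset_uri(repo_path, namespace, name):
    """Build a URI for an HDFS-backed dataset (assumed pattern).

    The namespace becomes an extra directory level under the repository
    path, matching the repository/namespace/dataset/ layout.
    """
    return f"dataset:hdfs:{repo_path.rstrip('/')}/{namespace}/{name}"


if __name__ == "__main__":
    # Hypothetical namespace, dataset name, and repository path.
    print(hive_dataset_uri("default", "events"))
    print(hdfs_dataset_uri("/data/repo", "default", "events"))
```

Because the namespace is part of the URI, code that addresses datasets this way is not affected by the additional namespace directory in the file system layout.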

Views

The Kite data API now allows you to select a view of the dataset by setting constraints. These constraints are used by Kite to automatically prune unnecessary partitions and filter records.
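The idea of pruning partitions and then filtering records can be sketched in a few lines. This is a conceptual illustration only and does not use the Kite API: the functions, partition keys, and records are hypothetical, and constraints are limited to simple equality.

```python
def prune_partitions(partitions, constraints):
    """Keep partitions that are consistent with every constraint on a partition key."""
    return [
        p for p in partitions
        if all(p[key] == value for key, value in constraints.items() if key in p)
    ]


def filter_records(records, constraints):
    """Keep records that satisfy every constraint."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in constraints.items())
    ]


if __name__ == "__main__":
    # Hypothetical partitions keyed by year and month.
    partitions = [
        {"year": 2013, "month": 12},
        {"year": 2014, "month": 1},
        {"year": 2014, "month": 2},
    ]
    records = [
        {"year": 2014, "month": 1, "user": "alice"},
        {"year": 2014, "month": 1, "user": "bob"},
    ]
    constraints = {"year": 2014, "user": "alice"}
    print(prune_partitions(partitions, constraints))
    print(filter_records(records, constraints))
```

The constraint on year removes whole partitions before any data is read, while the constraint on user, which is not a partition key here, can only be applied record by record.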

MapReduce input and output formats

The kite-data-mapreduce module has been added. It provides both DatasetKeyInputFormat and DatasetKeyOutputFormat that allow you to run MapReduce jobs over datasets or views. Spark is also supported by the input and output formats.

Dataset CLI tool

Kite now includes a command-line utility that can run common maintenance tasks, like creating a dataset, migrating a dataset's schema, copying from one dataset to another, and importing CSV data. It also has helpers that can create Avro schemas from data files and other Kite-related configuration.

Flume DatasetSink

The Flume DatasetSink has been updated for the kite-data API changes. It supports all previous configurations without modification.

In addition, the DatasetSink now supports dataset URIs with the configuration option kite.dataset.uri.

Apache Mahout

Mahout jobs launched from the bin/mahout script will now use the cluster's default parameters, rather than hard-coded parameters from the library. This may change the algorithms' run-time behavior, possibly for the better. (MAHOUT-1565.)

Apache Oozie

CDH 5.2 introduces the following important changes:

A new Hive 2 Action allows Oozie to run HiveServer2 scripts. Using the Hive Action with HiveServer2 is now deprecated; you should switch to the new Hive 2 Action as soon as possible.
The MapReduce action can now also be configured by Java code. This gives users the flexibility of using their own driver Java code for configuring the MR job, while also getting the advantages of the MapReduce action (instead of using the Java action). See the documentation for more info.
The PurgeService is now able to remove completed child jobs from long-running coordinator jobs.
ALL can now be set for oozie.service.LiteWorkflowStoreService.user.retry.error.code.ext to make Oozie retry actions automatically for every type of error.
All Oozie servers in an Oozie HA group now synchronize on the same randomly generated rolling secret for signing auth tokens.
You can now upgrade from CDH 4.x to CDH 5.2 and later with jobs in RUNNING and SUSPENDED states. (An upgrade from CDH 4.x to a CDH 5.x release earlier than CDH 5.2.0 would still require that no jobs be in either of those states.)

Apache Parquet (incubating)

CDH 5.2 Parquet is rebased on Parquet 1.5 and Parquet-format 2.1.0.

Cloudera Search

New Features:

Cloudera Search adds support for Spark indexing using the CrunchIndexerTool. For more information, see Spark Indexing.
Cloudera Search adds fault tolerance for single-shard deployments. This fault tolerance is enabled with a new -a option in solrctl, which configures shards to automatically be re-added on an existing, healthy node if the node hosting the shard becomes unavailable.
Components of Cloudera Search include Kite 0.15.0. This includes all morphlines-related backports of all fixes and features in Kite 0.17.0. For additional information on Kite, see:
Search adds support for multi-threaded faceting on fields. This enables parallelizing operations, allowing them to run more quickly on highly concurrent hardware. This is especially helpful in cases where faceting operations apply to large datasets over many fields. For more information, see Tuning the Solr Server.
Search adds support for distributed pivot faceting, enabling faceting on multi-shard collections.

Apache Sentry (incubating)

CDH 5.2 introduces the following changes to Sentry.

Sentry Service:

If you are using the database-backed Sentry service, upgrading from CDH 5.1 to CDH 5.2 will require a schema upgrade. For instructions, see Upgrading CDH Using Cloudera Manager.
Hive SQL Syntax:
- GRANT and REVOKE statements have been expanded to include WITH GRANT OPTION, thus allowing you to delegate granting and revoking privileges.
- The SHOW GRANT ROLE command has been updated to allow non-admin users to list grants for roles that are currently assigned to them.
- The SHOW ROLE GRANT GROUP <groupName> command has been updated to allow non-admin users that are part of the group specified by <groupName> to list all roles assigned to this group.
  For more details on these changes, see the updated Hive SQL Syntax for Use with Sentry.

Apache Spark

CDH 5.2 Spark is rebased on Apache Spark/Streaming 1.1 and provides the following new capabilities:

Stability and performance improvements.
New sort-based shuffle implementation (disabled by default).
Better performance monitoring through the Spark UI.
Support for arbitrary Hadoop InputFormats in PySpark.
Improved YARN support with several bug fixes.

Apache Sqoop

CDH 5.2 Sqoop 1 is rebased on Sqoop 1.4.5 and includes the following changes:

Mainframe connector added.
Parquet support added.

There are no changes for Sqoop 2.

What's New in CDH 5.2.1

CDH 5.2.1 is a maintenance release that fixes the “POODLE” and Apache Hadoop Distributed Cache vulnerabilities described below. All CDH 5.2.0 users should upgrade to 5.2.1 as soon as possible.

Go to Issues Fixed In CDH 5.2.1.

“POODLE” Vulnerability on TLS/SSL enabled ports
Apache Hadoop Distributed Cache Vulnerability

“POODLE” Vulnerability on TLS/SSL enabled ports

The POODLE (Padding Oracle On Downgraded Legacy Encryption) attack forces the use of the obsolete SSLv3 protocol and then exploits a cryptographic flaw in SSLv3. The only solution is to disable SSLv3 entirely. This requires changes across a wide variety of components of CDH and Cloudera Manager in 5.2.0 and all earlier versions. CDH 5.2.1 provides these changes for CDH 5.2.0 deployments. For more information, see the Cloudera Security Bulletin.
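For client code that connects to TLS/SSL-enabled ports, the same principle of refusing SSLv3 can be applied on the client side. The sketch below is an illustration using Python's standard ssl module and is not part of the CDH 5.2.1 changes; the function name is hypothetical, and it sets a TLS 1.2 floor, which is stricter than only disabling SSLv3.

```python
import ssl


def client_context_without_sslv3():
    """Create a client TLS context that will not negotiate SSLv3."""
    context = ssl.create_default_context()
    # Any protocol older than TLS 1.2, including SSLv3, is refused.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    ctx = client_context_without_sslv3()
    print(ctx.minimum_version)
```

A context built this way can be passed to standard library clients that accept an SSLContext, so a downgrade to SSLv3 fails during the handshake instead of succeeding silently.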

Apache Hadoop Distributed Cache Vulnerability

The Distributed Cache Vulnerability allows a malicious cluster user to expose private files owned by the user running the YARN NodeManager process. For more information, see the Cloudera Security Bulletin.

What's New in CDH 5.2.3

This is a maintenance release that fixes some important issues; for details, see Issues Fixed in CDH 5.2.3.

What's New in CDH 5.2.4

This is a maintenance release that fixes some important issues; for details, see Issues Fixed in CDH 5.2.4.

What's New in CDH 5.2.5

This is a maintenance release that fixes some important issues; for details, see Issues Fixed in CDH 5.2.5.

What's New in CDH 5.2.6

This is a maintenance release that fixes some important issues; for details, see Issues Fixed in CDH 5.2.6.
