What's New In CDH 5.2.x
What's New in CDH 5.2.0
CDH 5.2.0 is a minor release which includes new features and bug fixes.
New Features and Changes
Operating System Support
CDH 5.2.0 adds support for Ubuntu Trusty (version 14.04). See CDH and Cloudera Manager Supported Operating Systems.
- AVRO-1398: Increase default sync interval from 16k to 64k. There is a very small chance this could cause an incompatibility in some cases, but you can control the interval by setting avro.mapred.sync.interval in the MapReduce job configuration. For example, set it to 16000 to get the old behavior.
- AVRO-1355: Record schema should reject duplicate field names. This change rejects schemas with duplicate field names. This could affect some applications, but if schemas have duplicate field names then they are unlikely to work properly in any case. The workaround is to make sure a record's field names are unique within the record.
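The workaround for AVRO-1355 can be checked before a schema ever reaches Avro. The following is a minimal sketch, not part of Avro itself: the helper name duplicate_field_names and the sample User schema are hypothetical, and the check only looks at the top-level fields of a single record schema.

```python
import json
from collections import Counter


def duplicate_field_names(schema_json):
    """Return the field names that occur more than once in a record schema."""
    schema = json.loads(schema_json)
    names = [field["name"] for field in schema.get("fields", [])]
    counts = Counter(names)
    # dict.fromkeys keeps first-seen order, so the report is stable.
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


# Hypothetical schema that repeats the "id" field.
bad_schema = json.dumps({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        {"name": "id", "type": "string"},
    ],
})

print(duplicate_field_names(bad_schema))
```

A schema for which this returns a non-empty list would be rejected after this change, so renaming the reported fields is the fix.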
CDH 5.2 provides the following new capabilities:
- HDFS Data at Rest Encryption: HDFS now implements transparent, end-to-end encryption of data read from and written to HDFS by creating encryption zones. An encryption zone is a directory in HDFS with all of its contents, that is, every file and subdirectory in it, encrypted. For more details, see HDFS Transparent Encryption.
- Extended attributes: HDFS XAttrs allow extended attributes to be stored per file (https://issues.apache.org/jira/browse/HDFS-2006).
- Authentication improvements when using an HTTP proxy server.
- A new Hadoop Metrics sink that allows writing directly to Graphite.
- Specification for Hadoop Compatible Filesystem effort.
- OfflineImageViewer to browse an fsimage via the WebHDFS API.
- Supportability improvements and bug fixes to the NFS gateway.
CDH 5.2 provides an optimized implementation of the mapper side of the MapReduce shuffle. The optimized implementation may require tuning different from the original implementation, and so it is considered experimental and is not enabled by default.
You can select this new implementation on a per-job basis by setting the job configuration value mapreduce.job.map.output.collector.class to org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator, or use Cloudera Manager to enable it.
Some jobs which use custom writable types or comparators may not be able to take advantage of the optimized implementation.
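One way to set the per-job value is a -D option on the hadoop jar command line. The sketch below only assembles that command; it assumes the job's driver parses generic options (for example through ToolRunner), and the jar name, driver class, and paths are hypothetical.

```python
COLLECTOR_KEY = "mapreduce.job.map.output.collector.class"
NATIVE_COLLECTOR = (
    "org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator"
)


def hadoop_jar_command(jar, main_class, job_args, use_native_collector=False):
    """Build an argv list for `hadoop jar`, optionally selecting the native collector."""
    command = ["hadoop", "jar", jar, main_class]
    if use_native_collector:
        # Generic -D options go before the job's own arguments.
        command += ["-D", f"{COLLECTOR_KEY}={NATIVE_COLLECTOR}"]
    return command + list(job_args)


# Hypothetical jar, driver class, and input/output paths.
cmd = hadoop_jar_command(
    "wordcount.jar", "com.example.WordCount", ["/in", "/out"],
    use_native_collector=True,
)
print(" ".join(cmd))
```

Keeping the option behind a flag makes it easy to run the same job with and without the experimental collector and compare the results.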
CDH 5.2 provides the following new capabilities and improvements:
- New features and improvements in the Fair Scheduler:
- New features:
- Fair Scheduler now allows setting the fairSharePreemptionThreshold per queue (leaf and non-leaf). This threshold is a decimal value between 0 and 1; if a queue's usage is under (preemption-threshold * fairshare) for a configured duration, resources from other queues are preempted to satisfy this queue's request. Set this value in fair-scheduler.xml. The default value is 0.5.
- Fair Scheduler now allows setting the fairSharePreemptionTimeout per queue (leaf and non-leaf). For a starved queue, this timeout determines when to trigger preemption from other queues. Set this value in fair-scheduler.xml.
- Fair Scheduler now shows the Steady Fair Share in the Web UI. The Steady Fair Share is the share of the cluster resources a particular queue or pool would get if all existing queues had running applications.
- Fair Scheduler uses Instantaneous Fair Share (the fair share computed over active queues only) for scheduling decisions, which shortens the time needed to reach steady state (fairshare).
- The default for maxAMShare is now 0.5, meaning that only half the cluster's resources can be taken up by Application Masters. You can change this value in fair-scheduler.xml.
- New features:
- YARN's REST APIs support submitting and killing applications.
- YARN's timeline store is integrated with Kerberos.
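The Fair Scheduler preemption rule described above (usage below preemption-threshold * fairshare for a configured duration) can be illustrated with a small calculation. This is a simplified sketch and not the scheduler's actual code: the function names are hypothetical, resources are reduced to a single number, and the timeout is passed in as plain seconds.

```python
def is_starved(usage, fair_share, threshold=0.5):
    """Return True when a queue's usage is below threshold * fair share."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return usage < threshold * fair_share


def should_preempt(usage, fair_share, starved_seconds, timeout_seconds,
                   threshold=0.5):
    """Preempt only once the queue has stayed starved for the full timeout."""
    return (is_starved(usage, fair_share, threshold)
            and starved_seconds >= timeout_seconds)


# Hypothetical queue with a fair share of 40 units and the default threshold.
print(is_starved(usage=15, fair_share=40))   # 15 < 0.5 * 40
print(is_starved(usage=25, fair_share=40))   # 25 >= 0.5 * 40
print(should_preempt(15, 40, starved_seconds=30, timeout_seconds=60))
```

With the default threshold of 0.5, a queue entitled to 40 units is treated as starved only while it holds fewer than 20, and even then preemption waits for the timeout to elapse.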
- Improvements in Scrunch, including:
- New join API that matches the one in Crunch
- New aggregation API, including support for Algebird-based aggregations
- Built-in serialization support for all tuple types as well as case classes.
- A new module, crunch-hive, for reading and writing Optimized Row Columnar (ORC) Files with Crunch.
CDH 5.2 provides the following new capabilities:
- Kafka Integration: Flume can now accept data from Kafka via the KafkaSource (FLUME-2250) and push to Kafka using the KafkaSink (FLUME-2251).
- Kite Sink can now write to Hive and HBase datasets (FLUME-2463).
- Flume agents can now be configured via ZooKeeper (experimental, FLUME-1491).
- Embedded Agents now support Interceptors (FLUME-2426).
- syslog Sources now support configuring which fields should be kept (FLUME-2438).
- File Channel replay is now much faster (FLUME-2450).
- New regular-expression search-and-replace interceptor (FLUME-2431).
- Backup checkpoints can be optionally compressed (FLUME-2401).
CDH 5.2 provides the following new capabilities:
- New application for editing Sentry roles and Privileges on databases and tables
- Search App
- Heatmap, Tree, Leaflet widgets
- Micro-analysis of fields
- Exclusion facets
- Oozie Dashboard: bulk actions, faster display
- File Browser: drag-and-drop upload, history, ACL editing
- Hive and Impala: LDAP pass-through, query expiration, TLS/SSL (Hive), new graphs
- Job Browser: YARN kill application button
CDH 5.2 implements HBase 0.98.6, which represents a minor upgrade to HBase. This upgrade introduces new features and moves some features which were previously marked as experimental to fully supported status. For detailed information and instructions on how to use the new capabilities, see New Features and Changes for HBase in CDH 5.
CDH 5.2 introduces the following important changes in Hive.
- CDH 5.2 implements Hive 0.13, providing the following new capabilities:
- Sub-queries in the WHERE clause
- Common table expressions (CTE)
- Parquet supports timestamp
- HiveServer2 can be configured with a hiverc file that is automatically run when users connect
- Permanent UDFs
- HiveServer2 session and operation timeouts
- Beeline accepts a -i option to initialize with a SQL file
- New join syntax (implicit joins)
- As of CDH 5.2.0, you can create Avro-backed tables simply by using STORED AS AVRO in a DDL statement. The AvroSerDe takes care of creating the appropriate Avro schema from the Hive table schema, making it much easier to use Avro with Hive.
- Hive supports additional datatypes, as follows:
- Hive can read char and varchar datatypes written by Hive, and char and varchar datatypes written by Impala.
- Impala can read char and varchar datatypes written by Hive and Impala.
- DESCRIBE DATABASE returns additional fields: owner_name and owner_type. The command will continue to behave as expected if you identify the field you're interested in by its (string) name, but could produce unexpected results if you use a numeric index to identify the field(s).
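The DESCRIBE DATABASE change is a reminder of why name-based field access is safer than positional access. The sketch below is illustrative only: the column orders, the sample row values, and the helper field_by_name are assumptions made for the example, not output captured from Hive.

```python
def field_by_name(columns, row, name):
    """Look up a value by column name instead of by position."""
    return row[columns.index(name)]


# Assumed column layouts before and after the owner fields were added.
old_columns = ["db_name", "comment", "location", "parameters"]
new_columns = ["db_name", "comment", "location",
               "owner_name", "owner_type", "parameters"]

# Hypothetical result rows for the same database.
old_row = ["sales", "Sales data", "hdfs://nn/warehouse/sales.db", ""]
new_row = ["sales", "Sales data", "hdfs://nn/warehouse/sales.db",
           "alice", "USER", ""]

# Position 3 silently changes meaning once new columns are inserted.
print(old_row[3] == new_row[3])

# Name-based lookup returns the same field in both layouts.
print(field_by_name(old_columns, old_row, "parameters")
      == field_by_name(new_columns, new_row, "parameters"))
```

Client code that reads the result by index is the case that could produce unexpected results after the upgrade.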
Impala in CDH 5.2.0 includes major new features such as spill-to-disk for memory-intensive queries, subquery enhancements, analytic functions, and new CHAR and VARCHAR data types. For the full feature list and more details, see What's New in Apache Impala (incubating).
Kite is an open source set of libraries, references, tutorials, and code samples for building data-oriented systems and applications. For more information about Kite, see the Kite SDK Development Guide.
Kite has been rebased to version 0.15.0 in CDH 5.2.0, from the base version 0.10.0 in CDH 5.1. kite-morphlines modules are backward-compatible, but this change breaks backward-compatibility for the kite-data API.
Changes from 0.15.0
The Kite version in CDH 5.2 is based on 0.15.0, but includes some newer changes. Specifically, it includes support for dataset namespaces, which can be used to set the database in the Hive Metastore.
The introduction of namespaces changed the file system repository layout; now there is an additional namespace directory for datasets stored in HDFS (repository/namespace/dataset/). There are no compatibility problems when you use Dataset URIs, but all datasets created with the DatasetRepository API will be located in a namespace directory. This new directory level is not expected in Kite 0.15.0 or 0.16.0 and will prevent the dataset from being loaded. The work-around is to switch to using Dataset URIs (see below) that include the namespace component. Existing datasets will work without modification.
Except as noted above, Kite 0.15.0 in CDH 5.2 is fully backward-compatible. It can load datasets written with any previous Kite version.
Datasets are identified with a single URI, rather than a repository URI and dataset name. The dataset URI contains all the information Kite needs to determine which implementation (Hive, HBase, or HDFS) to use for the dataset, and includes both the dataset's name and a namespace.
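To make the namespaced layout and the single-URI idea concrete, the sketch below derives both from the same three parts. The repository/namespace/dataset/ layout comes from the text above; the dataset:hdfs: URI shape, the helper names, and the sample values are assumptions for illustration and should be checked against the Kite documentation.

```python
import posixpath


def dataset_directory(repository_root, namespace, dataset):
    """Directory of a dataset under the repository/namespace/dataset/ layout."""
    return posixpath.join(repository_root, namespace, dataset) + "/"


def hdfs_dataset_uri(repository_root, namespace, dataset):
    """Assumed URI shape: dataset:hdfs:<repository path>/<namespace>/<dataset>."""
    return "dataset:hdfs:" + posixpath.join(repository_root, namespace, dataset)


# Hypothetical repository root, namespace, and dataset name.
print(dataset_directory("/data/repo", "default", "events"))
print(hdfs_dataset_uri("/data/repo", "default", "events"))
```

Because the namespace is part of the URI, a URI-based lookup resolves to the namespaced directory without the caller having to know the layout.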
The Kite API has been updated so that developers call methods in the Datasets utility class as they would use DatasetRepository methods. The Datasets methods are recommended, and the DatasetRepository API is deprecated.
The Kite data API now allows you to select a view of the dataset by setting constraints. These constraints are used by Kite to automatically prune unnecessary partitions and filter records.
MapReduce input and output formats
The kite-data-mapreduce module has been added. It provides both DatasetKeyInputFormat and DatasetKeyOutputFormat that allow you to run MapReduce jobs over datasets or views. Spark is also supported by the input and output formats.
Dataset CLI tool
Kite now includes a command-line utility that can run common maintenance tasks, like creating a dataset, migrating a dataset's schema, copying from one dataset to another, and importing CSV data. It also has helpers that can create Avro schemas from data files and other Kite-related configuration.
Mahout jobs launched from the bin/mahout script will now use the cluster's default parameters, rather than hard-coded parameters from the library. This may change the algorithms' run-time behavior, possibly for the better. (MAHOUT-1565.)
CDH 5.2 introduces the following important changes:
- A new Hive 2 Action allows Oozie to run HiveServer2 scripts. Using the Hive Action with HiveServer2 is now deprecated; you should switch to the new Hive 2 Action as soon as possible.
- The MapReduce action can now also be configured by Java code. This gives users the flexibility of using their own driver Java code for configuring the MR job, while also getting the advantages of the MapReduce action (instead of using the Java action). See the documentation for more info.
- The PurgeService is now able to remove completed child jobs from long-running coordinator jobs.
- ALL can now be set for oozie.service.LiteWorkflowStoreService.user.retry.error.code.ext to make Oozie retry actions automatically for every type of error.
- All Oozie servers in an Oozie HA group now synchronize on the same randomly generated rolling secret for signing auth tokens.
- You can now upgrade from CDH 4.x to CDH 5.2 and later with jobs in RUNNING and SUSPENDED states. (An upgrade from CDH 4.x to a CDH 5.x release earlier than CDH 5.2.0 would still require that no jobs be in either of those states.)
Apache Parquet (incubating)
CDH 5.2 Parquet is rebased on Parquet 1.5 and Parquet-format 2.1.0.
- Cloudera Search adds support for Spark indexing using the CrunchIndexerTool. For more information, see Spark Indexing Reference (CDH 5.2 and higher only).
- Cloudera Search adds fault tolerance for single-shard deployments. This fault tolerance is enabled with a new -a option in solrctl, which configures shards to automatically be re-added on an existing, healthy node if the node hosting the shard becomes unavailable.
- Components of Cloudera Search include Kite 0.15.0. This includes all morphlines-related backports of all fixes and features in Kite 0.17.0. For additional information on Kite, see:
- Search adds support for multi-threaded faceting on fields. This enables parallelizing operations, allowing them to run more quickly on highly concurrent hardware. This is especially helpful in cases where faceting operations apply to large datasets over many fields. For more information, see Tuning the Solr Server.
- Search adds support for distributed pivot faceting, enabling faceting on multi-shard collections.
Apache Sentry (incubating)
CDH 5.2 introduces the following changes to Sentry.
- If you are using the database-backed Sentry service, upgrading from CDH 5.1 to CDH 5.2 will require a schema upgrade. For instructions, see Upgrading CDH and Managed Services Using Cloudera Manager.
- Hive SQL Syntax:
- GRANT and REVOKE statements have been expanded to include WITH GRANT OPTION, thus allowing you to delegate granting and revoking privileges.
- The SHOW GRANT ROLE command has been updated to allow non-admin users to list grants for roles that are currently assigned to them.
- The SHOW ROLE GRANT GROUP <groupName> command has been updated to allow non-admin users that are part of the group specified by <groupName> to list all roles assigned to this group.
For more details on these changes, see the updated Hive SQL Syntax for Use with Sentry.
CDH 5.2 Spark is rebased on Apache Spark/Streaming 1.1 and provides the following new capabilities:
- Stability and performance improvements.
- New sort-based shuffle implementation (disabled by default).
- Better performance monitoring through the Spark UI.
- Support for arbitrary Hadoop InputFormats in PySpark.
- Improved YARN support with several bug fixes.
- Mainframe connector added.
- Parquet support added.
There are no changes for Sqoop 2.
What's New in CDH 5.2.1
CDH 5.2.1 is a maintenance release that fixes the “POODLE” and Apache Hadoop Distributed Cache vulnerabilities described below. All CDH 5.2.0 users should upgrade to 5.2.1 as soon as possible.
“POODLE” Vulnerability on TLS/SSL enabled ports
The POODLE (Padding Oracle On Downgraded Legacy Encryption) attack forces the use of the obsolete SSLv3 protocol and then exploits a cryptographic flaw in SSLv3. The only solution is to disable SSLv3 entirely. This requires changes across a wide variety of components of CDH and Cloudera Manager in 5.2.0 and all earlier versions. CDH 5.2.1 provides these changes for CDH 5.2.0 deployments. For more information, see the Cloudera Security Bulletin.
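Client code that connects to TLS-enabled ports can apply the same principle by refusing old protocol versions. The following is a minimal sketch using Python's standard ssl module; it is not how CDH 5.2.1 implements the fix, and the choice of TLS 1.2 as the floor is an assumption that also excludes protocols newer than SSLv3.

```python
import ssl


def tls_only_client_context():
    """Create a client context that never negotiates SSLv3."""
    context = ssl.create_default_context()
    # A TLS 1.2 floor rules out SSLv3 along with TLS 1.0 and 1.1.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


ctx = tls_only_client_context()
print(ctx.minimum_version.name)
```

A context built this way fails the handshake rather than downgrading, which is the behavior that defeats a forced fallback to SSLv3.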
Apache Hadoop Distributed Cache Vulnerability
The Distributed Cache Vulnerability allows a malicious cluster user to expose private files owned by the user running the YARN NodeManager process. For more information, see the Cloudera Security Bulletin.
What's New in CDH 5.2.3
This is a maintenance release that fixes some important issues; for details, see Issues Fixed in CDH 5.2.3.
What's New in CDH 5.2.4
This is a maintenance release that fixes some important issues; for details, see Issues Fixed in CDH 5.2.4.
What's New in CDH 5.2.5
This is a maintenance release that fixes some important issues; for details, see Issues Fixed in CDH 5.2.5.
What's New in CDH 5.2.6
This is a maintenance release that fixes some important issues; for details, see Issues Fixed in CDH 5.2.6.