New Features in CDH 6.1.0

Apache Accumulo

Running Apache Accumulo on top of a CDH 6.1.x cluster is supported. Apache Accumulo is shipped separately from CDH.

Apache Avro

There are no notable new features in this release.

Apache Crunch

There are no notable new features in this release.

Apache Flume

The following list shows what's new and changed in Apache Flume for CDH 6.1.0:

Apache Hadoop

Hadoop Common

There are no notable new features in this release.

HDFS

ADLS Gen2

CDH supports using ADLS Gen2 as a storage layer for MapReduce, Hive on MapReduce, Hive on Spark, Spark, Oozie, and Impala.

For more information, see the ADLS Gen2 documentation. For information about configuring CDH and ADLS Gen2, see Configuring ADLS Gen2 Connectivity.

Google Cloud Storage

CDH supports using Google Cloud Storage (GCS) as a storage layer for Hive, MapReduce, and Spark. To use GCS, you must download the connector and distribute it to your cluster. For more information about how to do this and limitations, see Configuring Google Cloud Storage Connectivity.

CacheReplicationMonitor

You can now disable the CacheReplicationMonitor by setting dfs.namenode.caching.enabled in the Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml. To maintain backward compatibility, the default value is true, which keeps caching enabled. To disable the CacheReplicationMonitor, set the value to false.
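As a rough illustration, the snippet to paste into the safety valve can be rendered as a standard Hadoop <property> element. The sketch below only builds that XML text; the helper name safety_valve_property is hypothetical and is not part of CDH or Cloudera Manager.

```python
import xml.etree.ElementTree as ET


def safety_valve_property(name: str, value: str) -> str:
    """Render one Hadoop-style <property> element as an XML string.

    Hypothetical helper for illustration only.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Property name and value come from the paragraph above.
snippet = safety_valve_property("dfs.namenode.caching.enabled", "false")
print(snippet)
```

The printed element is what would be added to the hdfs-site.xml safety valve; removing it restores the default value of true.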

Erasure Coding

CDH 6.1.0 supports Erasure Coding (EC). EC is an alternative to the standard 3x replication that HDFS uses by default. When an HDFS cluster uses EC, no additional direct copies of the data are generated. Instead, data is striped into blocks and encoded to generate parity blocks. If there are any missing or corrupt blocks, HDFS uses the remaining data and parity blocks to reconstruct the missing pieces in the background. This process provides a similar level of data durability to 3x replication but at a lower storage cost.

For more information, see Data Durability.
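The storage saving mentioned above can be made concrete with a little arithmetic. The sketch below compares 3x replication with a Reed-Solomon layout of 6 data blocks and 3 parity blocks; that layout is an assumed example for illustration, and the policies available on a given cluster may differ.

```python
def storage_overhead(data_units: int, redundancy_units: int) -> float:
    """Extra raw storage needed per unit of user data, as a fraction."""
    return redundancy_units / data_units


# 3x replication: one original block plus two additional copies.
replication_3x = storage_overhead(data_units=1, redundancy_units=2)

# Reed-Solomon with 6 data blocks and 3 parity blocks (assumed example layout).
rs_6_3 = storage_overhead(data_units=6, redundancy_units=3)

print(f"3x replication overhead: {replication_3x:.0%}")
print(f"RS(6,3) overhead: {rs_6_3:.0%}")
```

Under these assumptions, replication stores two extra blocks for every data block, while the erasure-coded layout stores three parity blocks for every six data blocks.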

Snapshots

You can now enable immutable snapshots for HDFS with Cloudera Manager. Enabling this feature also enables snapshot diff-based copy listing for BDR.

In the Cloudera Manager Admin Console, navigate to Clusters > <HDFS cluster> > Configuration and search for Enable Immutable Snapshots.

This feature is off by default.

MapReduce

There are no notable new features in this release.

YARN

There are no notable new features in this release.

Apache HBase

Replication Status of WAL Groups in Web UI

New sections are added to the Web UI to show the status of replication:
  • Peers: Shows all replication peers and some of their configuration, including peer ID, cluster key, state, bandwidth, the size of the current log, the log queue size, the replication offset, and which namespaces or tables each peer replicates.
  • Replication status of all Region Servers: Shows the replication delay, including the AgeOfLastShippedOp, SizeOfLogQueue, and ReplicationLag for each Region Server.

If the replication offset shows -1 and the replication delay is UNKNOWN, replication has not started. There are two common reasons for this: the peer is disabled or the replicationEndpoint is sleeping.

Default Behavior Change

HBASE-20856: By default, the meta WAL provider (hbase.wal.meta_provider) is set to the same value as the normal WAL provider (hbase.wal.provider).

Apache Hive / Hive on Spark / HCatalog

Apache Hive

The following are some of the notable new features in this release of Hive:

Erasure Coding Support

You can now use Erasure Coding (EC) with your infrequently accessed Hive tables and partitions. Learn how to plan, evaluate, and activate Erasure Coding in our Best Practices for Using Hive with Erasure Coding guide.

Query Plan Graph View for Hive Web UI

You can now view your query plans on an informative and visual graph. Learn how to activate the graph view, start understanding your query plans, follow the MapReduce progress bar, and pinpoint errors easily with the Query Plan Graph View for Hive Web UI.

Fine Grained Privileges

Sentry and Hive introduced fine grained privileges to provide object-level privileges to roles.

Fine grained privileges add the CREATE privilege, which allows users to create databases and tables. See the Sentry Privileges documentation for more information about the new privileges.

Object Ownership

Object ownership designates an owner for a database, table, or view in Sentry. The owner of an object has the equivalent of the ALL privilege on the object. See the Object Ownership documentation for information about enabling object ownership.

With the new object ownership feature, HMS stores the user that creates a table or database in Hive as the object owner, whether or not object ownership is enabled. Previously, HMS stored the hive user as the object owner. If object ownership is enabled, Sentry also grants the creating user the OWNER privilege.

The following statements were added to Hive to support object ownership via Sentry:

  • ALTER DATABASE SET OWNER
  • ALTER TABLE SET OWNER
  • SHOW GRANT USER

Hive on Spark

There are no notable new features in this release.

Hue

The following are some of the notable new features in this release of Hue:

  • Language Reference built-in, Column Sampling, black theme for Editor
  • Simplifying the end user Data Catalog search
  • Improved SQL Exploration

For more information, see http://gethue.com/additional-sql-improvements-in-hue-4-3/.

Apache Impala

Enhancements in Authorization

Fine-grained Privileges

Sentry and Impala introduced fine-grained privileges to provide object-level privileges to roles.

Fine-grained privileges include the REFRESH and CREATE privileges, which allow users to create databases and tables, and to execute commands that update metadata information on Impala databases and tables. See Impala Sentry documentation for the new privileges and the scopes of the objects that you can grant the new privileges on.

The following new privileges were added:

  • The REFRESH privilege
  • The CREATE privilege
  • The SELECT and INSERT privileges on SERVER

If a role has SELECT or INSERT privilege on an object in Impala before upgrading to CDH 6.1, that role will automatically get the REFRESH privilege during the upgrade.
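The upgrade rule above can be pictured as a small set transformation. This is only a sketch of the stated rule for a single role and object; the function name is hypothetical, and it does not reflect how Sentry stores or migrates privileges internally.

```python
from typing import Set


def privileges_after_upgrade(privileges: Set[str]) -> Set[str]:
    """Return a role's privileges on one object after upgrading to CDH 6.1.

    Illustrative only: adds REFRESH when SELECT or INSERT is already held.
    """
    upgraded = set(privileges)
    if upgraded & {"SELECT", "INSERT"}:
        upgraded.add("REFRESH")
    return upgraded


print(sorted(privileges_after_upgrade({"SELECT"})))
print(sorted(privileges_after_upgrade(set())))
```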

Object Ownership

Object ownership designates an owner for a database, table, or view in Sentry. The owner of an object has the OWNER privilege, which is the equivalent of the ALL privilege on the object. See Object Ownership for information about enabling object ownership.

If the object ownership feature is enabled, Sentry grants the user the OWNER privilege. Whether or not object ownership is enabled, HMS stores the object creator as the default object owner. Previously, HMS stored the Kerberos user as the object owner.

The following statements were added in Impala to support object ownership via Sentry:

  • ALTER DATABASE SET OWNER
  • ALTER TABLE SET OWNER
  • ALTER VIEW SET OWNER
  • SHOW GRANT USER

Enhancements in Admission Control and Resource Management

The following is a list of noteworthy improvements made in Impala in resource management and admission control.

  • Starting in CDH 6.1 / Impala 3.1, Impala automatically chooses how much memory to give a query based on the memory estimate from the planner and bounded by the min/max guardrails that you configure for resource pools. In previous versions, you were required to set a single memory limit (via the mem_limit setting) per resource pool.

    The following new resource pool settings can be configured in Cloudera Manager or in the admission control configuration files:

    • Minimum Query Memory Limit (min-query-mem-limit)
    • Maximum Query Memory Limit (max-query-mem-limit)
    • Clamp MEM_LIMIT Query Option (clamp-mem-limit-query-option)

    See Admission Control and Query Queuing for detailed information.

  • Improvements to prevent scan operators running out of memory in many more scenarios.
  • Many improvements for more accurate memory estimates in admission control.
  • New query options to reject complex queries.

    The options are enforced by admission control based on planner resource requirements and the schedule.
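The automatic memory limit described in the first item above amounts to clamping the planner's estimate between the pool's minimum and maximum. The sketch below shows only that clamp, with all values in bytes. The function name is hypothetical, and real admission control takes more inputs into account (for example, the Clamp MEM_LIMIT Query Option setting).

```python
def choose_query_mem_limit(estimate: int, min_limit: int, max_limit: int) -> int:
    """Bound a planner memory estimate by resource pool guardrails.

    Simplified illustration; not Impala's actual implementation.
    """
    if min_limit > max_limit:
        raise ValueError("min_limit must not exceed max_limit")
    return max(min_limit, min(estimate, max_limit))


GB = 1024 ** 3

# An estimate below the minimum is raised to the minimum.
print(choose_query_mem_limit(512 * 1024 ** 2, 1 * GB, 8 * GB) // GB)
# An estimate above the maximum is capped at the maximum.
print(choose_query_mem_limit(20 * GB, 1 * GB, 8 * GB) // GB)
```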

IANA Time Zones Support

You can now customize the time zone database that Impala uses with a well-known source, such as IANA.
  • The --hdfs_zone_info_zip startup flag specifies the path to a zip archive that contains the IANA time zone database. The default location of the time zone database is the /usr/share/zoneinfo folder. See Customizing Time Zones for more information and the steps to set the flag.
  • The --hdfs_zone_alias_conf startup flag specifies the path to a configuration file that contains definitions for non-standard timezone aliases. See Customizing Time Zones for more information and the steps to set the flag.
  • The new TIMEZONE query option (CDH 6.1 / Impala 3.1 or higher only) defines the local time zone to be used for conversions between UTC and the local time. By default, the coordinator node’s time zone is used as the local time zone. See TIMESTAMP Data Type for details.

General Performance Improvements

A new query option, SHUFFLE_DISTINCT_EXPRS, controls the shuffling behavior when a query has both grouping and distinct expressions.

Metadata Performance Improvements

  • Incremental Stats

    The following enhancements improve Impala stability. They reduce the chances of catalogd and impalad crashing because they run out of memory when using incremental stats.

    • Incremental stats are now compressed in memory in catalogd, reducing memory footprint in catalogd.
    • Incremental stats are fetched on demand from catalogd by impalad coordinators. This enhancement reduces memory footprint for impalad coordinators and statestored and also reduces network requirements to broadcast metadata.

    See impala_perf_stats.html#pull_incremental_statistics for details.

  • Automatic Invalidation of Metadata

    To keep the size of metadata bounded and to reduce the chances of catalogd cache running out of memory, this release introduces an automatic metadata invalidation feature with time-based and memory-based invalidation.

    Automatic invalidation of metadata provides more stability with lower chances of running out of memory, but it can affect performance. The feature is turned off by default.

    See impala_config_options.html#auto_invalidate_metadata for details.

Compatibility and Usability Enhancements

  • Additional separators are supported between date and time in default TIMESTAMP format, specifically, the multi-space separator and the 'T' separator. See TIMESTAMP Data Type for more information on TIMESTAMP format.

  • New hint placement is supported for INSERT statements. See Optimizer Hints in Impala for details.

  • The REGEX_ESCAPE() function was implemented for escaping special characters to treat them literally in string literals.

  • SHOW CREATE VIEW was implemented with the same functionality as SHOW CREATE TABLE.

  • The SHUTDOWN statement was implemented for a graceful shutdown of Impala.
  • A query can contain multiple DISTINCT operators.
  • Impala Shell can connect directly to impalad when configured with proxy load balancer and Kerberos. See impala-shell Configuration Options for the new flag that enables the direct connection.
  • Impala can read and write data in Azure Data Lake Storage Gen2.

    By default, TLS is enabled when ADLS Gen2 is accessed via HTTP and HTTPS.

Apache Kafka

The following are some of the notable new features in this release of Kafka in CDH 6.1.0.

Rebase on Apache Kafka 2.0.0

The Kafka version in CDH 6.1.0 is based on Apache Kafka 2.0.0.

Apache Kafka 2.0.0 provides a number of improvements including:
  • An improved replication protocol that lessens log divergence between leader and follower during fast leader failover.
  • An improved and reworked controller.
  • Support for more partitions per cluster.
  • Incremental fetch requests, which improve replication for large partitions.
  • A new configuration option for the Kafka consumer to avoid indefinite blocking.

For upstream release notes, see Apache Kafka version 1.0.2, 1.1.0, 1.1.1, and 2.0.0 release notes.

New Metrics

Many new metrics are introduced for Kafka. The following list is only a summary. For the full list of metrics, see Metrics Reference.

Broker Metrics related to the following:

  • Controller State
  • Global Partition Count
  • Global Topic Count
  • Kafka Log Cleaner
  • Auto Leader Balance Rate and Time
  • Controlled Shutdown Rate and Time
  • Controller Change Rate and Time
  • ISR Change Rate and Time
  • Leader and ISR Response Received Rate and Time
  • Log Dir Change Rate and Time
  • Manual Leader Balance Rate and Time
  • Partition Reassignment Rate and Time
  • Topic Change Rate and Time
  • Topic Deletion Rate and Time

Broker Topic Metrics related to the following:

  • Fetch Message Conversion
  • Produce Message Conversion
  • Incoming Replication Rate
  • Outgoing Replication Rate
  • Total Fetch Requests per Second
  • Total Produce Requests per Second

Replica Metrics related to the following:

  • Failed ISR Updates
  • Offline Replica Count
  • Under Min ISR Partition Count

JBOD Support

As of CDH 6.1.0, Cloudera officially supports Kafka clusters with nodes using JBOD configurations.

JBOD support introduces a new command line tool and improves an existing tool:
  • A new tool, kafka-log-dirs, is added. The tool allows users to query partition assignment information.
  • The kafka-reassign-partitions tool is expanded with new functionality that allows users to reassign partitions between log directories. Users can move partitions to a different log directory on the same broker as well as to log directories on other brokers.
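For orientation, the reassignment file that kafka-reassign-partitions reads is JSON, and upstream Kafka accepts an optional log_dirs list alongside replicas, with one entry per replica (either an absolute path or the string "any"). The sketch below builds such a file in memory. The topic name, broker IDs, and directory path are hypothetical, and the exact format should be checked against the tool's help output for your version.

```python
import json
from typing import Dict, List


def reassignment_entry(topic: str, partition: int,
                       replicas: List[int], log_dirs: List[str]) -> Dict:
    """Build one partition entry; log_dirs must line up with replicas."""
    if len(replicas) != len(log_dirs):
        raise ValueError("log_dirs needs exactly one entry per replica")
    return {
        "topic": topic,
        "partition": partition,
        "replicas": replicas,
        "log_dirs": log_dirs,
    }


# Hypothetical values: place the replica on broker 1 in a specific log
# directory and let broker 2 pick any of its directories.
plan = {
    "version": 1,
    "partitions": [
        reassignment_entry("example-topic", 0, [1, 2], ["/data/kafka-a", "any"]),
    ],
}
print(json.dumps(plan, indent=2))
```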

Security Improvements

Dependencies for third-party libraries containing security vulnerabilities are updated. Kafka in CDH 6.1.0 is shipped with third-party libraries that do not contain any known security vulnerabilities.

The properties required for enabling remote JMX authentication on Kafka brokers are available in Cloudera Manager. Users are no longer required to carry out setup through a command line interface.

Default Behavior Changes

KAFKA-7050: The default value for request.timeout.ms is decreased to 30 seconds. In addition, new logic is added that makes JoinGroup requests ignore this timeout.

Apache Kudu

The following list describes new features in Apache Kudu for CDH 6.1.0:
  • Examples showcasing functionality in C++, Java, and Python, previously hosted in a separate repository, have been added. They can be found in the examples/ top-level subdirectory.
  • Added kudu diagnose parse_stacks, a tool to parse sampled stack traces out of a diagnostics log. See KUDU-2353.
  • Added support for IS NULL and IS NOT NULL predicates to the Kudu Python client. See KUDU-2399.
  • Introduced a manual data rebalancer into the Kudu CLI tool. The rebalancer can be used to redistribute table replicas among tablet servers. The rebalancer can be run via the kudu cluster rebalance sub-command. Using the new tool, it's possible to rebalance Kudu clusters of version 1.4.0 and newer.
  • Added kudu tserver get_flags and kudu master get_flags, two tools that allow superusers to retrieve all the values of command line flags from remote Kudu processes. The get_flags tools support filtering the returned flags by tag, and by default will return only flags that were explicitly set.
  • Added kudu tablet unsafe_replace_tablet, a tool to replace a tablet with a new one. This tool is meant to be used to recover a table when one of its tablets has permanently lost all replicas. The data in the tablet that is replaced is lost, so this tool should only be used as a last resort. See KUDU-2290.

The following list describes optimizations and improvements in Apache Kudu for CDH 6.1.0:

  • There is a new metric for each tablet replica tracking the number of election failures since the last successful election attempt and the time since the last heartbeat from the leader. See KUDU-2287.
  • Kudu now supports building and running on Ubuntu 18.04 (“Bionic Beaver”). See KUDU-2427.
  • Kudu now supports building and running against OpenSSL 1.1. See KUDU-1889.
  • Added Kerberos support to the Kudu Flume sink. See KUDU-2012.
  • The Kudu Spark connector now supports Spark Streaming DataFrames. See KUDU-2539.
  • Added -tables filtering argument to kudu table list. See KUDU-2529.
  • Clients now support setting a limit on the number of returned rows in scans. See KUDU-16.
  • Added Pandas support to the Python client. See KUDU-1276.
  • Enabled configuration of mutation buffer in the Python client. See KUDU-2441.
  • Added a keepAlive API call to the KuduScanner and AsyncKuduScanner in the Java client. This API can be used to keep the scanners alive on the server when processing of messages will take longer than the scanner TTL. See KUDU-2095.
  • The Kudu Spark integration now uses the keepAlive API when reading data. By default it will call keepAlive on a scanner with a period of 15 seconds. This will ensure that Spark jobs with large batch sizes or slow processing times do not fail with scanner not found errors. See KUDU-2563.
  • Number of reactor threads in the C++ client is now configurable. See KUDU-2368.
  • Added an optimization to avoid bottlenecks on getpwuid_r() in libnss during a Raft leader election storm. See KUDU-2395.
  • Improved rowset tree pruning for scans with open-ended intervals on the primary key. See KUDU-2566.
  • The kudu perf loadgen tool now supports generating range-partitioned tables. The -table_num_buckets configuration is now removed in favor of -table_num_hash_partitions and -table_num_range_partitions. See KUDU-1861.
  • CFile checksum failures will now cause the affected tablet replicas to be failed and re-replicated elsewhere. See KUDU-2469.
  • Servers are now able to start up with data directories missing on disk. See KUDU-2359.
  • The kudu perf loadgen tool now creates tables with a period-separated database name, for example: default.loadgen_auto_abc123. This new behavior does not take effect if the --table flag is provided. The database of the table can be changed using a new --auto_database flag. This change is made in anticipation of an eventual Kudu/HMS integration. See KUDU-2191.
  • Introduced FAILED_UNRECOVERABLE replica health status. This is to mark replicas which are not able to catch up with the leader due to GC-collected segments of WAL and other unrecoverable cases like disk failure. With that, the replica management scheme becomes hybrid: the system evicts replicas with FAILED_UNRECOVERABLE health status before adding a replacement if it anticipates that it can commit the transaction, while in other cases it first adds a non-voter replica and removes the failed one only after promoting a newly-added replica to voter role.
  • Two additional configuration parameters, socketReadTimeoutMs and ScanRequestTimeout, have been added to the Spark connector to allow better tuning to avoid scan timeouts under high load.
  • The kudu table tool now supports two new options to rename tables and columns: rename_table and rename_column.
  • Kudu will now wait for the clock to become synchronized at startup, controlled by the new flag -ntp_initial_sync_wait_secs. See KUDU-2242.
  • Tablet deletions are now throttled, which will help Kudu clusters remain stable even when many tablets are deleted at once. The number of tablets that a tablet server will delete at once is controlled by the new flag -num_tablets_to_delete_simultaneously. See KUDU-2289.
  • The kudu cluster ksck tool has been significantly enhanced. It now checks master health and consensus status, displays any unsafe or hidden flags set in the cluster, and produces a summary of the Kudu versions running on the master and tablet servers. In addition, it now supports JSON output, both in pretty-printed and compact form. The output format is controlled by the -ksck_format flag.

Apache Oozie

There are no notable new features in this release.

Apache Parquet

There are no notable new features in this release.

Apache Pig

There are no notable new features in this release.

Cloudera Search

In CDH 6.1, Cloudera Search is rebased on Apache Solr 7.4, which has added new features since the 7.0 version of Apache Solr used in the CDH 6.0 release.

Some features included in Apache Solr 7.4 are not supported in Cloudera Search in CDH 6.1. For more information, see Cloudera Search Unsupported Features.

For detailed information on the new features added in Solr 7.4, see the Apache Solr 7.4 Release Notes.

Changes in Configuration Structure

  • The top-level <highlighting> element in solrconfig.xml is now officially deprecated in favor of the equivalent <searchComponent> syntax. This element has been out of use in default Solr installations for several releases already.
  • Shard and cluster metric reporter configuration now require a class attribute.
    • If a reporter configures the group="shard" attribute, also configure the class="org.apache.solr.metrics.reporters.solr.SolrShardReporter" attribute.
    • If a reporter configures the group="cluster" attribute, also configure the class="org.apache.solr.metrics.reporters.solr.SolrClusterReporter" attribute.

    See Shard and Cluster Reporters in the Apache Solr Reference Guide for more information.

Changes in Default Configuration Values

  • The default value of autoReplicaFailoverWaitAfterExpiration, used with the AutoAddReplicas feature, has increased to 120 seconds from the previous default of 30 seconds. This affects how soon Solr adds new replicas to replace the replicas on nodes that have either crashed or shut down.
  • The default Solr log file size has been raised to 32MB, and the number of backups is now 10. See Configuring Logging for more information on how to change this default logging configuration.
  • The eDisMax parser, by default, doesn't allow subqueries that specify a Solr parser using either local parameters, or the older _query_ magic field trick.

    For example, {!prefix f=myfield v=enterp} or _query_:"{!prefix f=myfield v=enterp}" are not supported by default. If you want to allow power-users to do this, set uf=* _query_ or some other value that includes _query_.

    If you need full backwards compatibility for the time being, use luceneMatchVersion=7.1.0 or an earlier version.

  • In the XML query parser (defType=xmlparser or {!xmlparser …​ }), the resolving of external entities is now disallowed by default.
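To illustrate the eDisMax change above, the sketch below assembles request parameters for a query that embeds another parser through the _query_ field, and opts in by passing a uf value that includes _query_. It only builds and URL-encodes the parameters; the helper name is hypothetical, and no Solr server is contacted.

```python
from typing import Dict
from urllib.parse import urlencode


def edismax_params(user_query: str, allow_embedded_parsers: bool) -> Dict[str, str]:
    """Build eDisMax request parameters (illustrative helper, not a Solr API)."""
    params = {"defType": "edismax", "q": user_query}
    if allow_embedded_parsers:
        # Allow all user fields plus the _query_ magic field.
        params["uf"] = "* _query_"
    return params


# Query string taken from the example in the text above.
query = '_query_:"{!prefix f=myfield v=enterp}"'
print(urlencode(edismax_params(query, allow_embedded_parsers=True)))
```

Leaving uf unset keeps the new, more restrictive default.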

Changes in Default Behavior

  • Configuring slowQueryThresholdMillis now logs slow requests to a separate file named solr_slow_requests.log. Earlier, slow requests were logged in the solr.log file.
  • In the leader-follower model of scaling Solr, a follower no longer commits an empty index when a completely new index is detected on the leader during replication. To resume earlier behavior, pass false to skipCommitOnMasterVersionZero in the follower section of replication handler configuration, or pass it to the fetchindex command.
  • Collections created without specifying a configset name use a copy of the _default configset since Solr 7.0. Before 7.3, the copied configset was named the same as the collection name. From 7.3 onwards, it is named with a new ".AUTOCREATED" suffix to prevent overwriting custom configset names.
  • The rq parameter used with Learning to Rank rerank query parsing no longer considers the defType parameter. See Running a Rerank Query for more information about this parameter.
  • Replicas that are not up-to-date are no longer allowed to become leader. Use the FORCELEADER command of the Collections API to allow these replicas to become leader.
  • The autoscaling system now pauses all triggers from execution between the start of actions and the end of a cool down period. The triggers resume after the cool down period expires. Earlier, the cool down period was a fixed period that started after the actions for a trigger event completed. During this time, all triggers continued to run, but any events were rejected and tried later.
  • Starting a query string with local parameters {!myparser …} is used to switch from one query parser to another. It is intended to be used by Solr system developers, not end users doing searches. To reduce negative side-effects of unintended hackability, Solr now limits the cases when local parameters are parsed to contexts in which the default parser is "lucene" or "func".
    • If defType=edismax, q={!myparser …​} doesn’t work. In this example, put the desired query parser into the defType parameter.
    • If defType=edismax, hl.q={!myparser …} doesn’t work. In this example, either put the desired query parser into the hl.qparser parameter or set hl.qparser=lucene.
  • The feature to add replicas automatically if a replica goes down was earlier available only when storing indexes in HDFS. It has been ported to the autoscaling framework, and AutoAddReplicas is now available to all users even if their indexes are on local disks.
  • Changing the autoAddReplicas property from disabled (false) to enabled (true) using MODIFYCOLLECTION API no longer replaces down replicas for the collection immediately. Instead, replicas are only added if a node containing them went down while AutoAddReplicas was enabled. The parameters autoReplicaFailoverBadNodeExpiration and autoReplicaFailoverWorkLoopDelay are no longer used.
  • All Stream Evaluators in solrj.io.eval have been refactored to have a simpler and more robust structure. This simplifies and condenses the code required to implement a new Evaluator and makes it much easier for evaluators to handle different data types (primitives, objects, arrays, lists, and so forth).

Apache Sentry

The following new features have been added to Apache Sentry in CDH 6.1.0:

Fine Grained Privileges

The CREATE and REFRESH (Impala only) privileges have been introduced to allow users to create databases, tables and functions, and to execute commands that update metadata information on Impala databases and tables.

For more information about the new privileges, see Sentry Privileges.

Object Ownership

Object ownership designates an owner for a database, table, or view in Sentry. The owner of an object has the equivalent of the ALL privilege on the object.

In CDH 6.1.0, object ownership is enabled by default with a new CDH installation. For information about enabling and using object ownership, see Object Ownership.

No Group Name Case Restrictions

Sentry no longer normalizes group name characters to be lowercase. Therefore, operating system group names do not need to be treated as case insensitive. In previous versions, Sentry modified capital letters in operating system group names to be lowercase.

Apache Spark

The following list describes what's new and changed in Apache Spark for CDH 6.1.0, which is based on the Apache Spark 2.4 upstream version:

Apache Sqoop

The following new features have been added to Apache Sqoop in CDH 6.1.0:

Incremental Import NULL Column Updates into HBase

This feature implements the --hbase-null-incremental-mode option for the sqoop-import tool, which allows users to specify how NULL column updates are handled during incremental imports. For more information, see Importing Data Into HBase Using Sqoop.

Automatic Compile Directory Clearing

This feature implements the --delete-compile-dir option for the sqoop-import tool, which enables users to automatically delete the generated class and jar files from the disk after the job finishes.

By default all temporary files generated by the ClassWriter are left behind on disk in the /tmp/sqoop-username/compile directory. Because the table schema can be extracted from these files, Cloudera recommends that you use the --delete-compile-dir option to delete these files.

Parquet Hadoop API Based Implementation for Importing Data Into Parquet Format

Support for the Hadoop API based implementation for importing data into Parquet has been added. This feature implements a new option, --parquet-configurator-implementation, which allows users to specify which implementation is used for importing data into Parquet files. For more information, see Importing Data into Parquet Format Using Sqoop.

HiveServer2 Support

This feature implements support for importing data into Hive through HiveServer2.

This feature adds three new options to the sqoop import tool:
  • --hs2-url
  • --hs2-user
  • --hs2-keytab

The feature does not introduce any changes to the default behavior of Hive imports. When the user specifies the --hs2-url option, commands are sent to HiveServer2 through a JDBC connection. The data itself is not transferred via the JDBC connection. It is written directly to HDFS and moved to the Hive warehouse using the LOAD DATA INPATH command just like in the case of the default Hive import.

HiveServer2 provides proper Sentry authorization. As a result, Cloudera recommends importing data into Hive through HiveServer2 instead of the default method. Currently, Sqoop can authenticate to HiveServer2 using Kerberos only.

For more information, see Importing Data into Hive with Sqoop Through HiveServer2.

Support for Import into Amazon S3

Sqoop now supports importing from an RDBMS into Amazon S3 by using the capabilities of the Hadoop-Amazon Web Services integration. For more information about the Hadoop-AWS module, see Hadoop-AWS module: Integration with Amazon Web Services.

Default Precision and Scale in Avro Import

Support has been added for specifying a default precision and scale to be used in the Avro schema when a table contains numeric data in Oracle, or numeric or decimal data in Postgres. This feature implements two new properties, sqoop.avro.logical_types.decimal.default.precision and sqoop.avro.logical_types.decimal.default.scale, to specify the default precision and scale. For more information about importing Avro data in Sqoop, see Importing Avro Data Files in Sqoop.

Behavior Changes

MS SQL Connector Concerning Connection Resets

The recovery logic of the MS-SQL connector proved to be unreliable; therefore, the default behavior was changed from resilient to non-resilient. In other words, the recovery logic is now turned off by default.

The recovery logic can be turned on with the --resilient option.

The --non-resilient option, which was previously used to turn the recovery logic off, is now ignored.

The resilient operation of the MS-SQL connector requires the split-by column to contain unique values in ascending order only. Otherwise, using the --resilient option can lead to duplicate or missing records in the output.

Examples

Importing from a table:
sqoop import ... --table custom_table --split-by id -- --resilient
Importing via a query:
sqoop import ... --query "SELECT ... WHERE $CONDITIONS" --split-by ordered_column -- --resilient
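Because --resilient is only safe when the split-by column holds unique values in ascending order, it can be worth checking a sample of that column first. The sketch below is one way to express the check on values already fetched in table order. The function name and the sample data are hypothetical, and Sqoop itself does not perform this validation.

```python
from typing import Iterable


def is_strictly_ascending(values: Iterable[int]) -> bool:
    """Return True if every value is greater than the one before it.

    Strictly ascending implies both "unique" and "in ascending order".
    """
    previous = None
    for value in values:
        if previous is not None and value <= previous:
            return False
        previous = value
    return True


# Hypothetical samples of a split-by column, in the order rows are returned.
print(is_strictly_ascending([1, 2, 5, 9]))   # unique and ascending
print(is_strictly_ascending([1, 2, 2, 9]))   # duplicate value
print(is_strictly_ascending([1, 5, 3, 9]))   # out of order
```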

Apache ZooKeeper

The following list shows what's new and changed in Apache ZooKeeper for CDH 6.1.0:
  • A new metric is available to let you monitor the size of generated responses to see how to set the client's jute.maxbuffer property correctly. For more information, see ZOOKEEPER-2940.
  • A new metric is available to track the number of slow fsyncs. For more information, see ZOOKEEPER-3019.
  • A tool has been added to recover log and snapshot entries with CRC errors. For more information, see ZOOKEEPER-2994.