Incompatible Changes in CDH 6.0.0

Apache Accumulo

Running Apache Accumulo on top of a CDH 6.0.0 cluster is not currently supported. If you try to upgrade to CDH 6.0.0, you will be asked to remove the Accumulo service from your cluster. Running Accumulo on top of CDH 6 will be supported in a future release.

Apache Avro

API Changes

One method was removed in CDH 6.0.0:
GenericData.toString(Object datum, StringBuilder buffer)

Incompatible Changes from Avro 1.8.0

  • Changes in logical types cause code generated in Avro with CDH 6 to differ from code generated in Avro with CDH 5. This means that old generated code will not necessarily work in CDH 6. Cloudera recommends that users regenerate their generated Avro code when upgrading.
  • AVRO-997: The Generic API now requires GenericEnumSymbol for enum fields. This is likely to break existing Generic API users, who often use String or Java Enum values for these fields.
  • AVRO-1502: Avro objects are now Serializable. IPC code needs to be regenerated and recompiled.
  • AVRO-1602: Avro's internal RPC tracing has been removed because it was presumed to be unused. The current recommendation is to use HTrace instead.
  • AVRO-1586: Avro is now compiled against Hadoop 2. This is unlikely to be an issue because Avro in CDH 5 was already compiled against Hadoop 2.
  • AVRO-1589: [Java] ReflectData.AllowNulls creates incompatible schemas for primitive types. Code affected by this issue used to fail at runtime but now fails earlier.

Cloudera Data Science Workbench

Cloudera Data Science Workbench (1.4.x and lower) is not currently supported with CDH 6.0.x. If you try to upgrade to CDH 6.0.x, you will be asked to remove the CDSW service from your cluster. Cloudera Data Science Workbench will be supported with CDH 6 in a future release.

Cloudera Issue: DSE-2769

Apache Crunch

The following changes are introduced in CDH 6.0.0, and are not backward compatible:

  • Crunch is available only as Maven artifacts from the Cloudera Maven repository. It is not included as part of CDH. For more information, see Apache Crunch Guide.
  • Crunch supports only Spark 2 and higher releases.
  • Crunch supports only HBase 2 and higher releases.
    • The API methods in Crunch-HBase use HBase 2 API types and methods.

Apache Flume

AsyncHBaseSink and HBaseSink

CDH 6 uses HBase 2.0. AsyncHBaseSink is incompatible with HBase 2.0 and is not supported in CDH 6. HBaseSink has been replaced with HBase2Sink. HBase2Sink works the same way as HBaseSink. The only difference is that it is compatible with HBase 2.0. No additional configuration is required when HBase2Sink is used, but you can replace the component type in your configuration.

For example, replace this text:
agent.sinks.my_hbase_sink.type = hbase
With this:
agent.sinks.my_hbase_sink.type = hbase2
Or, if you use the FQN of the sink class, replace this text:
agent.sinks.my_hbase_sink.type = org.apache.flume.sink.hbase.HBaseSink
With this:
agent.sinks.my_hbase_sink.type = org.apache.flume.sink.hbase2.HBase2Sink

For more information about how to configure HBase2Sink, see Importing Data Into HBase.

For more information about the use of legacy names, see Serializer Class Names.
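
The replacements shown above can be applied to an existing agent configuration with a small script. The following is a minimal sketch, not a supported tool: the function name migrate_sink_types and the regular expression are illustrative assumptions, and the script only handles the two sink type values listed above.

```python
import re

# Replacements taken from the examples above: the short alias and the
# fully qualified class name of HBaseSink.
SINK_TYPE_REPLACEMENTS = {
    "hbase": "hbase2",
    "org.apache.flume.sink.hbase.HBaseSink": "org.apache.flume.sink.hbase2.HBase2Sink",
}

# Matches lines such as "agent.sinks.my_hbase_sink.type = hbase".
SINK_TYPE_LINE = re.compile(r"^(\s*[\w.-]+\.sinks\.[\w-]+\.type\s*=\s*)(\S+)\s*$")


def migrate_sink_types(config_text):
    """Return config_text with HBaseSink sink types replaced by HBase2Sink.

    Hypothetical helper: lines that are not sink type definitions, or that
    name a different sink type, are returned unchanged.
    """
    migrated = []
    for line in config_text.splitlines():
        match = SINK_TYPE_LINE.match(line)
        if match and match.group(2) in SINK_TYPE_REPLACEMENTS:
            line = match.group(1) + SINK_TYPE_REPLACEMENTS[match.group(2)]
        migrated.append(line)
    return "\n".join(migrated)
```

Review the output before deploying it. The sketch does not detect AsyncHBaseSink, which has no replacement and must be removed from the configuration manually.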

com.google.common.collect.ImmutableMap

Flume has removed com.google.common.collect.ImmutableMap from the org.apache.flume.Context API and replaced it with java.util.Map due to Guava compatibility issues (FLUME-2957). Plugins using the Context.getParameters() and Context.getSubProperties() APIs will need to assign the return value of those methods to a Map<String, String> variable instead of an ImmutableMap<String, String> variable, if they do not already do so. Most usages in the Flume codebase already used Map<String, String> at the time of this change.

Apache Hadoop

HDFS Incompatible Changes

  • HFTP has been removed.
  • The S3 and S3n connectors have been removed. Users should now use the S3a connector.
  • The BookkeeperJournalManager has been removed.
  • Changes were made to the structure of the HDFS JAR files to better isolate clients from Hadoop library dependencies. As a result, client applications that depend on Hadoop's library dependencies may no longer work. In these cases, the client applications will need to include the libraries as dependencies directly.
  • Several library dependencies were upgraded. Clients that depend on those libraries may break because the library version changes. In these cases, the client applications will need to either be ported to the new library versions or include the libraries as dependencies directly.
  • HDFS-6962 changes the behavior of ACL inheritance to better align with POSIX ACL specifications, which state that the umask has no influence when a default ACL propagates from parent to child. Previously, HDFS ACLs applied the client's umask to the permissions when inheriting a default ACL defined on a parent directory. Now, HDFS can ignore the umask in these cases for improved compliance with POSIX. This behavior is on by default due to the inclusion of HDFS-11957. It can be configured by setting dfs.namenode.posix.acl.inheritance.enabled in hdfs-site.xml. See the Apache Hadoop HDFS Permissions Guide for more information.
  • HDFS-11957 changes the default behavior of ACL inheritance introduced by HDFS-6962. Previously, the behavior was disabled by default. Now, the feature is enabled by default. Any code expecting the old ACL inheritance behavior will have to be updated. See the Apache Hadoop HDFS Permissions Guide for more information.
  • HDFS-6252 removed dfshealth.jsp since it is part of the old NameNode web UI. By default, Cloudera Manager links to the new NameNode web UI, which has an equivalent health page at dfshealth.html.
  • HDFS-11100 changes the behavior of deleting files protected by a sticky bit. Now, the deletion fails.
  • HDFS-10689 changes the behavior of the hdfs dfs chmod command. Now, the command resets sticky bit permission on a file/directory when the leading sticky bit is omitted in the octal mode (like 644). When a file or directory permission is applied using octal mode and sticky bit permission needs to be preserved, then it has to be explicitly mentioned in the permission bits (like 1644).
  • HDFS-10650 changes the behavior of DFSClient#mkdirs and DFSClient#primitiveMkdir. Previously, they created a new directory with the default permissions 00666. Now, they create a new directory with permissions 00777.
  • HADOOP-8143 changes the default behavior of distcp. Previously, the -pb option was not used by default, which may have caused some checksums to fail when block sizes did not match. Now, the -pb option is included by default to preserve block size when using distcp.
  • HADOOP-10950 changes several heap management variables:
    • The HADOOP_HEAPSIZE variable has been deprecated. Use HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN instead to set Xmx and Xms, respectively.
    • The internal variable JAVA_HEAP_MAX has been removed.
    • Default heap sizes have been removed. This will allow for the JVM to use auto-tuning based upon the memory size of the host. To re-enable the old default, configure HADOOP_HEAPSIZE_MAX="1g" in hadoop-env.sh.
    • All global and daemon-specific heap size variables now support units. If the variable is only a number, the size is assumed to be in megabytes.
  • HADOOP-14426 upgrades the version of Kerby from 1.0.0-RC2 to 1.0.0.
  • HDFS-10970 updates the version of Jackson from 1.9.13 to 2.x in hadoop-hdfs.
  • HADOOP-9613 updates the Jersey version to the latest 1.x release.
  • HADOOP-10101 updates the Guava dependency to 21.0.
  • HADOOP-14225 removes the xmlenc dependency. If you rely on the transitive dependency, you need to set the dependency explicitly in your code after this change.
  • HADOOP-13382 removes unneeded commons-httpclient dependencies from POM files in Hadoop and sub-projects. This incompatible change may affect projects that have undeclared transitive dependencies on commons-httpclient, which used to be provided by hadoop-common or hadoop-client.
  • HADOOP-13660 upgrades the commons-configuration version from 1.6 to 2.1.
  • HADOOP-12064 upgrades the following dependencies:
    • Guice from 3.0 to 4.0
    • cglib from 2.2 to 3.2.0
    • asm from 3.2 to 5.0.4
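
The sticky bit behavior described above for HDFS-10689 can be illustrated with plain permission arithmetic. The following is a minimal sketch that models only the octal-mode case; the helper names are hypothetical and the code does not call HDFS.

```python
import stat


def mode_after_chmod(octal_mode):
    """Permission bits that result from applying an octal mode with chmod.

    Models the HDFS-10689 behavior: the octal mode replaces the existing
    bits, so the sticky bit survives only if it is spelled out.
    """
    return int(octal_mode, 8)


def octal_mode_preserving_sticky(current_mode, octal_mode):
    """Return the octal string to pass to chmod so that a sticky bit that is
    already set on current_mode is kept (hypothetical helper)."""
    new_mode = int(octal_mode, 8)
    if current_mode & stat.S_ISVTX:
        new_mode |= stat.S_ISVTX
    return format(new_mode, "o")
```

For a directory whose current mode is 1777, the second helper turns a requested mode of 644 into 1644, which is the form that has to be passed explicitly to keep the sticky bit.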

MapReduce

  • Support for MapReduce v1 has been dropped from CDH 6.0.0.
  • CDH 6 supports applications compiled against CDH 5.7.0 and higher MapReduce frameworks. Make sure not to include the CDH JARs with your application by marking them as "provided" in the pom.xml file.

YARN

There are no incompatible changes in this release.

Apache HBase

CDH 6.0.0 contains the following downstream HBase incompatible change:

hbase.security.authorization

The default value for hbase.security.authorization has been changed from true to false. Secured clusters should explicitly set it to true in the XML configuration file before upgrading (HBASE-19483). The default was changed because not all clusters need authorization (history: HBASE-13275). Only clusters that need authorization should set this property to true.
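
Before upgrading a secured cluster, you can check whether the property is set explicitly. The following is a minimal sketch that assumes you have the contents of hbase-site.xml available as a string; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

PROPERTY_NAME = "hbase.security.authorization"


def explicit_authorization_setting(hbase_site_xml):
    """Return the explicitly configured value of hbase.security.authorization,
    or None when the file does not set it and relies on the default."""
    root = ET.fromstring(hbase_site_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == PROPERTY_NAME:
            return (prop.findtext("value") or "").strip().lower()
    return None


def needs_attention(hbase_site_xml):
    """True when a secured cluster would pick up the new default (false)
    or has authorization explicitly disabled."""
    return explicit_authorization_setting(hbase_site_xml) != "true"
```

A result of True from needs_attention on a secured cluster means the property should be set to true before the upgrade.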

Incompatible Changes

For more information about upstream incompatible changes, see the Apache Reference Guide Incompatible Changes and Upgrade Paths.

CDH 6.0.0 contains the following upstream HBase incompatible changes:

  • Public interface API changes:
  • HBASE-18792: hbase-2 needs to defend against hbck operations
  • HBASE-15982: Interface ReplicationEndpoint extends Guava's Service.
  • HBASE-18995: Split CellUtil into public CellUtil and PrivateCellUtil for Internal use only.
  • HBASE-19179: Purged the hbase-prefix-tree module and all references from the code base.
  • HBASE-17595: Add partial result support for small/limited scan; Now small scan and limited scan could also return partial results.
  • HBASE-16765: New default split policy, SteppingSplitPolicy.
  • HBASE-17442: Move most of the replication related classes from hbase-client to hbase-replication package.
  • HBASE-16196: The bundled JRuby 1.6.8 has been updated to version 9.1.9.0.
  • HBASE-18811: Filters have been moved from Public to LimitedPrivate.
  • HBASE-18697: Replaced hbase-shaded-server jar with hbase-shaded-mapreduce jar.
  • HBASE-18640: Moved mapreduce related classes out of hbase-server into separate hbase-mapreduce jar.
  • HBASE-19128: Distributed Log Replay feature has been removed.
  • HBASE-19176: Hbase-native-client has been removed.
  • HBASE-17472: Changed semantics of granting new permissions. Earlier, new grants would override previous permissions, but now, the new and existing permissions get merged.
  • HBASE-18374: The previous "mutate" latency metrics have been renamed to "put" metrics.
  • HBASE-15740: Removed Replication metric source.shippedKBs in favor of source.shippedBytes.
  • HBASE-13849: Removed restore and clone snapshot from the WebUI.
  • HBASE-13252: The concept of managed connections in HBase (previously deprecated) has been removed completely. All callers are now responsible for managing the lifecycle of the connections they acquire.
  • HBASE-14045: Bumped thrift version to 0.9.2.
  • HBASE-5401: Changed the number of tasks that PE runs when clients are mapreduce. Now tasks == client count. Previously, ten tasks per client instance were hardcoded.

Changed Behavior

CDH 6.0.0 contains the following HBase behavior changes:

  • HBASE-14350: Assignment Manager v2 - Split/Merge have moved to the Master; it runs them now. Hooks around Split/Merge are now noops. To intercept Split/Merge phases, CPs need to intercept on MasterObserver.
  • HBASE-18271: Moved to internal shaded netty.
  • HBASE-17343: Default MemStore to be CompactingMemStore instead of DefaultMemStore. In-memory compaction of CompactingMemStore demonstrated sizable improvement in HBase’s write amplification and read/write performance.
  • HBASE-19092: Make Tag IA.LimitedPrivate and expose for CPs.
  • HBASE-18137: Replication gets stuck for empty WALs.
  • HBASE-17513: Thrift Server 1 uses different QOP settings than RPC and Thrift Server 2 and can easily be misconfigured so there is no encryption when the operator expects it.
  • HBASE-16868: Add a replicate_all flag to replication peer config. The default value is true, which means all user tables (REPLICATION_SCOPE != 0 ) will be replicated to peer cluster.
  • HBASE-19341: Ensure Coprocessors can abort a server.
  • HBASE-18469: Correct RegionServer metric of totalRequestCount.
  • HBASE-17125: Marked Scan and Get's setMaxVersions() and setMaxVersions(int) as deprecated. They are easy to misunderstand with column family's max versions, so use readAllVersions() and readVersions(int) instead.
  • HBASE-16567: Core is now up on protobuf 3.1.0 (Coprocessor Endpoints and REST are still on protobuf 2.5.0).
  • HBASE-14004: Fix inconsistency between Memstore and WAL which may result in data in remote cluster that is not in the origin (Replication).
  • HBASE-18786: FileNotFoundException opening a StoreFile in a primary replica now causes a RegionServer to crash out where before it would be ignored (or optionally handled via close/reopen).
  • HBASE-17956: Raw scans will also read TTL expired cells.
  • HBASE-17017: Removed per-region latency histogram metrics.
  • HBASE-19483: Added ACL checks to RSGroup commands - On a secure cluster, only users with ADMIN rights will be able to execute RSGroup commands.
  • HBASE-19358: Improved the stability of log splitting during failover.
  • HBASE-18883: Updated our Curator version to 4.0 - Users who experience classpath issues due to version conflicts are recommended to use either the hbase-shaded-client or hbase-shaded-mapreduce artifacts.
  • HBASE-16388: Prevent client threads being blocked by only one slow region server - Added a new configuration to limit the max number of concurrent request to one region server.
  • HBASE-15212: New configuration to limit RPC request size to protect the server against very large incoming RPC requests. All requests larger than this size will be immediately rejected before allocating any resources.
  • HBASE-15968: This issue resolved two long-term issues in HBase: 1) Puts may be masked by a delete before them, and 2) Major compactions change query results. Offers a new behavior to fix this issue with a little performance reduction. Disabled by default. See the issue for details and caveats.
  • HBASE-13701: SecureBulkLoadEndpoint has been integrated into HBase core as default bulk load mechanism. It is no longer needed to install it as a coprocessor endpoint.
  • HBASE-9774: HBase native metrics and metric collection for coprocessors.
  • HBASE-18294: Reduce global heap pressure: flush based on heap occupancy.

Apache Hive/Hive on Spark/HCatalog

Apache Hive

Changing Table File Format from ORC with the ALTER TABLE Command Not Supported in CDH 6

Changing the table file format from ORC to another file format with the ALTER TABLE command is not supported in CDH 6 (it returns an error).

UNION ALL Statements Involving Data Types from Different Type Groups No Longer Use Implicit Type Casting

Prior to this change, Hive performed implicit casts when data types from different type groups were specified in queries that use UNION ALL. For example, before CDH 6.0, if you had the two following tables:

Table "one"

+------------+------------+------------+--+
| one.col_1  | one.col_2  | one.col_3  |
+------------+------------+------------+--+
| 21         | hello_all  | b          |
+------------+------------+------------+--+
        

Where col_1 datatype is int, col_2 datatype is string, and col_3 datatype is char(1).

Table "two"

+------------+------------+------------+--+
| two.col_4  | two.col_5  | two.col_6  |
+------------+------------+------------+--+
| 75.0       | abcde      | 45         |
+------------+------------+------------+--+
        

Where col_4 datatype is double, col_5 datatype is varchar(5), and col_6 datatype is int.

And you ran the following UNION ALL query against these two tables:

SELECT * FROM one UNION ALL SELECT col_4 AS col_1, col_5 AS col_2, col_6 AS
col_3 FROM two;
        

You received the following result set:

+------------+------------+------------+--+
| _u1.col_1  | _u1.col_2  | _u1.col_3  |
+------------+------------+------------+--+
| 75.0       | abcde      | 4          |
| 21.0       | hello      | b          |
+------------+------------+------------+--+
        

Note that this statement implicitly casts the values as follows, and two of the casts result in data loss:

  • one.col_1 is cast to a double datatype
  • one.col_2 is cast to a varchar(5) datatype, which truncates the original value from hello_all to hello
  • two.col_6 is cast to a char(1) datatype, which truncates the original value from 45 to 4

In CDH 6.0 and later, Hive performs implicit casts only within a type group and not across different type groups. For example, STRING, CHAR, and VARCHAR are in one type group, and INT, BIGINT, and DECIMAL are in another type group, and so on. As a result, the above UNION ALL query returns an exception for the columns whose datatypes are not part of the same type group. For more information, see HIVE-14251.

Support for UNION DISTINCT

Support has been added for the UNION DISTINCT clause in HiveQL. See HIVE-9039 and the Apache wiki for more details. This feature introduces the following incompatible changes to HiveQL:

  • Behavior in CDH 5:

    • SORT BY, CLUSTER BY, ORDER BY, LIMIT, and DISTRIBUTE BY can be specified without delineating parentheses either before a UNION ALL clause or at the end of the query, resulting in the following behaviors:

      • When specified before, these clauses are applied to the query before UNION ALL is applied.
      • When specified at the end of the query, these clauses are applied to the query after UNION ALL is applied.
    • The UNION clause is equivalent to UNION ALL, in which no duplicates are removed.
  • Behavior in CDH 6:

    • SORT BY, CLUSTER BY, ORDER BY, LIMIT, and DISTRIBUTE BY can be specified without delineating parentheses only at the end of the query, resulting in the following behaviors:

      • These clauses are applied to the entire query.
      • Specifying these clauses before the UNION ALL clause results in a parsing error.
    • The UNION clause is equivalent to UNION DISTINCT, in which all duplicates are removed.
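
The change in the meaning of a bare UNION clause can be modeled on plain lists of rows. The following is a minimal sketch for illustration only; the function names are hypothetical, and the row order of a real Hive result set is not guaranteed.

```python
def union_cdh5(left, right):
    """Bare UNION in CDH 5: behaves like UNION ALL and keeps duplicates."""
    return list(left) + list(right)


def union_cdh6(left, right):
    """Bare UNION in CDH 6: behaves like UNION DISTINCT and removes duplicates."""
    seen = set()
    result = []
    for row in list(left) + list(right):
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result
```

Queries that relied on duplicates being kept by a bare UNION need to be rewritten to use UNION ALL explicitly.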

OFFLINE and NO_DROP Options Removed from Table and Partition DDL

Support for Hive table and partition protection options has been removed in CDH 6.0, which includes removal of the following functionality:

  • Support has been removed for:

    • ENABLE | DISABLE NO_DROP [CASCADE]
    • ENABLE | DISABLE OFFLINE
    • ALTER TABLE … IGNORE PROTECTION
  • The following support has also been removed from the HiveMetastoreClient class:

    The ignoreProtection parameter has been removed from the dropPartitions methods in the IMetaStoreClient interface.

For more information, see HIVE-11145.

Cloudera recommends that you use Apache Sentry to replace most of this functionality. Although Sentry governs permissions on ALTER TABLE, it does not include permissions that are specific to a partition. See Authorization Privilege Model for Hive and Impala and Configuring the Sentry Service.

DESCRIBE Query Syntax Change

In CDH 6.0, the syntax for DESCRIBE queries has changed as follows:

  • DESCRIBE queries in which the column name is separated from the table name by a period are no longer supported:

    DESCRIBE testTable.testColumn;
                

    Instead, the table name and column name must be separated with a space:

    DESCRIBE testTable testColumn;
                
  • The partition_spec must appear after the table name, but before the optional column name:

    DESCRIBE default.testTable PARTITION (part_col = 100) testColumn;
                

For more details, see the Apache wiki and HIVE-12184.

CREATE TABLE Change: Periods and Colons No Longer Allowed in Column Names

In CDH 6.0, CREATE TABLE statements fail if any of the specified column names contain a period or a colon. For more information, see HIVE-10120 and the Apache wiki.
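
To find affected table definitions before upgrading, column names can be screened for the two disallowed characters. The following is a minimal sketch; the function name is hypothetical and the list of column names is assumed to come from your own schema inventory.

```python
def invalid_column_names(column_names):
    """Return the column names that contain a period or a colon and would
    therefore make a CREATE TABLE statement fail in CDH 6.0."""
    return [name for name in column_names if "." in name or ":" in name]
```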

Reserved and Non-Reserved Keyword Changes in HiveQL

Hive reserved and non-reserved keywords have changed in CDH 6.0. Reserved keywords cannot be used as table or column names unless they are enclosed with back ticks (for example, `data`). Non-reserved keywords can be used as table or column names without enclosing them with back ticks. Non-reserved keywords have prescribed meanings in HiveQL, but can still be used as table or column names. For more information about the changes to reserved and non-reserved words listed below, see HIVE-6617 and HIVE-14872.

In CDH 6.0, the following changes have been introduced to Hive reserved and non-reserved keywords and are not backwards compatible:

Hive New Reserved Keywords Added in CDH 6.0

The following table contains new reserved keywords that have been added:

COMMIT CONSTRAINT DEC EXCEPT
FOREIGN INTERVAL MERGE NUMERIC
ONLY PRIMARY REFERENCES ROLLBACK
START

Hive Non-Reserved Keywords Converted to Reserved Keywords in CDH 6.0

The following table contains non-reserved keywords that have been converted to be reserved keywords:

ALL ALTER ARRAY AS
AUTHORIZATION BETWEEN BIGINT BINARY
BOOLEAN BOTH BY CREATE
CUBE CURSOR DATE DECIMAL
DOUBLE DELETE DESCRIBE DROP
EXISTS EXTERNAL FALSE FETCH
FLOAT FOR FULL GRANT
GROUP GROUPING IMPORT IN
INT INNER INSERT INTERSECT
INTO IS LATERAL LEFT
LIKE LOCAL NONE NULL
OF ORDER OUT OUTER
PARTITION PERCENT PROCEDURE RANGE
READS REGEXP REVOKE RIGHT
RLIKE ROLLUP ROW ROWS
SET SMALLINT TABLE TIMESTAMP
TO TRIGGER TRUNCATE UNION
UPDATE USER USING VALUES
WITH TRUE

Hive Reserved Keywords Converted to Non-Reserved Keywords in CDH 6.0

The following table contains reserved keywords that have been converted to be non-reserved keywords:

CURRENT_DATE CURRENT_TIMESTAMP HOLD_DDLTIME IGNORE
NO_DROP OFFLINE PROTECTION READONLY

Hive New Non-Reserved Keywords Added in CDH 6.0

The following table contains new non-reserved keywords that have been added:

ABORT AUTOCOMMIT CACHE DAY
DAYOFWEEK DAYS DETAIL DUMP
EXPRESSION HOUR HOURS ISOLATION
KEY LAST LEVEL MATCHED
MINUTE MINUTES MONTH MONTHS
NORELY NOVALIDATE NULLS OFFSET
OPERATOR RELY SECOND SECONDS
SNAPSHOT STATUS SUMMARY TRANSACTION
VALIDATE VECTORIZATION VIEWS WAIT
WORK WRITE YEAR YEARS

Hive Non-Reserved Keyword Removed in CDH 6.0

The following non-reserved keyword has been removed:

DEFAULT
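
Identifiers that collide with a reserved keyword must be enclosed in back ticks. The following is a minimal sketch of such a check; the function name is hypothetical, and the set below contains only the new reserved keywords from the first table above. Extend it with the keywords that were converted from non-reserved to reserved if your schemas use them.

```python
# New reserved keywords added in CDH 6.0, copied from the first table above.
NEW_RESERVED_KEYWORDS = {
    "COMMIT", "CONSTRAINT", "DEC", "EXCEPT", "FOREIGN", "INTERVAL", "MERGE",
    "NUMERIC", "ONLY", "PRIMARY", "REFERENCES", "ROLLBACK", "START",
}


def quote_if_reserved(identifier, reserved=NEW_RESERVED_KEYWORDS):
    """Enclose identifier in back ticks when it matches a reserved keyword.

    The comparison is case-insensitive; other identifiers are returned
    unchanged.
    """
    if identifier.upper() in reserved:
        return "`" + identifier + "`"
    return identifier
```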

Apache Hive API Changes in CDH 6.0.0

AddPartitionMessage.getPartitions() Can Return NULL

The getPartitions() method has been removed from the AddPartitionEvent class in the org.apache.hadoop.hive.metastore.events package. It was removed to prevent out-of-memory errors when the list of partitions is too large.

Instead use the getPartitionIterator() method. For more information, see HIVE-9609 and the AddPartitionEvent documentation.

DropPartitionEvent and PreDropPartitionEvent Class Changes

The getPartitions() method has been removed and replaced by the getPartitionIterator() method in the DropPartitionEvent class and the PreDropPartitionEvent class.

In addition, the (Partition partition, boolean deleteData, HiveMetastore.HMSHandler handler) constructors have been deleted from the PreDropPartitionEvent class. For more information, see HIVE-9674 and the PreDropPartitionEvent documentation.

GenericUDF.getTimestampValue Method Now Returns Timestamp Instead of Date

The getTimestampValue method in the GenericUDF class now returns a TIMESTAMP value instead of a DATE value. For more information, see HIVE-10275 and the GenericUDF documentation.

GenericUDF.getConstantLongValue Has Been Removed

The getConstantLongValue method has been removed from the GenericUDF class. It has been noted by the community that this method is not used in Hive. For more information, see HIVE-10710 and the GenericUDF documentation.

Increased Width of Hive Metastore Configuration Columns

The columns used for configuration values in the Hive metastore have been increased in width, resulting in the following incompatible changes in the org.apache.hadoop.hive.metastore.api package.

This change introduced an incompatible change to the get_table_names_by_filter method of the ThriftHiveMetastore class. Before this change, this method accepted a string filter, which allowed clients to filter a table by its TABLEPROPERTIES value. For example:

org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.HIVE_FILTER_FIELD_PARAMS
       + "test_param_1 <> \"yellow\"";

org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.HIVE_FILTER_FIELD_PARAMS
       + "test_param_1 = \"yellow\"";
              

After this change, the TABLE_PARAMS.PARAM_VALUE column is a CLOB data type. Depending on the type of database that you use (for example, MySQL, Oracle, or PostgreSQL), the semantics may have changed and operators like "=", "<>", and "!=" might not be supported. Refer to the documentation for your database for more information. You must use operators that are compatible with CLOB data types. No CLOB-compatible equivalent exists for the "<>" inequality operator, so the first example above has no direct replacement. The equivalent for "=" is the LIKE operator, so you would rewrite the second example above as:

org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.HIVE_FILTER_FIELD_PARAMS
        + "test_param_1 LIKE \"yellow\"";
          

For more information, see HIVE-12274.

Apache Hive Configuration Changes in CDH 6.0.0

Bucketing and Sorting Enforced by Default When Inserting Data into Hive Tables

The configuration properties hive.enforce.sorting and hive.enforce.bucketing have been removed. When set to false, these configurations disabled enforcement of sorted and bucketed tables when data was inserted into a table. Removing these configuration properties effectively sets these properties to true. In CDH 6.0, bucketing and sorting are enforced on Hive tables during insertions and cannot be turned off. For more information, see the Apache wiki topic on hive.enforce.bucketing and the topic on hive.enforce.sorting.

Hive Throws an Exception When Processing HDFS Directories Containing Unsupported Characters

Directories in HDFS can contain unprintable or unsupported characters that are not visible even when you run the hadoop fs -ls command on the directories. When external tables are created with the MSCK REPAIR TABLE command, the partitions using these HDFS directories that contain unsupported characters are unusable for Hive. To avoid this, the configuration parameter hive.msck.path.validation has been added. This configuration property controls the behavior of the MSCK REPAIR TABLE command, enabling you to set whether validation checks are run on the HDFS directories when MSCK REPAIR TABLE is run.

The property hive.msck.path.validation can be set to one of the following values:

  • throw: Causes Hive to throw an exception when it tries to process an HDFS directory that contains unsupported characters with the MSCK REPAIR TABLE command. This is the default setting for hive.msck.path.validation.
  • skip: Causes Hive to skip the directories that contain unsupported characters, but still repairs the others.
  • ignore: Causes Hive to completely skip any validation of HDFS directories when the MSCK REPAIR TABLE command is run. This setting can cause bugs because unusable partitions are created.

By default, the hive.msck.path.validation property is set to throw, which causes Hive to throw an exception when MSCK REPAIR TABLE is run and HDFS directories containing unsupported characters are encountered. To work around this, set this property to skip until you can repair the HDFS directories that contain unsupported characters.

To set this property in Cloudera Manager:

  1. In the Admin Console, select the Hive service.
  2. Click the Configuration tab.
  3. Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting.
  4. In the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting, add the Name of the property, the Value (throw, skip, or ignore), and a Description of the setting.
  5. Click Save Changes and restart the service.

For more information, see HIVE-10722.
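
If you prefer to enter the safety valve setting as XML rather than as separate Name, Value, and Description fields, the property element can be generated and validated first. The following is a minimal sketch; the function name is hypothetical, and it assumes that the snippet is pasted into the hive-site.xml safety valve as a standard Hadoop-style property element.

```python
from xml.sax.saxutils import escape

PROPERTY_NAME = "hive.msck.path.validation"
VALID_VALUES = ("throw", "skip", "ignore")


def msck_validation_property(value, description=""):
    """Build a hive-site.xml <property> element for hive.msck.path.validation.

    Raises ValueError for anything other than throw, skip, or ignore.
    """
    if value not in VALID_VALUES:
        raise ValueError(f"{PROPERTY_NAME} must be one of {VALID_VALUES}, got {value!r}")
    return (
        "<property>\n"
        f"  <name>{PROPERTY_NAME}</name>\n"
        f"  <value>{value}</value>\n"
        f"  <description>{escape(description)}</description>\n"
        "</property>"
    )
```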

Hive Strict Checks Have Been Re-factored To Be More Granular

Originally, the configuration property hive.mapred.mode was added to restrict certain types of queries from running. Now it has been broken down into more fine-grained configurations, one for each type of restricted query pattern. The configuration property hive.mapred.mode has been removed and replaced with the following configuration properties, which provide more granular control of Hive strict checks:

  • hive.strict.checks.bucketing: When set to true, running LOAD DATA queries against bucketed tables is not allowed. Default value: true. This is a backwards incompatible change.
  • hive.strict.checks.type.safety: When set to true, comparing bigint to string data types or bigint to double data types is not allowed. Default value: true. This is a backwards incompatible change.
  • hive.strict.checks.orderby.no.limit: When set to true, prevents queries from being run that contain an ORDER BY clause with no LIMIT clause. Default value: false.
  • hive.strict.checks.no.partition.filter: When set to true, prevents queries from being run that scan a partitioned table but do not filter on the partition column. Default value: false.
  • hive.strict.checks.cartesian.product: When set to true, prevents queries from being run that contain a Cartesian product (also known as a cross join). Default value: false.

All of these properties can be set with Cloudera Manager in the following configuration settings for the Hive service:

  • Restrict LOAD Queries Against Bucketed Tables (hive.strict.checks.bucketing)
  • Restrict Unsafe Data Type Comparisons (hive.strict.checks.type.safety)
  • Restrict Queries with ORDER BY but no LIMIT clause (hive.strict.checks.orderby.no.limit)
  • Restrict Partitioned Table Scans with no Partitioned Column Filter (hive.strict.checks.no.partition.filter)
  • Restrict Cross Joins (Cartesian Products) (hive.strict.checks.cartesian.product)

For more information about these configuration properties, see HIVE-12727, HIVE-15148, HIVE-18251, and HIVE-18552.
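
To see which strict checks are active for a given set of overrides, the defaults described above can be combined with your own settings. The following is a minimal sketch; the function name is hypothetical and the defaults are copied from the property descriptions above.

```python
# Default values as described above for CDH 6.0.
STRICT_CHECK_DEFAULTS = {
    "hive.strict.checks.bucketing": True,
    "hive.strict.checks.type.safety": True,
    "hive.strict.checks.orderby.no.limit": False,
    "hive.strict.checks.no.partition.filter": False,
    "hive.strict.checks.cartesian.product": False,
}


def enabled_strict_checks(overrides=None):
    """Return the sorted names of the strict checks that are enabled after
    applying overrides (a dict of property name to bool) to the defaults."""
    effective = dict(STRICT_CHECK_DEFAULTS)
    for name, value in (overrides or {}).items():
        if name not in STRICT_CHECK_DEFAULTS:
            raise KeyError(name)
        effective[name] = value
    return sorted(name for name, enabled in effective.items() if enabled)
```

With no overrides, only the bucketing and type safety checks are reported, which are the two checks that are on by default.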

Java XML Serialization Has Been Removed

The configuration property hive.plan.serialization.format has been removed. Previously, this configuration property could be set to either javaXML or kryo. Now the default is kryo serialization, which cannot be changed. For more information, see HIVE-12609 and the Apache wiki.

Configuration Property Enabling Column Position Usage with GROUP BY and ORDER BY Separated into Two Properties

The configuration property hive.groupby.orderby.position.alias, which enabled using column position with the GROUP BY and the ORDER BY clauses has been removed and replaced with the following two configuration properties. These configuration properties enable using column position with GROUP BY and ORDER BY separately:

  • hive.groupby.position.alias: When set to true, specifies that columns can be referenced with their position when using GROUP BY clauses in queries. Default setting: false. This behavior is turned off by default. Possible values: true | false.
  • hive.orderby.position.alias: When set to true, specifies that columns can be referenced with their position when using ORDER BY clauses in queries. Default setting: true. This behavior is turned on by default. Possible values: true | false.

For more information, see HIVE-15797 and the Apache wiki entries for configuration properties, GROUP BY syntax, and ORDER BY syntax.
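
If an existing configuration sets the removed property, its value can be carried over to both replacement properties. The following is a minimal sketch; the function name is hypothetical, and copying the old value to both new properties is an assumption about how to preserve the previous behavior.

```python
OLD_PROPERTY = "hive.groupby.orderby.position.alias"
NEW_PROPERTIES = ("hive.groupby.position.alias", "hive.orderby.position.alias")


def migrate_position_alias(settings):
    """Return a copy of settings in which the removed property is replaced
    by the two new properties.

    Values that are already set for the new properties are left untouched.
    """
    migrated = {name: value for name, value in settings.items() if name != OLD_PROPERTY}
    if OLD_PROPERTY in settings:
        for name in NEW_PROPERTIES:
            migrated.setdefault(name, settings[OLD_PROPERTY])
    return migrated
```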

HiveServer2 Impersonation Property (hive.server2.enable.impersonation) Removed

In earlier versions of CDH, the following two configuration properties could be used to set impersonation for HiveServer2:

  • hive.server2.enable.impersonation
  • hive.server2.enable.doAs

In CDH 6.0, hive.server2.enable.impersonation is removed. To configure impersonation for HiveServer2, use the configuration property hive.server2.enable.doAs. To set this property in Cloudera Manager, select the Hive service and click on the Configuration tab. Then search for the HiveServer2 Enable Impersonation setting and select the checkbox to enable HiveServer2 impersonation. This property is enabled by default in CDH 6.

For more information about this property, see the Apache wiki documentation for HiveServer2 configuration properties.

Changed Default File Format for Storing Intermediate Query Results

The configuration property hive.query.result.fileformat controls the file format in which a query's intermediate results are stored. In CDH 6, the default setting for this property has been changed from TextFile to SequenceFile.

To change this configuration property in Cloudera Manager:

  1. In the Admin Console, select the Hive service and click on the Configuration tab.
  2. Then search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting and add the following information:

    • Name: hive.query.result.fileformat
    • Value: Valid values are TextFile, SequenceFile (default), or RCfile
    • Description: Sets the file format in which a query's intermediate results are stored.
  3. After you add this information, click Save Changes and restart the Hive service.

For more information about this parameter, see the Apache wiki.

HiveServer2 Thrift API Code Repackaged Resulting in Class File Location Changes

HiveServer2 Thrift API code has been repackaged in CDH 6.0, resulting in the following changes:

  • All files generated by the Thrift API for HiveServer2 have moved from the following old namespace:

    org.apache.hive.service.cli.thrift

    To the following new namespace:

    org.apache.hive.service.rpc.thrift

  • All files generated by the Thrift API for HiveServer2 have moved into a separate jar file called service-rpc.

As a result of these changes, all Java classes such as TCLIService.java, TOpenSessionReq.java, TSessionHandle.java, and TGetSchemasReq.java have changed locations. For more information, see HIVE-12442.

Values Returned for Decimal Numbers Are Now Padded with Trailing Zeroes to the Scale of the Specified Column

Decimal values that are returned in query results are now padded with trailing zeroes to match the specified scale of the corresponding column. For example, before this change, when Hive read a decimal column with a specified scale of 5, zero was returned as 0. Now, zero is returned as 0.00000. For more information, see HIVE-12063.
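
If a client compares result strings, the new padding can be reproduced on the expected values. The following Python sketch uses the standard decimal module to show the padded form; the helper name pad_to_scale is hypothetical, and the sketch only imitates the output format, not Hive's implementation.

```python
from decimal import Decimal


def pad_to_scale(value, scale):
    """Format a decimal string with trailing zeroes up to the given scale."""
    return format(Decimal(value), f".{scale}f")


print(pad_to_scale("0", 5))    # 0.00000
print(pad_to_scale("1.5", 5))  # 1.50000
```

Note that this formatting rounds values that have more fractional digits than the requested scale, so it is only suitable for values that already fit the column definition.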

Hive Logging Framework Switched to SLF4J/Log4j 2

The logging framework for Hive has switched to SLF4J (Simple Logging Facade for Java) and now uses Log4j 2 by default. Use of Log4j 1.x, Apache Commons Logging, and java.util.logging has been removed. To accommodate this change, write all Log4j configuration files to be compatible with Log4j 2.

For more information, see HIVE-12237, HIVE-11304, and the Apache wiki.

Deprecated Parquet Java Classes Removed from Hive

The deprecated Parquet classes parquet.hive.DeprecatedParquetInputFormat and parquet.hive.DeprecatedParquetOutputFormat have been removed from Hive because they resided outside of the org.apache namespace. Any existing tables that use these classes are automatically migrated to the new SerDe classes when the metastore is upgraded.

Use one of the following options for specifying the Parquet SerDe for new Hive tables:

  • Specify in the CREATE TABLE statement that you want it stored as Parquet. For example:

    CREATE TABLE <parquet_table_name> (col1 INT, col2 STRING) STORED AS PARQUET;
                
  • Set the INPUTFORMAT to org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat and set the OUTPUTFORMAT to org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat. For example:

    CREATE TABLE <parquet_table_name> (col1 INT, col2 STRING)
    STORED AS
         INPUTFORMAT "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
         OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat";
                

For more information, see HIVE-6757 and the Apache wiki.

Removed JDBC, Counter-based, and HBase-based Statistics Collection Mechanisms

Support for JDBC, counter-based, and HBase-based statistics collection mechanisms has been removed from Hive. The following configuration properties are no longer supported:

  • hive.stats.dbclass
  • hive.stats.retries.wait
  • hive.stats.retries.max
  • hive.stats.jdbc.timeout
  • hive.stats.dbconnectionstring
  • hive.stats.jdbcdriver
  • hive.stats.key.prefix.reserve.length

This change also removed the cleanUp(String keyPrefix) method from the StatsAggregator interface.

Now all Hive statistics are collected on the default file system. For more information, see HIVE-12164, HIVE-12411, HIVE-12005, and the Apache wiki.

S3N Connector Is Removed from CDH 6.0

The S3N connector, which is used to connect to the Amazon S3 file system from Hive, has been removed from CDH 6.0. To connect to the S3 file system from Hive in CDH 6.0, you must now use the S3A connector. There are a number of differences between the S3N and the S3A connectors, including configuration differences. See the Apache wiki page on integrating with Amazon Web Services for details.

Migration involves making the following changes:

  • Changing all metastore data containing URIs that start with s3n:// to s3a://. This change is performed automatically when you upgrade the Hive metastore.
  • Changing all scripts containing URIs that start with s3n:// to s3a://. You must perform this change manually.
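
The manual script change can be partly automated. The following Python sketch rewrites the URI scheme in the text of a script and reports how many URIs changed; the function name and the sample bucket are hypothetical, and the sketch does not address the configuration differences between the two connectors.

```python
import re

# Match the s3n scheme only at a word boundary, so that a longer token
# that merely ends in "s3n://" is left untouched.
S3N_URI = re.compile(r"\bs3n://")


def rewrite_s3n_uris(text):
    """Return (new_text, count) with every s3n:// scheme changed to s3a://."""
    return S3N_URI.subn("s3a://", text)


script = "LOAD DATA INPATH 's3n://example-bucket/in' INTO TABLE t;"
new_script, changed = rewrite_s3n_uris(script)
print(changed, new_script)
```

Running the function over a copy of each script and reviewing the differences before replacing the originals is the safer approach.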

Columns Added to TRowSet Returned by the Thrift TCLIService#GetTables Request

Six additional columns have been added to the TRowSet that is returned by the TCLIService#GetTables request. These columns were added to comply with the official JDBC API. For more information, see the documentation for java.sql.DatabaseMetaData.

The columns added are:

Column Name | Description
REMARKS | Explanatory comment on the table.
TYPE_CAT | Types catalog.
TYPE_SCHEMA | Types schema.
TYPE_NAME | Types name.
SELF_REFERENCING_COL_NAME | Name of the designated identifier column of a typed table.
REF_GENERATION | Specifies how values in the SELF_REFERENCING_COL_NAME column are created.

For more information, see HIVE-7575.

Support Added for Escaping Carriage Returns and New Line Characters for Text Files (LazySimpleSerDe)

Support has been added for escaping carriage returns and new line characters in text files by modifying the LazySimpleSerDe class. Without this change, carriage returns and new line characters are interpreted as delimiters, which causes incorrect query results.

This feature is controlled by the SerDe property serialization.escape.crlf. It is enabled (set to true) by default. If serialization.escape.crlf is enabled, 'r' or 'n' cannot be used as separators or field delimiters.

This change only affects text files and removes the getNullString method from the LazySerDeParameters class. For more information, see HIVE-11785.
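
The idea behind the escaping can be sketched in a few lines of Python. This is an illustration of the general technique only, not the LazySimpleSerDe implementation: the backslash escape character, the tab field delimiter, and both function names are assumptions. It also suggests why 'r' and 'n' are unavailable as delimiters, since the escaped forms end in those letters.

```python
ESCAPE = "\\"


def escape_crlf(field):
    """Escape the escape character first, then carriage returns and new lines."""
    return (field.replace(ESCAPE, ESCAPE + ESCAPE)
                 .replace("\r", ESCAPE + "r")
                 .replace("\n", ESCAPE + "n"))


def unescape_crlf(field):
    """Reverse escape_crlf by scanning for escape sequences."""
    out = []
    chars = iter(field)
    for ch in chars:
        if ch != ESCAPE:
            out.append(ch)
            continue
        nxt = next(chars, "")
        out.append({"r": "\r", "n": "\n"}.get(nxt, nxt))
    return "".join(out)


row = ["line one\nline two", "plain"]
encoded = "\t".join(escape_crlf(f) for f in row)
print(repr(encoded))
print([unescape_crlf(f) for f in encoded.split("\t")] == row)
```

Because the encoded row contains no raw new line, a line-oriented reader sees one record instead of two.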

Bucketing and Sorting Enforced by Default When Inserting Data into Hive Tables

The configuration properties hive.enforce.sorting and hive.enforce.bucketing have been removed. When set to false, these configurations disabled enforcement of sorted and bucketed tables when data was inserted into a table. Removing these configuration properties effectively sets these properties to true. In CDH 6.0, bucketing and sorting are enforced on Hive tables during insertions and cannot be turned off. For more information, see the Apache wiki topic on hive.enforce.bucketing and the topic on hive.enforce.sorting.

Hue

There are no incompatible changes in this release.

Apache Impala

List of Reserved Words Updated

The list of reserved words in Impala was updated in CDH 6.0.

If you need to use the reserved words from previous versions of CDH, set the impalad and catalogd startup option, reserved_words_version, to "2.11.0".

Decimal V2 Used by Default

In Impala, two different behaviors of DECIMAL types are supported. In CDH 6.0, DECIMAL V2 is used by default. See DECIMAL Type for detailed information.

If you need to continue using the first version of the DECIMAL type for the backward compatibility of your queries, set the DECIMAL_V2 query option to FALSE.

Behavior of Column Aliases Changed

To conform to the SQL standard, Impala no longer performs alias substitution in the subexpressions of GROUP BY, HAVING, and ORDER BY.

For example, the following statements will now result in syntax errors.
SELECT int_col / 2 AS x
FROM functional.alltypes
GROUP BY x / 2;

SELECT int_col / 2 AS x
FROM functional.alltypes
ORDER BY -x;

SELECT int_col / 2 AS x
FROM functional.alltypes
GROUP BY x
HAVING x > 3;

Default PARQUET_ARRAY_RESOLUTION Changed

The PARQUET_ARRAY_RESOLUTION query option controls the path-resolution behavior for Parquet files with nested arrays. The default value for PARQUET_ARRAY_RESOLUTION was changed to THREE_LEVEL in CDH 6.0. Review your queries to see whether the new default produces different result sets.

See PARQUET_ARRAY_RESOLUTION Query Option for information about the option.

Non-standard Timezone Names Unsupported

As the initial step for IANA time zone integration in a coming release, Impala drops support for non-standard time zone aliases in CDH 6.0.

Impala supports most IANA time zones. The following time zones are not supported: America/Fort_Nelson, America/Punta_Arenas, Asia/Atyrau, Asia/Barnaul, Asia/Famagusta, Asia/Tomsk, Asia/Yangon, Europe/Astrakhan, Europe/Kirov, Europe/Saratov, Europe/Ulyanovsk, GMT+0, GMT-0, ROC

See Unsupported Time Zone for the list of time zone aliases that are no longer supported and the canonical names that you can use to replace them.

Return Type Changed for EXTRACT and DATE_PART Functions in CDH 6.0 / Impala 3.0

The following changes were made to the EXTRACT and DATE_PART functions:
  • The output type of the EXTRACT and DATE_PART functions was changed to BIGINT.
  • Extracting the millisecond part from a TIMESTAMP returns the seconds component and the milliseconds component. For example, EXTRACT (CAST('2006-05-12 18:27:28.123456789' AS TIMESTAMP), 'MILLISECOND') will return 28123.
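
The millisecond result in the example above is the seconds component multiplied by 1000 plus the milliseconds component. The following Python sketch reproduces that arithmetic on the timestamp string; the function name is hypothetical, and truncating the fraction to three digits (rather than rounding) is an assumption that the source example does not settle.

```python
def millisecond_part(timestamp):
    """Mimic the described result: seconds * 1000 + milliseconds."""
    time_of_day = timestamp.split(" ")[1]      # '18:27:28.123456789'
    seconds_text = time_of_day.split(":")[2]   # '28.123456789'
    whole, _, fraction = seconds_text.partition(".")
    millis = int((fraction + "000")[:3])       # first three fractional digits
    return int(whole) * 1000 + millis


print(millisecond_part("2006-05-12 18:27:28.123456789"))  # 28123
```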

Apache Kafka

Kafka is now bundled as part of CDH. The following sections describe incompatible changes between the previous, separately installed Kafka (CDK powered by Apache Kafka version 3.1) and the CDH 6.0.0 Kafka version. These changes affect clients built with CDH 6.0.0 libraries. Cloudera recommends upgrading clients to the new release; however, clients built with previous versions of Kafka will continue to function.

Packaging

CDH and previous distributions of Kafka (CDK Powered by Apache Kafka) cannot coexist in the same cluster.

Deprecated Scala-based Client API and New Java Client API

Scala-based clients are deprecated in this release and will be removed in an upcoming release.

The following Scala-based client implementations from package kafka.* (known as 'old clients') are deprecated and unsupported as of CDH 6.0.0:

  • kafka.consumer.*
  • kafka.producer.*
  • kafka.admin.*

Client applications making use of these implementations must be migrated to corresponding Java clients available in org.apache.kafka.* (the 'new clients') package. Existing command line options and tools now use the new clients package.

Command Line Options Removed

Some command line tools are affected by the removal of old clients (see previous entry). The following options have been removed and are not recognized as valid options:
  • --new-consumer
  • --old-consumer
  • --old-producer
The tools affected use the new clients.
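
Wrapper scripts that still pass these options will fail, so it can help to check them before upgrading. The following Python sketch flags the removed options in an argument list; the function name, broker address, and topic are hypothetical.

```python
REMOVED_OPTIONS = ("--new-consumer", "--old-consumer", "--old-producer")


def find_removed_options(argv):
    """Return the removed options that appear in a command-line argument list."""
    return [arg for arg in argv if arg in REMOVED_OPTIONS]


command = ["kafka-console-consumer", "--new-consumer",
           "--bootstrap-server", "broker.example.com:9092", "--topic", "events"]
print(find_removed_options(command))  # ['--new-consumer']
```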

Command Line Tools Removed

The following command line tools and runnable classes are removed:

  • kafka-replay-log-producer
  • kafka-simple-consumer-shell
  • kafka.tools.ReplayLogProducer
  • kafka.tools.SimpleConsumerShell
  • kafka.tools.ExportZkOffset
  • kafka.tools.ImportZkOffset
  • kafka.tools.SimpleConsumerPerformance
  • kafka.tools.UpdateOffsetsInZK
  • kafka.tools.VerifyConsumerRebalance
  • kafka.tools.ProducerPerformance

Consumer API Changes

Consumer methods invoked with unassigned partitions now raise an IllegalStateException instead of an IllegalArgumentException.

Previous versions of the Consumer method poll(long) would wait for metadata updates regardless of the timeout parameter. This behavior is expected to change in future releases; make sure your client applications include an appropriate timeout parameter and do not rely on the previous behavior.

Exception Classes Removed

The following exceptions, which were deprecated in a previous release and are no longer thrown, have been removed:

  • GroupCoordinatorNotAvailableException
  • GroupLoadInProgressException
  • NotCoordinatorForGroupException
  • kafka.common.KafkaStorageException

Metrics Updated

Per-partition metrics for Kafka consumers now use tags for topic and partition instead of encoding them in the metric name. For more information, see KIP-225.

Apache Kudu

There are no incompatible changes in this release.

Apache Oozie

There are no incompatible changes in this release.

Apache Parquet

Packages and Group ID Renamed

As a part of the Apache incubation process, all Parquet packages and the project’s group ID were renamed as follows:

Parquet Version | 1.6.0 and lower (CDH 5.x) | 1.7.0 and higher (CDH 6.x)
Java Package Names | parquet.* | org.apache.parquet.*
Group ID | com.twitter | org.apache.parquet

If you directly consume the Parquet API, instead of using Parquet through Hive, Impala, or another CDH component, you need to update your code to reflect these changes:

Update *.java files:

Before: import parquet.*;
After: import org.apache.parquet.*;

Update pom.xml:

Before:
<dependency>
  <groupId>com.twitter</groupId>
  <version>${parquet.version}</version>
</dependency>

After:
<dependency>
  <groupId>org.apache.parquet</groupId>
  <version>${parquet.version}</version>
</dependency>
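
For larger code bases, both updates can be applied mechanically. The following Python sketch is one possible approach, with hypothetical function names; it rewrites import statements and groupId elements only, so fully qualified class names inside code bodies still need a manual pass. Because com.twitter is also the group ID of artifacts unrelated to Parquet, review every pom.xml change.

```python
import re

# Match `import parquet.` and `import static parquet.`, but leave
# already-migrated `import org.apache.parquet.` lines alone.
JAVA_IMPORT = re.compile(r"^(\s*import\s+(?:static\s+)?)parquet\.", re.MULTILINE)
GROUP_ID = re.compile(r"(<groupId>\s*)com\.twitter(\s*</groupId>)")


def migrate_java(source):
    """Rewrite Parquet imports in Java source text to the new package."""
    return JAVA_IMPORT.sub(r"\1org.apache.parquet.", source)


def migrate_pom(pom):
    """Rewrite com.twitter groupId elements to org.apache.parquet."""
    return GROUP_ID.sub(r"\1org.apache.parquet\2", pom)


print(migrate_java("import parquet.hadoop.ParquetReader;"))
print(migrate_pom("<groupId>com.twitter</groupId>"))
```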

API Methods Removed

In Parquet 1.6, a number of API methods were removed from the parquet.hadoop.ParquetInputSplit class that depended on reading metadata on the client side. Metadata should be read on the task side instead.
Removed Method | New Method to Use
parquet.hadoop.ParquetInputSplit.getFileSchema | org.apache.parquet.hadoop.api.InitContext.getFileSchema
parquet.hadoop.ParquetInputSplit.getRequestedSchema | org.apache.parquet.hadoop.api.ReadSupport.ReadContext.getRequestedSchema
parquet.hadoop.ParquetInputSplit.getReadSupportMetadata | org.apache.parquet.hadoop.api.ReadSupport.ReadContext.getReadSupportMetadata
parquet.hadoop.ParquetInputSplit.getBlocks | org.apache.parquet.hadoop.metadata.ParquetMetadata.getBlocks
parquet.hadoop.ParquetInputSplit.getExtraMetadata | -

Apache Pig

The following change is introduced to Pig in CDH 6.0 and is not a backwards compatible change. You must modify your Pig scripts as described below.

Removal of the Apache DataFu Pig JAR from CDH 6

Apache DataFu Pig is a collection of user-defined functions that can be used with Pig for data mining and statistical analysis on large-scale data. The DataFu JAR was included in CDH 5, but due to very low adoption rates, the JAR was deprecated in CDH 5.9 and is being removed from CDH 6, starting with CDH 6.0. It is no longer supported.

Recommended Migration Strategy

A simple way to assess what DataFu functions you are using in your Pig scripts is to use the grep utility to search for occurrences of "datafu" in your code. When DataFu functions are used in Pig scripts, you must use a function definition entry that contains "datafu" like the following example:
define <function_name> datafu.pig... .<class_name>();
           
Each match that grep reports for the string "datafu" identifies a place where the DataFu JAR is used.
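
Where grep is not available, the same search can be scripted. The following Python sketch reports the lines of a Pig script that mention DataFu; the function name is hypothetical, and the case-insensitive match is a choice made here, roughly equivalent to grep -i.

```python
import re

DATAFU = re.compile(r"datafu", re.IGNORECASE)


def find_datafu_lines(script_text):
    """Return (line_number, line) pairs that mention DataFu."""
    return [(number, line.strip())
            for number, line in enumerate(script_text.splitlines(), start=1)
            if DATAFU.search(line)]


script = """register datafu-1.2.0.jar;
define median datafu.pig.stats.Quantile('0.5');
A = LOAD 'nums.txt' AS (name:chararray, n:long);
"""
for number, line in find_datafu_lines(script):
    print(f"{number}: {line}")
```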

Cloudera recommends migrating to Hive UDFs or operators wherever it is possible. However, if there are cases where it is impossible to replace DataFu functions with Hive functions, download the upstream version of the DataFu Pig libraries and place them on the node where the Pig front end is used. To preserve compatibility, use the version 1.1.0 JAR, which was the version included in CDH 5. You can download the JAR file here. However, Cloudera does not support using this upstream DataFu JAR file.

Mapping DataFu UDFs to Hive UDFs

The following Hive UDFs map to DataFu UDFs and can be used instead in Pig scripts with the caveats that are listed:

Hive Functions That Map to DataFu Functions
DataFu Function (package) | Description | Hive UDF or Operator Equivalent | Caveats
MD5 (hash) | Computes the MD5 value of a string and outputs a hex value by default. | md5 | None
SHA (hash) | Computes the SHA value of a string and outputs a hex value by default. | sha/sha1 | None
RandInt (random) | Generates a uniformly distributed integer between two bounds. | rand | None
VAR (stats) | Generates the variance of a set of values. | variance | None
Median (stats) | Computes the median for a sorted input bag. A special case of the Quantile function. | percentile(0.5) | See Limitations When Substituting Quantile and Median DataFu Functions.
Quantile (stats) | Computes quantiles for a sorted input bag. | percentile | See Limitations When Substituting Quantile and Median DataFu Functions.
Coalesce (util) | Returns the first non-null value from a tuple, like COALESCE in SQL. | coalesce | None
InUDF (util) | Similar to the SQL IN function, this function provides a convenient way to filter using a logical disjunction over many values. | IN | None

For more information about using Hive UDFs, see https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_hive_udf.html.

Limitations When Substituting Quantile and Median DataFu Functions

With the exception of Median and Quantile, all Hive functions specified in the above table should work as expected in Pig scripts. Median extends Quantile in DataFu functions, and the equivalent Hive functions have a similar relationship. However, there is an important difference between how you use percentile and how you use Quantile. The differences are summarized in the following table:

Differences Between Usage of DataFu 'Quantile' and Hive 'percentile'
Function | Quantile | percentile
Input must be sorted | Yes | No
Nulls are allowed in input | No | Yes

Example using Quantile:
register datafu-1.2.0.jar;
define median datafu.pig.stats.Quantile('0.5');
A = LOAD 'nums.txt' AS (name:chararray, n:long);
B = group A by name;
C = foreach B {
    sorted = order A by n;
    generate group, flatten(median(sorted.n));
}

Example using percentile:
define percentile HiveUDAF('percentile');
A = LOAD 'nums.txt' AS (name:chararray, n:long);
B2 = foreach A generate name, n, 0.5 as perc;
C2 = GROUP B2 by name;
D2 = FOREACH C2 generate group, percentile(B2.(n, perc));

Although DataFu StreamingQuantile and StreamingMedian might appear to match Hive's percentile_approx function, Pig cannot consume percentile_approx.

DataFu Functions with No Hive Function or Operator Equivalent

The following general limitations apply when mapping DataFu UDFs to Hive UDFs:

  • Many DataFu functions operate on a custom Pig data structure called a bag. No Hive UDFs can operate on Pig bags, so there are no equivalents for these DataFu functions.
  • Some DataFu functions are custom functions that do not have Hive UDF equivalents, such as the functions that calculate geographic distances, run the PageRank algorithm, or do sampling.

DataFu Functions with No Hive UDF Equivalent
AppendToBag (bags) AssertUDF (util) BagConcat (bags)
BagGroup (bags) BagLeftOuterJoin (bags) BagSplit (bags)
BoolToInt (util) CountEach (bags) DistinctBy (bags)
EmptyBagToNull (bags) EmptyBagToNullFields (bags) Enumerate (bags)
FirstTupleFromBag (bags) HaversineDistInMiles (geo) IntToBool (util)
MarkovPairs NullToEmptyBag (bags) PageRank (linkanalysis)
PrependToBag (bags) ReservoirSample (sampling)* ReverseEnumerate (bags)
SampleByKey (sampling)* SessionCount (sessions) Sessionize (sessions)
SetIntersect (sets) SetUnion (sets) SimpleRandomSample (sampling)*
TransposeTupleToBag (util) UnorderedPairs (bags) UserAgentClassify (urls)
WeightedSample (sampling)* WilsonBinConf (stats)

* These DataFu functions might be replaced with TABLESAMPLE in HiveQL. See the Apache Hive wiki.

Cloudera Search

Cloudera Search in CDH 6.0 is rebased on Apache Solr 7.0, which has many incompatibilities with the 4.10 version of Apache Solr used in recent CDH 5 releases, such as the following:

  • Solr 7 uses a managed schema by default. Generating an instance directory no longer generates schema.xml. For instructions on switching to a managed schema, see Switching from schema.xml to Managed Schema in the Apache Solr Reference Guide.
  • Creating a collection using solrctl collection --create without specifying the -c <configName> parameter now uses a default configuration set (named _default) instead of a configuration set with the same name as the collection. To avoid this, always specify the -c <configName> parameter when creating new collections.

For the full list of changes, see the upstream release notes.

Apache Sentry

Apache Sentry contains the following incompatible change in CDH 6.0.0:

  • Sentry no longer supports policy file authorization. You must migrate policy files to the database-backed Sentry service before you upgrade to CDH 6.0.0 unless you are using Sentry policy files for Solr. If you are using Sentry policy files for Solr, you must migrate to the database-backed Sentry service after you upgrade.

    For information about migrating policy files before you upgrade, see Migrating from Sentry Policy Files to the Sentry Service. For information about migrating policy files for Solr after you upgrade, see Migrating Sentry Privileges for Solr After Upgrading to CDH 6.

Apache Spark

The following sections describe changes in Spark support in CDH 6 that might require special handling during upgrades, or code changes within existing applications.

  • All Spark applications built against Spark 1.6 in CDH 5 must be rebuilt against Spark 2.x in CDH 6.

  • Spark 2 in CDH 6 works with Java 8, not Java 7. If this change produces any Java code incompatibilities, update your Java code and rebuild the application.

  • Spark 2 in CDH 6 works with Scala 2.11, not Scala 2.10. If this change produces any Scala code incompatibilities, update your Scala code and rebuild the application.

  • SparkSession replaces HiveContext and SQLContext as the entry point. The older HiveContext and SQLContext handles still work for backward compatibility, but use the SparkSession object in place of both.

  • The separate DataFrame class has been removed from the Scala API. DataFrame is now a special case of Dataset (a type alias for Dataset[Row]).

    Since compile-time type-safety in Python and R is not a language feature, the concept of Dataset does not apply to these languages' APIs. Instead, DataFrame remains the primary programming abstraction.

  • Spark 2.0 and higher do not use an assembly JAR for standalone applications.

  • If you have event logs created in CDH 5.3 or lower, you cannot read those logs using Spark in CDH 6.0 or higher.

Apache Sqoop

The following changes are introduced in CDH 6.0, and are not backwards compatible:

  • All classes in com.cloudera.sqoop packages have been removed in CDH 6.0. Use the corresponding classes from org.apache.sqoop packages. For example, use org.apache.sqoop.SqoopOptions instead of com.cloudera.sqoop.SqoopOptions.
  • Because of changes introduced in the Sqoop metastore logic, the metastore database created by Sqoop CDH 6 cannot be used by earlier versions. The metastore database created by Sqoop CDH 5 can be used by both Sqoop CDH 5 and Sqoop CDH 6.

Require an explicit option to be specified with --split-by for a String column

Using the --split-by option with a CHAR or VARCHAR column does not always work properly, so Sqoop now requires the user to set the org.apache.sqoop.splitter.allow_text_splitter property to true to confirm that they are aware of this risk.

Example:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect $MYCONN \
    --username $MYUSER --password $MYPSWD --table "test_table" --split-by "string_column"

For more information, see SQOOP-2910.

Make Sqoop fail if user specifies --direct connector when it is not available

The --direct option is only supported with the following databases: MySQL, PostgreSQL, Oracle, Netezza.

In earlier releases, Sqoop silently ignored this option if it was specified for other databases, but it now throws an error.

Example:
sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table "direct_import" --direct
The command fails with the following error message:
Was called with the --direct option, but no direct connector available.

For more information, see SQOOP-2913.

Sqoop does not allow --as-parquetfile with hcatalog jobs or when hive import with create-hive-table is used

The --create-hive-table option is not supported when the user imports into Hive in Parquet format. Earlier this option was silently ignored, and the data was imported even if the Hive table existed. Sqoop will now fail if the --create-hive-table option is used with the --as-parquetfile option.

Example:
sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table "test_table" \
    --hive-import --as-parquetfile --create-hive-table
The command fails with the following error message:
Hive import and create hive table is not compatible with importing into ParquetFile format.

For more information, see SQOOP-3010.

Create fail fast for export with --hcatalog-table <HIVE_VIEW>

Importing into and exporting from a Hive view using HCatalog is not supported by Sqoop. A fail fast check was introduced so that now Sqoop throws a descriptive error message if the user specified a Hive view in the value of the --hcatalog-table option.

Example:
sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table "test_table" \
    --hcatalog-table "test_view"
The command fails with the following error message:
Reads/Writes from and to Views are not supported by HCatalog

For more information, see SQOOP-3027.

Simplify Unicode character support in source files

Simplify Unicode character support in source files (introduced by SQOOP-3074) by defining explicit locales instead of using EscapeUtils. The Java source files generated by Sqoop will be encoded in UTF-8 format.

For more information, see SQOOP-3075.

Columns added to MySql after initial Sqoop import, export back to table with same schema fails

If you export from HDFS to an RDBMS table and the file on HDFS has no value for some of the columns defined in the table, Sqoop uses the values of the --input-null-string and --input-null-non-string options. In earlier releases, this scenario was not supported and Sqoop failed.

For more information, see SQOOP-3158.

Sqoop fails if the user tries to encode a null value when using --direct connector and a MySQL database

The MySQL direct connector does not support the --null-string, --null-non-string, --input-null-string, and --input-null-non-string options. These options were silently ignored earlier, but Sqoop now throws an error if these options are used with MySQL direct imports and exports.

For more information, see SQOOP-3206.
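
Because several Sqoop option combinations that were silently ignored now fail, it may be worth checking existing job definitions before upgrading. The following Python sketch covers two of the combinations described in this section. The function name and the warning texts are hypothetical, and detecting MySQL by looking for a jdbc:mysql: connection string is an assumption; jobs that read the connection string from a variable or options file need a different check.

```python
NULL_OPTIONS = {"--null-string", "--null-non-string",
                "--input-null-string", "--input-null-non-string"}


def check_sqoop_args(argv):
    """Return warnings for option combinations that Sqoop in CDH 6 rejects."""
    args = set(argv)
    warnings = []
    if {"--as-parquetfile", "--create-hive-table"} <= args:
        warnings.append("--create-hive-table cannot be combined with --as-parquetfile")
    uses_mysql = any(arg.startswith("jdbc:mysql:") for arg in argv)
    if "--direct" in args and uses_mysql and args & NULL_OPTIONS:
        warnings.append("null-string options are not supported by the MySQL direct connector")
    return warnings


job = ["sqoop", "import", "--connect", "jdbc:mysql://db.example.com/sales",
       "--table", "test_table", "--direct", "--null-string", "NULL"]
for warning in check_sqoop_args(job):
    print(warning)
```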

Apache ZooKeeper

There are no incompatible changes in this release.