Apache Hive Components Incompatible Changes in CDH 6.0
See below for Hive changes that are not backwards compatible in CDH 6.0.
- UNION ALL Statements Involving Data Types from Different Type Groups No Longer Use Implicit Type Casting
- OFFLINE and NO_DROP Options Removed from Table and Partition DDL
- DESCRIBE Query Syntax Change
- CREATE TABLE Change: Periods and Colons No Longer Allowed in Column Names
- Reserved and Non-Reserved Keyword Changes in HiveQL
- Hive API Changes
- Hive Configuration Changes
- HiveServer2 Thrift API Code Repackaged Resulting in Class File Location Changes
- Values Returned for Decimal Numbers Are Now Padded with Trailing Zeroes to the Scale of the Specified Column
- Hive Logging Framework Switched to SLF4J/Log4j 2
- Deprecated Parquet Java Classes Removed from Hive
- Removed JDBC, Counter-based, and HBase-based Statistics Collection Mechanisms
- S3N Connector Is Removed from CDH 6.0
- Columns Added to TRowSet Returned by the Thrift TCLIService#GetTables Request
- Support Added for Escaping Carriage Returns and New Line Characters for Text Files (LazySimpleSerDe)
UNION ALL Statements Involving Data Types from Different Type Groups No Longer Use Implicit Type Casting
Prior to this change, Hive performed implicit casts when data types from different type groups were specified in queries that use UNION ALL. For example, before CDH 6.0, if you had the two following tables:
+------------+------------+------------+--+ | one.col_1 | one.col_2 | one.col_3 | +------------+------------+------------+--+ | 21 | hello_all | b | +------------+------------+------------+--+
Where col_1 datatype is int, col_2 datatype is string, and col_3 datatype is char(1).
+------------+------------+------------+--+ | two.col_4 | two.col_5 | two.col_6 | +------------+------------+------------+--+ | 75.0 | abcde | 45 | +------------+------------+------------+--+
Where col_4 datatype is double, col_5 datatype is varchar(5), and col_6 datatype is int.
And you ran the following UNION ALL query against these two tables:
SELECT * FROM one UNION ALL SELECT col_4 AS col_1, col_5 AS col_2, col_6 AS col_3 FROM two;
You received the following result set:
+------------+------------+------------+--+ | _u1.col_1 | _u1.col_2 | _u1.col_3 | +------------+------------+------------+--+ | 75.0 | abcde | 4 | | 21.0 | hello | b | +------------+------------+------------+--+
Note that this statement implicitly casts the values from table one with the following errors resulting in data loss:
- one.col_1 is cast to a double datatype
- one.col_2 is cast to a varchar(5) datatype, which truncates the original value from hello_all to hello
- one.col_3 is cast to a char(1) datatype, which truncates the original value from 45 to 4
In CDH 6.0, no implicit cast is performed across different type groups. For example, STRING, CHAR, and VARCHAR are in one type group, and INT, BIGINT, and DECIMAL are in another type group, and so on. So, in CDH 6.0 and later, the above query that uses UNION ALL, returns an exception for the columns that contain datatypes that are not part of a type group. In CDH 6.0 and later, Hive performs the implicit cast only within type groups and not across different type groups. For more information, see HIVE-14251.
OFFLINE and NO_DROP Options Removed from Table and Partition DDL
Support for Hive table and partition protection options have been removed in CDH 6.0, which includes removal of the following functionality:
Support has been removed for:
- ENABLE | DISABLE NO_DROP [CASCADE]
- ENABLE | DISABLE OFFLINE
- ALTER TABLE … IGNORE PROTECTION
The following support has also been removed from the HiveMetastoreClient class:
The ignoreProtection parameter has been removed from the dropPartitions methods in the IMetaStoreClient interface.
For more information, see HIVE-11145.
Cloudera recommends that you use Apache Sentry to replace most of this functionality. Although Sentry governs permissions on ALTER TABLE, it does not include permissions that are specific to a partition. See Authorization Privilege Model for Hive and Impala and Configuring the Sentry Service.
DESCRIBE Query Syntax Change
In CDH 6.0 syntax has changed for DESCRIBE queries as follows:
DESCRIBE queries where the column name is separated by the table name using a period is no longer supported:
Instead, the table name and column name must be separated with a space:
DESCRIBE testTable testColumn;
The partition_spec must appear after the table name, but before the optional column name:
DESCRIBE default.testTable PARTITION (part_col = 100) testColumn;
CREATE TABLE Change: Periods and Colons No Longer Allowed in Column Names
Reserved and Non-Reserved Keyword Changes in HiveQL
Hive reserved and non-reserved keywords have changed in CDH 6.0. Reserved keywords cannot be used as table or column names unless they are enclosed with back ticks (for example, `data`). Non-reserved keywords can be used as table or column names without enclosing them with back ticks. Non-reserved keywords have proscribed meanings in HiveQL, but can still be used as table or column names. For more information about the changes to reserved and non-reserved words listed below, see HIVE-6617 and HIVE-14872.
In CDH 6.0, the following changes have occurred with Hive reserved and non-reserved keywords:
Hive New Reserved Keywords Added in CDH 6.0
The following table contains new reserved keywords that have been added:
Hive Non-Reserved Keywords Converted to Reserved Keywords in CDH 6.0
The following table contains non-reserved keywords that have been converted to be reserved keywords:
Hive Reserved Keywords Converted to Non-Reserved Keywords in CDH 6.0
The following table contains reserved keywords that have been converted to be non-reserved keywords:
Hive New Non-Reserved Keywords Added in CDH 6.0
The following table contains new non-reserved keywords that have been added:
Hive Non-Reserved Keyword Removed in CDH 6.0
The following non-reserved keyword has been removed:
Apache Hive API Changes in CDH 6.0
The following backwards incompatible changes have been made to the Hive API in CDH 6.0:
AddPartitionMessage.getPartitions() Can Return NULL
The getPartitions() method has been removed from the AddPartitionEvent class in the org.apache.hadoop.hive.metastore.events interface. It was removed to prevent out-of-memory errors when the list of partitions is too large.
DropPartitionEvent and PreDropPartitionEvent Class Changes
The getPartitions() method has been removed and replaced by the getPartitionIterator() method in the DropPartitionEvent class and the PreDropPartitionEvent class.
In addition, the (Partition partition, boolean deleteData, HiveMetastore.HMSHandler handler) constructors have been deleted from the PreDropPartitionEvent class. For more information, see HIVE-9674 and the PreDropPartitionEvent documentation.
GenericUDF.getTimestampValue Method Now Returns Timestamp Instead of Date
GenericUDF.getConstantLongValue Has Been Removed
Increased Width of Hive Metastore Configuration Columns
The columns used for configuration values in the Hive metastore have been increased in width, resulting in the following incompatible changes in the org.apache.hadoop.hive.metastore.api interface.
This change introduced an incompatible change to the get_table_names_by_filter method of the ThriftHiveMetastore class. Before this change, this method accepts a string filter, which allows clients to filter a table by its TABLEPROPERTIES value. For example:
org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.HIVE_FILTER_FIELD_ PARAMS + "test_param_1 <> \"yellow\""; org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.HIVE_FILTER_FIELD_ PARAMS + "test_param_1 = \"yellow\"";
After this change, the TABLE_PARAMS.PARAM_VALUE column is now a CLOB data type. Depending on the type of database that you use (for example, MySQL, Oracle, or PostgresSQL), the semantics may have changed and operators like "=", "<>", and "!=" might not be supported. Refer to the documentation for your database for more information. You must use operators that are compatible with CLOB data types. There is no equivalent "<>" operator that is compatible with CLOB. So there is no equivalent operator for the above example that uses the "<>" inequality operator. The equivalent for "=" is the LIKE operator so you would rewrite the second example above as:
org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.HIVE_FILTER_FIELD_ PARAMS + "test_param_1 LIKE \"yellow"";
For more information, see HIVE-12274.
Apache Hive Configuration Changes in CDH 6.0
The following backwards incompatible changes have been made to Hive configuration properties in CDH 6.0:
- Bucketing and Sorting Enforced by Default When Inserting Data into Hive Tables
- Hive Throws an Exception When Processing HDFS Directories Containing Unsupported Characters
- Hive Strict Checks Have Been Re-factored To Be More Granular
- Java XML Serialization Has Been Removed
- Configuration Property Enabling Column Position Usage with GROUP BY and ORDER BY Separated into Two Properties
- HiveServer2 Impersonation Property (hive.server2.enable.impersonation) Removed
- Changed Default File Format for Storing Intermediate Query Results
Bucketing and Sorting Enforced by Default When Inserting Data into Hive Tables
The configuration properties hive.enforce.sorting and hive.enforce.bucketing have been removed. When set to false, these configurations disabled enforcement of sorted and bucketed tables when data was inserted into a table. Removing these configuration properties effectively sets these properties to true. In CDH 6.0, bucketing and sorting are enforced on Hive tables during insertions and cannot be turned off. For more information, see the Apache wiki topic on hive.enforce.bucketing and the topic on hive.enforce.sorting.
Hive Throws an Exception When Processing HDFS Directories Containing Unsupported Characters
Directories in HDFS can contain unprintable or unsupported characters that are not visible even when you run the hadoop fs -ls command on the directories. When external tables are created with the MSCK REPAIR TABLE command, the partitions using these HDFS directories that contain unsupported characters are unusable for Hive. To avoid this, the configuration parameter hive.msck.path.validation has been added. This configuration property controls the behavior of the MSCK REPAIR TABLE command, enabling you to set whether validation checks are run on the HDFS directories when MSCK REPAIR TABLE is run.
The property hive.msck.path.validation can be set to one of the following values:
|throw||Causes Hive to throw an exception when it tries to process an HDFS directory that contains unsupported characters with the MSCK REPAIR TABLE command. This is the default setting for hive.msck.path.validation.|
|skip||Causes Hive to skip the skip the directories that contain unsupported characters, but still repairs the others.|
|ignore||Causes Hive to completely skip any validation of HDFS directories when the MSCK REPAIR TABLE command is run. This setting can cause bugs because unusable partitions are created.|
By default, the hive.msck.path.validation property is set to throw, which causes Hive to throw an exception when MSCK REPAIR TABLE is run and HDFS directories containing unsupported characters are encountered. To work around this, set this property to skip until you can repair the HDFS directories that contain unsupported characters.
To set this property in Cloudera Manager:
- In the Admin Console, select the Hive service.
- Click the Configuration tab.
- Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting.
- In the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting, add the Name of the property, the Value (throw, skip, or ignore), and a Description of the setting.
- Click Save Changes and restart the service.
For more information, see HIVE-10722.
Hive Strict Checks Have Been Re-factored To Be More Granular
Originally, the configuration property hive.mapred.mode was added to restrict certain types of queries from running. Now it has been broken down into more fine-grained configurations, one for each type of restricted query pattern. The configuration property hive.mapred.mode has been removed and replaced with the following configuration properties, which provide more granular control of Hive strict checks:
|Configuration Property||Description||Default Value|
|hive.strict.checks.bucketing||When set to true, running LOAD DATA queries against bucketed tables is not allowed.||true. This is a backwards incompatible change.|
|hive.strict.checks.type.safety||When set to true, comparing bigint to string data types or bigint to double data types is not allowed.||true. This is a backwards incompatible change.|
|hive.strict.checks.orderby.no.limit||When set to true, prevents queries from being run that contain an ORDER BY clause with no LIMIT clause.||false|
|hive.strict.checks.no.partition.filter||When set to true, prevents queries from being run that scan a partitioned table but do not filter on the partition column.||false|
|hive.strict.checks.cartesian.product||When set to true, prevents queries from being run that contain a Cartesian product (also known as a cross join).||false|
All of these properties can be set with Cloudera Manager in the following configuration settings for the Hive service:
- Restrict LOAD Queries Against Bucketed Tables (hive.strict.checks.bucketing)
- Restrict Unsafe Data Type Comparisons (hive.strict.checks.type.safety)
- Restrict Queries with ORDER BY but no LIMIT clause (hive.strict.checks.orderby.no.limit)
- Restrict Partitioned Table Scans with no Partitioned Column Filter (hive.strict.checks.no.partition.filter)
- Restrict Cross Joins (Cartesian Products) (hive.strict.checks.cartesian.product)
Java XML Serialization Has Been Removed
Configuration Property Enabling Column Position Usage with GROUP BY and ORDER BY Separated into Two Properties
The configuration property hive.groupby.orderby.position.alias, which enabled using column position with the GROUP BY and the ORDER BY clauses has been removed and replaced with the following two configuration properties. These configuration properties enable using column position with GROUP BY and ORDER BY separately:
|Configuration Property Name||Description/Default Setting||Possible Values|
|hive.groupby.position.alias||When set to true, specifies that columns can be referenced with their position when using GROUP BY clauses in queries. Default Setting: false. This behavior is turned off by default.||true | false|
|hive.orderby.position.alias||When set to true, specifies that columns can be referenced with their position when using ORDER BY clauses in queries. Default Setting: true. This behavior is turned on by default.||true | false|
HiveServer2 Impersonation Property (hive.server2.enable.impersonation) Removed
In earlier versions of CDH, the following two configuration properties could be used to set impersonation for HiveServer2:
In CDH 6.0, hive.server2.enable.impersonation is removed. To configure impersonation for HiveServer2, use the configuration property hive.server2.enable.doAs. To set this property in Cloudera Manager, select the Hive service and click on the Configuration tab. Then search for the HiveServer2 Enable Impersonation setting and select the checkbox to enable HiveServer2 impersonation. This property is enabled by default in CDH 6.
For more information about this property, see the Apache wiki documentation for HiveServer2 configuration properties.
Changed Default File Format for Storing Intermediate Query Results
The configuration property hive.query.result.fileformat controls the file format in which a query's intermediate results are stored. In CDH 6, the default setting for this property has been changed from TextFile to SequenceFile.
To change this configuration property in Cloudera Manager:
- In the Admin Console, select the Hive service and click on the Configuration tab.
Then search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting and add the following information:
- Name: hive.query.result.fileformat
- Value: Valid values are TextFile, SequenceFile (default), or RCfile
- Description: Sets the file format in which a query's intermediate results are stored.
- After you add this information, click Save Changes and restart the Hive service.
For more information about this parameter, see the Apache wiki.
HiveServer2 Thrift API Code Repackaged Resulting in Class File Location Changes
HiveServer2 Thrift API code has been repackaged in CDH 6.0, resulting in the following changes:
All files generated by the Thrift API for HiveServer2 have moved from the following old namespace:
To the following new namespace:
- All files generated by the Thrift API for HiveServer2 have moved into a separate jar file called service-rpc.
As a result of these changes, all Java classes such as TCLIService.java, TOpenSessionReq.java, TSessionHandle.java, and TGetSchemasReq.java have changed locations. For more information, see HIVE-12442.
Values Returned for Decimal Numbers Are Now Padded with Trailing Zeroes to the Scale of the Specified Column
Decimal values that are returned in query results are now padded with trailing zeroes to match the specified scale of the corresponding column. For example, before this change, when Hive read a decimal column with a specified scale of 5, the value returned for zero was returned as 0. Now, the value returned for zero is 0.00000. For more information, see HIVE-12063.
Hive Logging Framework Switched to SLF4J/Log4j 2
The logging framework for Hive has switched to SLF4J (Simple Logging Facade for Java) and now uses Log4j 2 by default. Use of Log4j 1.x, Apache Commons Logging, and java.util.logging have been removed. To accommodate this change, write all Log4j configuration files to be compatible with Log4j 2.
Deprecated Parquet Java Classes Removed from Hive
The deprecated parquet classes, parquet.hive.DeprecatedParquetInputFormat and parquet.hive.DeprecatedParquetOutputFormat have been removed from Hive because they resided outside of the org.apache namespace. Any existing tables that use these classes are automatically migrated to the new SerDe classes when the metastore is upgraded.
Use one of the following options for specifying the Parquet SerDe for new Hive tables:
Specify in the CREATE TABLE statement that you want it stored as Parquet. For example:
CREATE TABLE <parquet_table_name> (col1 INT, col2 STRING) STORED AS PARQUET;
Set the INPUTFORMAT to org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat and set the OUTPUTFORMAT to org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat. For example:
CREATE TABLE <parquet_table_name> (col1 INT, col2 STRING) STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat";
Removed JDBC, Counter-based, and HBase-based Statistics Collection Mechanisms
Support for JDBC, counter-based, and HBase-based statistics collection mechanisms has been removed from Hive. The following configuration properties are no longer supported:
This change also removed the cleanUp(String keyPrefix) method from the StatsAggregator interface.
S3N Connector Is Removed from CDH 6.0
The S3N connector, which is used to connect to the Amazon S3 file system from Hive has been removed from CDH 6.0. To connect to the S3 file system from Hive in CDH 6.0, you must now use the S3A connector. There are a number of differences between the S3N and the S3A connectors, including configuration differences. See the Apache wiki page on integrating with Amazon Web Services for details.
Migration involves making the following changes:
- Changing all metastore data containing URIs that start with s3n:// to s3a://. This change is performed automatically when you upgrade the Hive metastore.
- Changing all scripts containing URIs that start with s3n:// to s3a://. You must perform this change manually.
Columns Added to TRowSet Returned by the Thrift TCLIService#GetTables Request
Six additional columns have been added to the TRowSet that is returned by the TCLIService#GetTables request. These columns were added to comply with the official JDBC API. For more information, see the documentation for java.sql.DatabaseMetaData.
The columns added are:
|REMARKS||Explanatory comment on the table.|
|SELF_REFERENCING_COL_NAME||Name of the designed identifier column of a typed table.|
|REF_GENERATION||Specifies how values in the SELF_REFERENCING_COL_NAME column are created.|
For more information, see HIVE-7575.
Support Added for Escaping Carriage Returns and New Line Characters for Text Files (LazySimpleSerDe)
Support has been added for escaping carriage returns and new line characters in text files by modifying the LazySimpleSerDe class. Without this change, carriage returns and new line characters are interpreted as delimiters, which causes incorrect query results.
This feature is controlled by the SerDe property serialization.escape.crlf. It is enabled (set to true) by default. If serialization.escape.crlf is enabled, 'r' or 'n' cannot be used as separators or field delimiters.