Apache Hive Known Issues

DirectSQL with PostgreSQL

Hive does not support direct SQL queries with a PostgreSQL metastore database. It supports this feature only with MySQL, MariaDB, and Oracle. With PostgreSQL, direct SQL is disabled as a precaution because of issues reported upstream where it is not possible to fall back on DataNucleus in the event of some failures, plus a couple of other non-standard behaviors. For more information, see Hive Configuration Properties.

Affected Versions: CDH 5.13.0

Bug: DOCS-2557

Increase width of columns used for general configuration in the metastore

HIVE-12274 was only partially backported to CDH. The HMS backend database changes are not included in CDH because they would break backward compatibility. The HMS schema is not changed automatically during the upgrade; it must be changed manually.

Affected Versions: CDH 5.13.0

Bug: HIVE-12274

Workaround: Change the HMS schema manually.

Hive SSL vulnerability bug disclosure

If you use Cloudera Hive JDBC drivers to connect your applications with HiveServer2, you are not affected if:

  • SSL is not turned on, or
  • SSL is turned on but only non-self-signed certificates are used.

If neither of the above statements describes your deployment, read the Security Bulletin Apache Hive SSL Vulnerability Bug Disclosure for further details.

Leak of threads from getInputPaths and getInputSummary thread pool can degrade HiveServer2 performance

The optimization made to the Utilities.getInputPaths() method in HIVE-15546 introduced a thread pool that is not shut down when its threads complete their work. This leaks threads for each query that runs in HiveServer2, and the leaked threads are never removed automatically. When queries span multiple partitions, the number of spawned threads increases but is never reduced. After the thread count reaches 10,000, HiveServer2 performance degrades.

Affected Versions: CDH 5.11.1, 5.11.2, 5.12.0, 5.12.1

Fixed in Versions: CDH 5.11.3, 5.12.2, 5.13.0

Bug: HIVE-16949, CDH-57789

Resolution: Upgrade to a fixed version of CDH or use the workaround.

Workaround: Upgrading to a fixed version of CDH is the preferred resolution for this issue, but if that is not immediately possible, you can set the Hive > Configuration > HiveServer2 > Performance > Input Listing Max Threads property to 1 in Cloudera Manager for managed clusters. If your cluster is not managed, set hive.exec.input.listing.max.threads=1 in the hive-site.xml file.
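
For unmanaged clusters, the hive-site.xml entry uses the standard property format:

<property>
  <name>hive.exec.input.listing.max.threads</name>
  <value>1</value>
</property>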

Hive metastore canary fails

The Hive metastore canary can fail in environments where the Hive-Sentry metastore plugin is deployed and where there are over one million Hive tables and partitions. In this situation, Sentry times out while waiting to receive a full snapshot from the Hive metastore.

Affected Versions: CDH 5.9.2, 5.10.1, 5.10.2, 5.11.1, 5.11.2, and CDH 5.12 and higher

Bug: CDH-55255

Resolution: Use workaround.

Workaround: If your environment contains over one million Hive tables and partitions, before you upgrade to one of the affected versions, increase the Sentry timeout property for the metastore plugin to at least 10 to 20 minutes (600,000 to 1,200,000 ms). In Cloudera Manager, set the Hive Service Advanced Configuration Snippet (Safety Valve) for sentry-site.xml as follows:

  • Name: sentry.hdfs.service.client.server.rpc-connection-timeout
  • Value: 600000
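
In XML form, this safety valve entry for sentry-site.xml is:

<property>
  <name>sentry.hdfs.service.client.server.rpc-connection-timeout</name>
  <value>600000</value>
</property>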

After setting this safety valve, monitor your environment. If the Hive metastore canary continues to fail, increase the value by four-minute increments until Sentry can receive the full snapshot from the metastore. For example, if you set the safety valve to 600,000 ms and the canary fails, increase it to 840,000 ms. If it still fails, increase the value to 1,080,000 ms, and so on.

Hive Local Mode is not supported in production environments

Cloudera does not currently support the use of Hive Local Mode in production environments. This mode is set with the hive.exec.mode.local.auto property. Use this mode for testing purposes only.
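
If you want the setting to be explicit, a minimal hive-site.xml sketch that keeps local mode disabled (false is the default) looks like this:

<property>
  <name>hive.exec.mode.local.auto</name>
  <value>false</value>
</property>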

Affected Versions: Not applicable.

Bug: None.

Workaround: None.

INSERT INTO overwrites EXTERNAL tables on the local filesystem

When EXTERNAL tables are located on the local filesystem (URIs beginning with file://), the INSERT INTO statement overwrites the table data. Defining EXTERNAL tables on the local filesystem is not a well-documented practice, so its behavior is not well defined and is subject to change.

Affected Versions: CDH 5.10 and higher.

Fixed in Version: CDH 5.10.1

Bug: None.

Workaround: Change the table location from the local filesystem to an HDFS location.
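
For example, assuming a hypothetical table named my_ext_table and an HDFS destination path (copy the data to the new location first; SET LOCATION changes only the table metadata):

ALTER TABLE my_ext_table SET LOCATION 'hdfs:///user/hive/warehouse/my_ext_table';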

Built-in version() function is not supported

Cloudera does not currently support the built-in version() function.

Affected Versions: Not applicable.

Bug: CDH-40979

Workaround: None.

EXPORT and IMPORT commands fail for tables or partitions with data residing on Amazon S3

The EXPORT and IMPORT commands fail when the data resides on the Amazon S3 filesystem because the default Hive configuration restricts which file systems can be used for these statements.

Bug: None.

Resolution: Use workaround.

Workaround: Add S3 to the list of supported filesystems for EXPORT and IMPORT by setting the following property in the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml in Cloudera Manager (select Hive service > Configuration > HiveServer2):

 
<property>
  <name>hive.exim.uri.scheme.whitelist</name>
  <value>hdfs,pfile,s3a</value>
</property>
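
With s3a added to the whitelist, an EXPORT to S3 looks like the following sketch (the table name, bucket, and path are hypothetical):

EXPORT TABLE sales TO 's3a://my-bucket/hive-exports/sales';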

Hive queries on MapReduce 1 cannot use Amazon S3 when the Cloudera Manager External Account feature is used

Hive queries that read or write data to Amazon S3 and use the Cloudera Manager External Account feature for S3 credential management do not work with MapReduce 1 (MRv1), because MRv1 is deprecated on CDH.

Bug: CDH-45201

Resolution: Use workaround.

Workaround: Migrate your cluster from MRv1 to MRv2. See Migrating from MapReduce (MRv1) to MapReduce (MRv2).

ALTER PARTITION does not work on Amazon S3 or between S3 and HDFS

Cloudera recommends that you do not use ALTER PARTITION on S3 or between S3 and HDFS.

Bug: CDH-42420

Hive cannot drop encrypted databases in cascade if trash is enabled

Affected Versions: CDH 5.7.0, 5.7.1, 5.7.2, 5.7.3, 5.7.4, 5.7.5, 5.7.6, 5.8.0, 5.8.1, 5.8.2, 5.8.3, 5.8.4, 5.8.5, 5.9.0, 5.9.1, 5.9.2, 5.10.0, 5.10.1, 5.11.0, 5.11.1

Fixed in Versions: CDH 5.7.7, 5.8.6, 5.9.3, 5.10.2, 5.11.2, 5.12.0

Bug: HIVE-11418, CDH-29913

Workaround: Remove each table using the PURGE keyword (DROP TABLE table PURGE). After all tables are removed, remove the empty database (DROP DATABASE database).
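
A sketch of the workaround sequence, assuming a hypothetical database encdb containing a table t1:

DROP TABLE encdb.t1 PURGE;
-- repeat for each remaining table in the database
DROP DATABASE encdb;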

Potential failure of "alter table <schema>.<table> rename to <schema>.<new_table_name>"

When Hive renames a managed table, it always creates the new renamed table directory under its database directory in order to preserve the database/table hierarchy. If the database was created without an explicit location, the renamed table directory is created under the default database directory.

Filesystem encryption is typically added as part of hardening a system that already exists, along with its data. Because a database can be created without a location set (one is not strictly required), and the default database directory can be outside the table's encryption zone (or in an unencrypted zone), the ALTER TABLE ... RENAME operation fails: HDFS does not allow renames across encryption zone boundaries.

Affected Version: CDH 5.5 only

Bug: None.

Resolution: Use workaround.

Workaround: Use the following statements:

CREATE DATABASE database_encrypted_zone LOCATION '/hdfs/encrypted_path/database_encrypted_zone';
USE database_encrypted_zone;
-- (id INT) is an illustrative column definition; Hive requires one
CREATE TABLE rename_test_table (id INT) LOCATION '/hdfs/encrypted_path/database_encrypted_zone/rename_test';
ALTER TABLE rename_test_table RENAME TO test_rename_table;

Because both the database and the table are inside the encryption zone, the renamed table directory remains under the database directory and the rename succeeds.

Hive upgrade from CDH 5.0.5 fails on Debian 7.0 if a Sentry 5.0.x release is installed

Upgrading Hive from CDH 5.0.5 to CDH 5.2, 5.3, or 5.4 fails if a Sentry version later than 5.0.4 and earlier than 5.1.0 is installed. You will see an error such as the following:
: error processing
    /var/cache/apt/archives/hive_0.13.1+cdh5.2.0+221-1.cdh5.2.0.p0.32~precise-cdh5.2.0_all.deb
    (--unpack):   trying to overwrite '/usr/lib/hive/lib/commons-lang-2.6.jar', which is also
    in package sentry 1.2.0+cdh5.0.5
This is because of a conflict involving commons-lang-2.6.jar.

Bug: None.

Workaround: Upgrade Sentry first and then upgrade Hive. Upgrading Sentry deletes all the JAR files that Sentry has installed under /usr/lib/hive/lib and installs them under /usr/lib/sentry/lib instead.

Hive creates an invalid table if you specify more than one partition with alter table

Hive (in all known versions from 0.7) allows you to configure multiple partitions with a single alter table command, but the configuration it creates is invalid for both Hive and Impala.

Bug: None

Resolution: Use workaround.

Workaround:

Correct results can be obtained by configuring each partition with its own alter table command in either Hive or Impala. For example, the following:
ALTER TABLE page_view ADD
  PARTITION (dt='2008-08-08', country='us') LOCATION '/path/to/us/part080808'
  PARTITION (dt='2008-08-09', country='us') LOCATION '/path/to/us/part080809';

should be replaced with:

ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') LOCATION '/path/to/us/part080808';
ALTER TABLE page_view ADD PARTITION (dt='2008-08-09', country='us') LOCATION '/path/to/us/part080809';

Commands run against an Oracle backed metastore may fail

Commands run against an Oracle-backed metastore fail with the following error:
javax.jdo.JDODataStoreException Incompatible data type for column TBLS.VIEW_EXPANDED_TEXT : was CLOB (datastore),
but type expected was LONGVARCHAR (metadata). Please check that the type in the datastore and the type specified in the MetaData are consistent.

This error may occur if the metastore is run on top of an Oracle database with the configuration property datanucleus.validateColumns set to true.

Bug: None

Workaround: Set datanucleus.validateColumns=false in the hive-site.xml configuration file.
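
The corresponding hive-site.xml entry:

<property>
  <name>datanucleus.validateColumns</name>
  <value>false</value>
</property>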

Hive Web Interface is not supported

Cloudera no longer supports the Hive Web Interface because of inconsistent upstream maintenance of this project.

Bug: DISTRO-77

Resolution: Use workaround.

Workaround: Use Hue and Beeswax instead of the Hive Web Interface.

Hive might need additional configuration to make it work in a federated HDFS cluster

Hive jobs normally move data from a temporary directory to a warehouse directory during execution. Hive uses /tmp as its temporary directory by default, and users usually configure /user/hive/warehouse/ as the warehouse directory. Under Federated HDFS, /tmp and /user are configured as ViewFS mount tables, and so the Hive job will actually try to move data between two ViewFS mount tables. Federated HDFS does not support this, and the job will fail with the following error:
Failed with exception Renames across Mount points not supported 

Bug: None

Resolution: No software fix planned; use the workaround.

Workaround: Modify /etc/hive/conf/hive-site.xml to allow the temporary directory and warehouse directory to use the same ViewFS mount table. For example, if the warehouse directory is /user/hive/warehouse, add the following property to /etc/hive/conf/hive-site.xml so both directories use the ViewFS mount table for /user.
<property>
 <name>hive.exec.scratchdir</name>
 <value>/user/${user.name}/tmp</value>
</property> 

Cannot create archive partitions with external HAR (Hadoop Archive) tables

ALTER TABLE ... ARCHIVE PARTITION is not supported on external tables.

Bug: None

Workaround: None

Setting hive.optimize.skewjoin to true causes long-running queries to fail

Bug: None

Workaround: None

Object types Server and URI are not supported in "SHOW GRANT ROLE roleName on OBJECT objectName"

Bug: None

Workaround: Use SHOW GRANT ROLE roleName to list all privileges granted to the role.
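
For example, with a hypothetical role named analyst:

SHOW GRANT ROLE analyst;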

Kerberized HS2 with LDAP authentication fails in a multi-domain LDAP case

In CDH 5.7, Hive introduced a feature to support HiveServer2 with Kerberos plus LDAP authentication, but it broke compatibility with multi-domain LDAP cases on CDH 5.7.x and CDH 5.8.x versions.

Affected Versions: CDH 5.7.1

Fixed in Versions: CDH 5.7.2

Bug: HIVE-13590.

Workaround: None.

HCatalog Known Issues

WebHCatalog does not work in a Kerberos-secured federated cluster

Bug: None

Resolution: None planned.

Workaround: None.

Hive-on-Spark (HoS) Known Issues

Hive on Spark queries are failing with "Timed out waiting for client to connect" for an unknown reason

If this exception is preceded by log messages of the form "client.RpcRetryingCaller: Call exception...", then the failure is due to an unavailable HBase service. On a secure cluster, spark-submit tries to obtain delegation tokens from HBase even though HoS does not necessarily need them, so if HBase is unavailable, spark-submit throws an exception.

Bug: None

Versions Affected: CDH 5.7.0 and higher

Workaround: Fix the HBase service, or set spark.yarn.security.credentials.hbase.enabled to false.
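
One way to apply the second workaround, assuming your deployment passes Spark properties to Hive on Spark through hive-site.xml (HoS forwards spark.* properties from its configuration to the Spark job):

<property>
  <name>spark.yarn.security.credentials.hbase.enabled</name>
  <value>false</value>
</property>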

Intermittent inconsistent results for Hive-on-Spark with DECIMAL-type columns

SQL queries executed using Hive-on-Spark (HoS) can produce inconsistent results intermittently if they use a GROUP BY clause (user-defined aggregate functions, or UDAFs) on DECIMAL data type columns.

Hive queries are affected if all of the following conditions are met:

  • Spark is used as the execution engine (hive.execution.engine=spark) in the Affected Versions listed below.
  • The property spark.executor.cores is set to a value greater than 1.
  • The query uses a UDAF that requires a GROUP BY clause on a DECIMAL data type column.

    For example:
    COUNT(DISTINCT <decimal_column_name>), SELECT SUM(<column>) GROUP BY <decimal_column>

    You can use the EXPLAIN command to view the query execution plan to check for the presence of the GROUP BY clause on a DECIMAL data type column.
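
For example, assuming a hypothetical table orders with a DECIMAL column amount:

EXPLAIN SELECT amount, COUNT(*) FROM orders GROUP BY amount;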

Bug: HIVE-16257, HIVE-12768

Severity: High

Affected Versions:
  • CDH 5.4.0, 5.4.1, 5.4.2, 5.4.3, 5.4.4, 5.4.5, 5.4.7, 5.4.8, 5.4.9, 5.4.10, 5.4.11
  • CDH 5.5.0, 5.5.1, 5.5.2, 5.5.3, 5.5.4, 5.5.5, 5.5.6
  • CDH 5.6.0, 5.6.1
  • CDH 5.7.0, 5.7.1, 5.7.3, 5.7.4, 5.7.5, 5.7.6
  • CDH 5.8.0, 5.8.1, 5.8.2, 5.8.3, 5.8.4
  • CDH 5.9.0, 5.9.1
  • CDH 5.10.0, 5.10.1
Resolution: Use the workaround or upgrade to a release that addresses this issue:
  • CDH 5.8.5 and higher
  • CDH 5.9.2 and higher
  • CDH 5.11.0 and higher

Workaround: To fix this issue, set spark.executor.cores=1 for sessions that run queries that use the GROUP BY clause on DECIMAL data type columns. This might cause performance degradation for some queries. Cloudera recommends that you run performance validations before setting this property.

You can set this with Cloudera Manager in HiveServer2 at the cluster level:
  1. In the Cloudera Manager Admin console, go to the Hive service, and select Configuration > HiveServer2 > Performance.
  2. In the Search text box, type spark.executor.cores and press Enter.
  3. Set Spark Executor Cores to 1.
  4. Click Save Changes.
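
If your deployment allows setting Spark properties at the session level, you can also apply the workaround per session in Beeline before running the affected query:

SET spark.executor.cores=1;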

Hive-on-Spark throws exception for a multi-insert with a join query

A multi-insert combined with a join query in Hive on Spark (HoS) sometimes throws an exception. This occurs only when multiple parts of the resultant operator tree are executed on the same executor by Spark.

Affected Versions: CDH 5.7.0

Fixed in Versions: CDH 5.7.1

Bug: HIVE-13300, CDH-38458

Workaround: Run inserts one at a time.

NullPointerException thrown when a Spark session is reused to run a mapjoin

Some Hive-on-Spark (HoS) queries might fail with a NullPointerException if a Spark dependency is not set.

Bug: HIVE-12616

Workaround: Configure Hive to depend on the Spark (on YARN) service in Cloudera Manager.

Large Hive-on-Spark queries might fail in Spark tasks with ExecutorLostFailure

The root cause is java.lang.OutOfMemoryError: Unable to acquire XX bytes of memory, got 0. Spark executors can run out of memory because of a failure to correctly spill shuffle data from memory to disk.

Bug: None.

Workaround: Run this query using MapReduce.
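
For example, to switch the current session back to MapReduce before running the query:

SET hive.execution.engine=mr;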

Hive-on-Spark2 is not supported

Hive-on-Spark is a CDH component that has a dependency on Spark 1.6. Because CDH components do not have any dependencies on Spark 2, Hive-on-Spark does not work with the Cloudera Distribution of Apache Spark 2.