Impala Known Issues
The following sections describe known issues and workarounds in Impala.
For issues fixed in various Impala releases, see Impala Fixed Issues.
Known Issues in the Current Production Release (Impala 2.3.x / CDH 5.5.x)
These known issues affect the current release. Any workarounds are listed here. The bug links take you to the Impala issues site, where you can see the diagnosis and whether a fix is in the pipeline.
- Wrong assignment of having clause predicate across outer join
- Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate
- Incorrect results with basic predicate on CHAR typed column
- Incorrect assignment of predicates through an outer join in an inline view
- Incorrect assignment of On-clause predicate inside inline view with an outer join
- Crash: impala::Coordinator::ValidateCollectionSlots
- Fix issues with the legacy join and agg nodes using --enable_partitioned_hash_join=false and --enable_partitioned_aggregation=false
- Invalid Boolean value not reported as a scanner error
- ImpalaODBC: Cannot get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column)
- Impala incorrectly handles text data when the newline character \n\r is split between different HDFS blocks
- Duplicated column in inline view causes dropping null slots during scan
- A failed CTAS does not drop the table if the insert fails
- Casting scenarios with invalid/inconsistent results
- Server-to-server SSL and Kerberos do not work together
- Queries may hang on server-to-server exchange errors
- Catalogd may crash when loading metadata for tables with many partitions, many columns and with incremental stats
- Fix decompressor to allow parsing gzips with multiple streams
- Less than 100% progress on completed simple SELECT queries
- Slow DDL statements for tables with large number of partitions
- Support individual memory allocations larger than 1 GB
- Cannot update stats manually via alter table after upgrading to CDH 5.2
- ORDER BY rand() does not work
- Impala BE cannot parse Avro schema that contains a trailing semi-colon
- Process mem limit does not account for JVM memory usage
- Impala parser issue when using fully qualified table names that start with a number
- CatalogServer should not require HBase to be up to reload its metadata
- Kerberos tickets must be renewable
- Avro Scanner fails to parse some schemas
- Configuration needed for Flume to be compatible with Impala
- Impala does not support running on clusters with federated namespaces
- Deviation from Hive behavior: Out-of-range float/double values are returned as maximum allowed value of type (Hive returns NULL)
- Deviation from Hive behavior: Impala does not do implicit casts between string and numeric and Boolean types
- If Hue and Impala are installed on the same host, and if you configure Hue Beeswax in CDH 4.1 to execute Impala queries, Beeswax cannot list Hive tables and shows an error on Beeswax startup
- Impala should tolerate bad locale settings
- Log Level 3 Not Recommended for Impala
Wrong assignment of having clause predicate across outer join
In an OUTER JOIN query with a HAVING clause, the comparison from the HAVING clause might be applied at the wrong stage of query processing, leading to incorrect results.
Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate
A NOT IN operator with a subquery that calls an aggregate function, such as NOT IN (SELECT SUM(...)), could return incorrect results.
Incorrect results with basic predicate on CHAR typed column
When comparing a CHAR column value to a string literal, the literal value is not blank-padded, so the comparison might fail when it should match.
Workaround: Use the RPAD() function to blank-pad literals compared with CHAR columns to the expected length.
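For example, assuming a hypothetical table t1 with a column c declared as CHAR(5), the literal can be padded to the declared length so the comparison matches the blank-padded stored value:

```sql
-- t1 and its CHAR(5) column c are hypothetical names for illustration.
-- RPAD() pads 'abc' with spaces to 5 characters, matching the CHAR semantics.
SELECT * FROM t1 WHERE c = RPAD('abc', 5, ' ');
```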
Incorrect assignment of predicates through an outer join in an inline view
A query involving an OUTER JOIN clause, where one of the table references is an inline view, might apply predicates from the ON clause incorrectly.
Incorrect assignment of On-clause predicate inside inline view with an outer join
A query might return incorrect results due to wrong predicate assignment in the following scenario:
- An inline view contains an outer join.
- That inline view is joined with another table in the enclosing query block.
- That join has an On-clause containing a predicate that only references columns originating from the outer-joined tables inside the inline view.
Crash: impala::Coordinator::ValidateCollectionSlots
A query could encounter a serious error if it includes multiple nested levels of INNER JOIN clauses involving subqueries.
Fix issues with the legacy join and agg nodes using --enable_partitioned_hash_join=false and --enable_partitioned_aggregation=false
Workaround: Transition away from the legacy join and aggregation nodes by leaving the --enable_partitioned_hash_join and --enable_partitioned_aggregation settings at their default value of true.
Invalid Boolean value not reported as a scanner error
In some cases, an invalid BOOLEAN value read from a table does not produce a warning message about the bad value. The result is still NULL as expected. This is not a query correctness issue, but could lead to overlooking the presence of invalid data.
ImpalaODBC: Cannot get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column)
If the ODBC SQLGetData is called on a series of columns, the function calls must follow the same order as the columns. For example, if data is fetched from column 2 and then column 1, the SQLGetData call for column 1 returns NULL.
Workaround: Fetch columns in the same order that they are defined in the table.
Impala incorrectly handles text data when the newline character \n\r is split between different HDFS blocks
If a carriage return / newline pair of characters in a text table is split between HDFS data blocks, Impala incorrectly processes the row following the \n\r pair twice.
Workaround: Use the Parquet format for large volumes of data where practical.
Duplicated column in inline view causes dropping null slots during scan
If the same column is queried twice within a view, NULL values for that column are omitted. For example, the result of COUNT(*) on the view could be less than expected.
Workaround: Avoid selecting the same column twice within an inline view.
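As an illustration (table and column names are hypothetical), the duplicated reference inside the inline view is what triggers the dropped NULL slots:

```sql
-- Problematic: column c is referenced twice inside the inline view,
-- so rows where c IS NULL can be dropped and the count comes up short.
SELECT COUNT(*) FROM (SELECT c, c FROM t1) v;

-- Workaround: reference the column only once inside the inline view.
SELECT COUNT(*) FROM (SELECT c FROM t1) v;
```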
A failed CTAS does not drop the table if the insert fails
If a CREATE TABLE AS SELECT operation successfully creates the target table, but an error occurs while querying the source table or copying the data, the new table is left behind instead of being dropped.
Workaround: Drop the new table manually after a failed CREATE TABLE AS SELECT.
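For example, with hypothetical table names, the cleanup after a failed CTAS looks like:

```sql
-- If a CTAS such as this fails while reading the source or copying data:
CREATE TABLE new_table AS SELECT * FROM source_table;
-- the (possibly empty) target table is left behind; drop it manually:
DROP TABLE IF EXISTS new_table;
```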
Casting scenarios with invalid/inconsistent results
Using a CAST() function to convert large literal values to smaller types, or to convert special values such as NaN or Inf, produces values inconsistent with other database systems. This could lead to unexpected results from queries.
Server-to-server SSL and Kerberos do not work together
If SSL is enabled between internal Impala components (with ssl_client_ca_certificate), and Kerberos authentication is used between servers, the cluster fails to start.
Severity: Medium; the ssl_client_ca_certificate setting is a new feature, so the issue does not affect existing cluster configurations.
Workaround: Do not use the new ssl_client_ca_certificate setting on Kerberos-enabled clusters until this issue is resolved.
Queries may hang on server-to-server exchange errors
The DataStreamSender::Channel::CloseInternal() does not close the channel on an error. This causes the node on the other side of the channel to wait indefinitely, causing a hang.
Severity: Low. This issue does not occur frequently.
Catalogd may crash when loading metadata for tables with many partitions, many columns and with incremental stats
Incremental stats use about 400 bytes per partition for each column. For example, for a table with 20K partitions and 100 columns, the memory overhead from incremental statistics is about 800 MB. When serialized for transmission across the network, this metadata exceeds the 2 GB Java array size limit and leads to a catalogd crash.
Severity: Low. This does not occur frequently.
Workaround: If feasible, compute full stats periodically and avoid computing incremental stats for that table. The scalability of incremental stats computation is a continuing work item.
Fix decompressor to allow parsing gzips with multiple streams
Currently, Impala can only read gzipped files containing a single stream. If a gzipped file contains multiple concatenated streams, the Impala query only processes the data from the first stream.
Workaround: Use a different gzip tool to compress the file to a single stream file.
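The recompression can be done with standard gzip tools; a minimal sketch (file names are placeholders):

```shell
# Impala only reads the first stream of a multi-stream gzip file.
# Decompress every stream, then recompress the data as a single stream.
printf 'line1\n' | gzip > part1.gz
printf 'line2\n' | gzip > part2.gz
cat part1.gz part2.gz > multi.gz        # a multi-stream (concatenated) gzip file
gunzip -c multi.gz | gzip > single.gz   # gunzip reads all streams; gzip writes one
```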
Less than 100% progress on completed simple SELECT queries
Simple SELECT queries show less than 100% progress even though they have already completed.
Slow DDL statements for tables with large number of partitions
DDL statements for tables with a large number of partitions might be slow.
Workaround: Run the DDL statement in Hive if the slowness is an issue.
Support individual memory allocations larger than 1 GB
The largest single block of memory that Impala can allocate during a query is 1 GiB. Therefore, a query could fail or Impala could crash if a compressed text file resulted in more than 1 GiB of data in uncompressed form, or if a string function such as group_concat() returned a value greater than 1 GiB.
Cannot update stats manually via alter table after upgrading to CDH 5.2
Workaround: On CDH 5.2, when adjusting table statistics manually by setting the numRows, you must also enable the Boolean property STATS_GENERATED_VIA_STATS_TASK. For example, use a statement like the following to set both properties with a single ALTER TABLE statement:
ALTER TABLE table_name SET TBLPROPERTIES('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK' = 'true');
Resolution: The underlying cause is the issue HIVE-8648 that affects the metastore in Hive 0.13. The workaround is only needed until the fix for this issue is incorporated into a CDH release.
ORDER BY rand() does not work
Because the value for rand() is computed early in a query, using an ORDER BY expression involving a call to rand() does not actually randomize the results.
Impala BE cannot parse Avro schema that contains a trailing semi-colon
If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is queried.
Workaround: Remove the trailing semicolon from the Avro schema.
Process mem limit does not account for JVM memory usage
Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the impalad daemon.
Workaround: To monitor overall memory usage, use the top command, or add the memory figures in the Impala web UI /memz tab to JVM memory usage shown on the /metrics tab.
Impala parser issue when using fully qualified table names that start with a number
A fully qualified table name starting with a number could cause a parsing error. In a name such as db.571_market, the decimal point followed by digits is interpreted as a floating-point number.
Workaround: Surround each part of the fully qualified name with backticks (``).
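Using the db.571_market example from above, the quoted form parses correctly:

```sql
-- Without backticks, the ".571" portion is misread as a floating-point number.
SELECT * FROM `db`.`571_market`;
```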
CatalogServer should not require HBase to be up to reload its metadata
If HBase is unavailable during Impala startup or after an INVALIDATE METADATA statement, the catalogd daemon could go into an error loop, making Impala unresponsive.
Workaround: For systems not managed by Cloudera Manager, add the following settings to /etc/impala/conf/hbase-site.xml:
<property>
  <name>hbase.client.retries.number</name>
  <value>3</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>3000</value>
</property>
Currently, Cloudera Manager does not have an Impala-only override for HBase settings, so any HBase configuration change you make through Cloudera Manager would take effect for all HBase applications. Therefore, this change is not recommended on systems managed by Cloudera Manager.
Kerberos tickets must be renewable
In a Kerberos environment, the impalad daemon might not start if Kerberos tickets are not renewable.
Workaround: Configure your KDC to allow tickets to be renewed, and configure krb5.conf to request renewable tickets.
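As a sketch of the client side of this configuration, renewable tickets can be requested in krb5.conf; the 7-day lifetime below is an example value, and the KDC must also permit renewal (for example, through the principal's maximum renewable lifetime):

```ini
# Example value only; the KDC's maximum renewable lifetime must also allow it.
[libdefaults]
  renew_lifetime = 7d
```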
Avro Scanner fails to parse some schemas
Querying certain Avro tables could cause a crash or return no rows, even though Impala could DESCRIBE the table.
Workaround: Swap the order of the fields in the schema specification. For example, ["null", "string"] instead of ["string", "null"].
Resolution: Not allowing this syntax agrees with the Avro specification, so it may still cause an error even when the crashing issue is resolved.
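As an illustration (record and field names are hypothetical), a nullable string field should list "null" first in the union, matching the default value:

```json
{
  "type": "record",
  "name": "example_record",
  "fields": [
    {"name": "s", "type": ["null", "string"], "default": null}
  ]
}
```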
Configuration needed for Flume to be compatible with Impala
For compatibility with Impala, the value for the Flume HDFS sink hdfs.writeFormat must be set to Text, instead of its default value of Writable. The hdfs.writeFormat setting must be changed to Text before creating data files with Flume; otherwise, those files cannot be read by either Impala or Hive.
Resolution: This information has been requested to be added to the upstream Flume documentation.
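A sketch of the relevant Flume sink properties, using placeholder agent and sink names (a1, k1); setting hdfs.fileType to DataStream (rather than its SequenceFile default) is also generally needed to produce plain text files:

```properties
# a1 and k1 are placeholder names; adjust to your agent configuration.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
```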
Impala does not support running on clusters with federated namespaces
Impala does not support running on clusters with federated namespaces. The impalad process will not start on a node running such a filesystem, which is based on the org.apache.hadoop.fs.viewfs.ViewFs class.
Anticipated Resolution: Limitation
Workaround: Use standard HDFS on all Impala nodes.
Deviation from Hive behavior: Out-of-range float/double values are returned as maximum allowed value of type (Hive returns NULL)
Impala behavior differs from Hive with respect to out-of-range float/double values. Out-of-range values are returned as the maximum allowed value of the type (Hive returns NULL).
Deviation from Hive behavior: Impala does not do implicit casts between string and numeric and Boolean types
Anticipated Resolution: None
Workaround: Use explicit casts.
If Hue and Impala are installed on the same host, and if you configure Hue Beeswax in CDH 4.1 to execute Impala queries, Beeswax cannot list Hive tables and shows an error on Beeswax startup
Hue requires Beeswaxd to be running to list the Hive tables. Because of a port conflict bug in Hue in CDH 4.1, when Hue and Impala are installed on the same host, an error page is displayed when you start the Beeswax application, and when you open the Tables page in Beeswax.
Anticipated Resolution: Fixed in an upcoming CDH4 release
Workarounds: Choose only one of the following workarounds:
- Install Hue and Impala on different hosts.
- Upgrade to CDH4.1.2 and add the following property in the beeswax section of the /etc/hue/hue.ini configuration file:
  beeswax_meta_server_only=9004
- If you are using CDH4.1.1 and you want to install Hue and Impala on the same host, change the code in
  Replace line 66:
  With this line:
  Beeswaxd will then use port 8004.
Note: If you used Cloudera Manager to install Impala, refer to the Cloudera Manager release notes for information about using an equivalent workaround by specifying the beeswax_meta_server_only=9004 configuration value in the options field for Hue. In Cloudera Manager 4, these fields are labeled Safety Valve; in Cloudera Manager 5, they are called Advanced Configuration Snippet.
Impala should tolerate bad locale settings
If the LC_* environment variables specify an unsupported locale, Impala does not start.
Workaround: Add LC_ALL="C" to the environment settings for both the Impala daemon and the Statestore daemon. See Modifying Impala Startup Options for details about modifying these environment settings.
Resolution: Fixing this issue would require an upgrade to Boost 1.47 in the Impala distribution.
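For example, assuming the daemons read their environment from a defaults file such as /etc/default/impala (the exact path varies by installation), the setting is a single export line:

```shell
# Force a supported locale for the impalad and statestored daemons.
# The defaults-file location is an assumption; adjust for your installation.
export LC_ALL="C"
```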
Log Level 3 Not Recommended for Impala
The extensive logging produced by log level 3 can cause serious performance overhead and capacity issues.
Workaround: Reduce the log level to its default value of 1, that is, GLOG_v=1. See Setting Logging Levels for details about the effects of setting different logging levels.