Apache Impala (incubating) Known Issues

The following sections describe known issues and workarounds in Impala, as of the current production release. This page summarizes the most serious or frequently encountered issues in the current release, to help you make planning decisions about installing and upgrading. Any workarounds are listed here. The bug links take you to the Impala issues site, where you can see the diagnosis and whether a fix is in the pipeline.

For issues fixed in various Impala releases, see Fixed Issues in Apache Impala (incubating).

Impala Known Issues: Startup

These issues can prevent one or more Impala-related daemons from starting properly.

Impala requires FQDN from hostname command on kerberized clusters

The method Impala uses to retrieve the host name while constructing the Kerberos principal is the gethostname() system call. This function might not always return the fully qualified domain name, depending on the network configuration. If the daemons cannot determine the FQDN, Impala does not start on a kerberized cluster.

Bug: IMPALA-4978

Severity: High

Resolution: Depends on the resolution of IMPALA-4978.

Workaround: Test if a host is affected by checking whether the output of the hostname command includes the FQDN. On hosts where hostname returns only the short name, pass the command-line flag --hostname=fully_qualified_domain_name in the startup options of all Impala-related daemons.
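The check described in the workaround can be scripted. The following is a minimal sketch, not part of Impala: the helper names looks_fully_qualified and report_host are hypothetical, and the "contains a dot" test is only a heuristic for whether a name is fully qualified.

```python
import socket


def looks_fully_qualified(name: str) -> bool:
    """Heuristic (an assumption): a name with an interior dot is an FQDN."""
    stripped = name.strip().rstrip(".")
    return "." in stripped and not stripped.startswith(".")


def report_host() -> str:
    # socket.gethostname() usually reports the same name as the hostname command.
    name = socket.gethostname()
    if looks_fully_qualified(name):
        return f"{name}: looks fully qualified"
    return f"{name}: short name only; consider passing --hostname to the daemons"


if __name__ == "__main__":
    print(report_host())
```

Run the script on each host in the cluster. Any host reported as "short name only" is a candidate for the --hostname flag.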

Impala Known Issues: Crashes and Hangs

These issues can cause Impala to quit or become unresponsive.

Altering Kudu table schema outside of Impala may result in crash on read

Creating a table in Impala, changing the column schema outside of Impala, and then reading again in Impala may result in a crash. Neither Impala nor the Kudu client validates the schema immediately before reading, so Impala may attempt to dereference pointers that aren't there. This happens if a string column is dropped and then a new, non-string column is added with the old string column's name.

Bug: IMPALA-4828

Severity: High

Workaround: Run the statement REFRESH table_name whenever the table structure, such as the number, names, or data types of columns, is modified outside of Impala using the Kudu API.

Resolution: Fixed in CDH 5.12 / Impala 2.9 and higher.

Queries that take a long time to plan can cause webserver to block other queries

Trying to get the details of a query through the debug web page while the query is planning will block new queries that had not started when the web page was requested. The web UI becomes unresponsive until the planning phase is finished.

Bug: IMPALA-1972

Severity: High

Resolution: Fixed in CDH 5.12 / Impala 2.9 and higher.

Linking IR UDF module to main module crashes Impala

A UDF compiled as an LLVM module (.ll) could cause a crash when executed.

Bug: IMPALA-4595

Severity: High

Resolution: Fixed in CDH 5.10 / Impala 2.8 and higher.

Workaround: Compile the external UDFs to a .so library instead of a .ll IR module.

Setting BATCH_SIZE query option too large can cause a crash

Using a value in the millions for the BATCH_SIZE query option, together with wide rows or large string values in columns, could cause a memory allocation of more than 2 GB resulting in a crash.

Bug: IMPALA-3069

Severity: High

Resolution: Fixed in CDH 5.9 / Impala 2.7 and higher.
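To get a rough sense of when a BATCH_SIZE setting becomes risky, you can estimate the memory needed for one batch. The sketch below assumes that the allocation is approximately the batch size multiplied by the average row width. That is a simplification for illustration, not a description of how Impala allocates memory, and the function names are hypothetical.

```python
TWO_GB = 2 * 1024 ** 3  # 2 GB threshold mentioned in the issue description


def estimated_batch_bytes(batch_size: int, avg_row_bytes: int) -> int:
    """Rough estimate (an assumption): rows per batch times bytes per row."""
    return batch_size * avg_row_bytes


def exceeds_two_gb(batch_size: int, avg_row_bytes: int) -> bool:
    return estimated_batch_bytes(batch_size, avg_row_bytes) > TWO_GB


# Illustrative values only: 2 million rows of about 2 KB each.
print(exceeds_two_gb(2_000_000, 2048))  # True
# A batch of 1024 rows of the same width stays far below the threshold.
print(exceeds_two_gb(1024, 2048))       # False
```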

Impala should not crash for invalid avro serialized data

Malformed Avro data, such as out-of-bounds integers or values in the wrong format, could cause a crash when queried.

Bug: IMPALA-3441

Severity: High

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.2 / Impala 2.6.2.

Queries may hang on server-to-server exchange errors

The DataStreamSender::Channel::CloseInternal() function does not close the channel on an error. As a result, the node on the other side of the channel waits indefinitely, and the query hangs.

Bug: IMPALA-2592

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Impalad is crashing if udf jar is not available in hdfs location for first time

If the JAR file corresponding to a Java UDF is removed from HDFS after the Impala CREATE FUNCTION statement is issued, the impalad daemon crashes.

Bug: IMPALA-2365

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Impala Known Issues: Performance

These issues involve the performance of operations such as queries or DDL statements.

Slow queries for Parquet tables with convert_legacy_hive_parquet_utc_timestamps=true

The configuration setting convert_legacy_hive_parquet_utc_timestamps=true uses an underlying function that can be a bottleneck on high volume, highly concurrent queries due to the use of a global lock while loading time zone information. This bottleneck can cause slowness when querying Parquet tables, up to 30x for scan-heavy queries. The amount of slowdown depends on factors such as the number of cores and number of threads involved in the query.

Bug: IMPALA-3316

Severity: High

Workaround: If the TIMESTAMP values stored in the table represent dates only, with no time portion, consider storing them as strings in yyyy-MM-dd format. Impala implicitly converts such string values to TIMESTAMP in calls to date/time functions.

Slow DDL statements for tables with large number of partitions

DDL statements for tables with a large number of partitions might be slow.

Bug: IMPALA-1480

Workaround: Run the DDL statement in Hive if the slowness is an issue.

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Interaction of File Handle Cache with HDFS Appends and Short-Circuit Reads

If a data file used by Impala is being continuously appended or overwritten in place by an HDFS mechanism, such as hdfs dfs -appendToFile, interaction with the file handle caching feature in CDH 5.13 / Impala 2.10 and higher could cause short-circuit reads to sometimes be disabled on some DataNodes. When a mismatch is detected between the cached file handle and a data block that was rewritten because of an append, short-circuit reads are turned off on the affected host for a 10-minute period.

The possibility of encountering such an issue is the reason why the file handle caching feature is currently turned off by default. See Scalability Considerations for NameNode Traffic with File Handle Caching for information about this feature and how to enable it.

Bug: HDFS-12528

Severity: High

Workaround: Verify whether your ETL process is susceptible to this issue before enabling the file handle caching feature. You can set the impalad configuration option unused_file_handle_timeout_sec to a time period that is shorter than the HDFS setting dfs.client.read.shortcircuit.streams.cache.expiry.ms. (Keep in mind that the HDFS setting is in milliseconds while the Impala setting is in seconds.)
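Because the two settings use different units, it is easy to compare them incorrectly. The following sketch converts the Impala value to milliseconds before comparing. The function name is hypothetical and the sample values are illustrative only; check the actual values in your own configuration.

```python
def handle_timeout_is_shorter(unused_file_handle_timeout_sec: int,
                              shortcircuit_expiry_ms: int) -> bool:
    """True if the Impala timeout (seconds) is shorter than the HDFS expiry (ms)."""
    return unused_file_handle_timeout_sec * 1000 < shortcircuit_expiry_ms


# Illustrative values: 270 seconds against 300000 ms (5 minutes).
print(handle_timeout_is_shorter(270, 300_000))   # True
# 6000 seconds is much longer than 300000 ms, even though 6000 < 300000.
print(handle_timeout_is_shorter(6000, 300_000))  # False
```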

Impala Known Issues: Usability

These issues affect the convenience of interacting directly with Impala, typically through the Impala shell or Hue.

Impala shell tarball is not usable on systems with setuptools versions where '0.7' is a substring of the full version string

For example, this issue could occur on a system using setuptools version 20.7.0.

Bug: IMPALA-4570

Severity: High

Resolution: Fixed in CDH 5.10 / Impala 2.8 and higher.

Workaround: Change to a setuptools version that does not have 0.7 as a substring.

Unexpected privileges in show output

Due to a timing condition in updating cached policy data from Sentry, the SHOW statements for Sentry roles could sometimes display out-of-date role settings. Because Impala rechecks authorization for each SQL statement, this discrepancy does not represent a security issue for other statements.

Bug: IMPALA-3133

Severity: High

Resolution: Fixed in CDH 5.8.0 / Impala 2.6.0 and CDH 5.7.1 / Impala 2.5.1. Fixes have been issued for some but not all CDH / Impala releases. Check the JIRA for details of fix releases.

Less than 100% progress on completed simple SELECT queries

Simple SELECT queries show less than 100% progress even though they are already completed.

Bug: IMPALA-1776

Unexpected column overflow behavior with INT datatypes

Impala does not return NULL for column overflows. This lets customers distinguish between NULL data and overflow conditions, as they would with traditional database systems. Instead, Impala returns the largest or smallest value in the range for the type. For example, valid values for a tinyint range from -128 to 127. In Impala, a tinyint with a value of -200 returns -128 rather than NULL, and a tinyint with a value of 200 returns 127.

Bug: IMPALA-3123
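The overflow behavior described above amounts to clamping the value to the range of the type. The following sketch models that behavior for tinyint only. It is an illustration of the documented results, not Impala code, and the function name is hypothetical.

```python
TINYINT_MIN, TINYINT_MAX = -128, 127


def clamp_to_tinyint(value: int) -> int:
    """Return the nearest value inside the tinyint range."""
    return max(TINYINT_MIN, min(TINYINT_MAX, value))


print(clamp_to_tinyint(-200))  # -128
print(clamp_to_tinyint(200))   # 127
print(clamp_to_tinyint(42))    # 42
```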

Impala Known Issues: JDBC and ODBC Drivers

These issues affect applications that use the JDBC or ODBC APIs, such as business intelligence tools or custom-written applications in languages such as Java or C++.

ImpalaODBC: Can not get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column)

If the ODBC SQLGetData is called on a series of columns, the function calls must follow the same order as the columns. For example, if data is fetched from column 2 then column 1, the SQLGetData call for column 1 returns NULL.

Bug: IMPALA-1792

Workaround: Fetch columns in the same order they are defined in the table.

Impala Known Issues: Security

These issues relate to security features, such as Kerberos authentication, Sentry authorization, encryption, auditing, and redaction.

Kerberos tickets must be renewable

In a Kerberos environment, the impalad daemon might not start if Kerberos tickets are not renewable.

Workaround: Configure your KDC to allow tickets to be renewed, and configure krb5.conf to request renewable tickets.

Impala Known Issues: Resources

These issues involve memory or disk usage, including out-of-memory conditions, the spill-to-disk feature, and resource management features.

Handling Large Rows During Upgrade to CDH 5.13 / Impala 2.10 or Higher

After an upgrade to CDH 5.13 / Impala 2.10 or higher, users who process very large column values (long strings), or have increased the --read_size configuration setting from its default of 8 MB, might encounter capacity errors for some queries that previously worked.

Bug: IMPALA-6028

Severity: High

Resolution: After the upgrade, follow the instructions in CDH 5.13 / Impala 2.10 to check if your queries are affected by these changes and to modify your configuration settings if so.

Configuration to prevent crashes caused by thread resource limits

Impala could encounter a serious error due to resource usage under very high concurrency. The error message is similar to:

F0629 08:20:02.956413 29088 llvm-codegen.cc:111] LLVM hit fatal error: Unable to allocate section memory!
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::thread_resource_error> >'

Bug: IMPALA-5605

Severity: High

Workaround: To prevent such errors, configure each host running an impalad daemon with the following settings:

echo 2000000 > /proc/sys/kernel/threads-max
echo 2000000 > /proc/sys/kernel/pid_max
echo 8000000 > /proc/sys/vm/max_map_count

Add the following lines in /etc/security/limits.conf:

impala soft nproc 262144
impala hard nproc 262144

Memory usage when compact_catalog_topic flag enabled

The efficiency improvement from IMPALA-4029 can cause an increase in size of the updates to Impala catalog metadata that are broadcast to the impalad daemons by the statestored daemon. The increase in catalog update topic size results in higher CPU and network utilization. By default, the increase in topic size is about 5-7%. If the compact_catalog_topic flag is used, the size increase is more substantial, with a topic size approximately twice as large as in previous versions.

Bug: IMPALA-5500

Severity: Medium

Workaround: Consider leaving the compact_catalog_topic configuration setting at its default value of false until this issue is resolved.

Resolution: A fix is in the pipeline. Check the status of IMPALA-5500 for the release where the fix is available.

Kerberos initialization errors due to high memory usage

On a kerberized cluster with high memory utilization, kinit commands executed after every 'kerberos_reinit_interval' may cause out-of-memory errors, because executing the command involves a fork of the Impala process. The error looks similar to the following:

Failed to obtain Kerberos ticket for principal: principal_details
Failed to execute shell cmd: 'kinit -k -t keytab_details',
error was: Error(12): Cannot allocate memory

Bug: IMPALA-2294

Severity: High

Workaround:

The following command changes the vm.overcommit_memory setting immediately on a running host. However, this setting is reset when the host is restarted.
echo 1 > /proc/sys/vm/overcommit_memory

To change the setting in a persistent way, add the following line to the /etc/sysctl.conf file:
vm.overcommit_memory=1

Then run sysctl -p. No reboot is needed.

DROP TABLE PURGE on S3A table may not delete externally written files

A DROP TABLE PURGE statement against an S3 table could leave the data files behind, if the table directory and the data files were created with a combination of hadoop fs and aws s3 commands.

Bug: IMPALA-3558

Severity: High

Resolution: The underlying issue with the S3A connector depends on the resolution of HADOOP-13230.

Impala catalogd heap issues when upgrading to 5.7

The default heap size for Impala catalogd has changed in CDH 5.7 / Impala 2.5 and higher:

  • Before 5.7, by default catalogd was using the JVM's default heap size, which is the smaller of 1/4th of the physical memory or 32 GB.

  • Starting with CDH 5.7.0, the default catalogd heap size is 4 GB.

For example, on a host with 128 GB of physical memory, the catalogd heap decreases from 32 GB to 4 GB. This can result in out-of-memory errors in catalogd, leading to query failures.
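The size of the change for a given host follows from the two rules above. The sketch below applies them as stated on this page. The function and constant names are hypothetical.

```python
DEFAULT_HEAP_GB_FROM_57 = 4  # default catalogd heap starting with CDH 5.7.0


def default_heap_gb_before_57(physical_memory_gb: float) -> float:
    """Smaller of one quarter of physical memory or 32 GB, as described above."""
    return min(physical_memory_gb / 4, 32)


for memory_gb in (16, 64, 128, 256):
    before = default_heap_gb_before_57(memory_gb)
    print(f"{memory_gb} GB host: {before:g} GB before, "
          f"{DEFAULT_HEAP_GB_FROM_57} GB after upgrade")
```

Under these rules, any host with more than 16 GB of physical memory ends up with a smaller default catalogd heap after the upgrade.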

Severity: High

Workaround: Increase the catalogd memory limit as follows.

For schemas with large numbers of tables, partitions, and data files, the catalogd daemon might encounter an out-of-memory error. To increase the memory limit for the catalogd daemon:
  1. Check current memory usage for the catalogd daemon by running the following commands on the host where that daemon runs on your cluster:

      jcmd catalogd_pid VM.flags
      jmap -heap catalogd_pid
      
  2. Decide on a large enough value for the catalogd heap. You express it as an environment variable value as follows:

      JAVA_TOOL_OPTIONS="-Xmx8g"
      
  3. On systems managed by Cloudera Manager, include this value in the configuration field Java Heap Size of Catalog Server in Bytes (Cloudera Manager 5.7 and higher), or Impala Catalog Server Environment Advanced Configuration Snippet (Safety Valve) (prior to Cloudera Manager 5.7). Then restart the Impala service.

  4. On systems not managed by Cloudera Manager, put this environment variable setting into the startup script for the catalogd daemon, then restart the catalogd daemon.

  5. Use the same jcmd and jmap commands as earlier to verify that the new settings are in effect.

Breakpad minidumps can be very large when the thread count is high

The size of the breakpad minidump files grows linearly with the number of threads. By default, each thread adds 8 KB to the minidump size. Minidump files could consume significant disk space when the daemons have a high number of threads.

Bug: IMPALA-3509

Severity: High

Workaround: Add --minidump_size_limit_hint_kb=size to set a soft upper limit on the size of each minidump file. If the minidump file would exceed that limit, Impala reduces the amount of information for each thread from 8 KB to 2 KB. (Full thread information is captured for the first 20 threads, then 2 KB per thread after that.) The minidump file can still grow larger than the "hinted" size. For example, if you have 10,000 threads, the minidump file can be more than 20 MB.
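The figures in the workaround can be checked with a short calculation. The sketch below counts only the per-thread portion of the minidump (8 KB or 2 KB per thread, with full information for the first 20 threads) and ignores any other content in the file. The function name is hypothetical.

```python
FULL_KB_PER_THREAD = 8
REDUCED_KB_PER_THREAD = 2
FULL_DETAIL_THREADS = 20


def estimate_thread_data_kb(thread_count: int, reduced: bool) -> int:
    """Estimate the per-thread data in a minidump, in KB."""
    if not reduced:
        return thread_count * FULL_KB_PER_THREAD
    full = min(thread_count, FULL_DETAIL_THREADS)
    rest = thread_count - full
    return full * FULL_KB_PER_THREAD + rest * REDUCED_KB_PER_THREAD


print(estimate_thread_data_kb(10_000, reduced=False))  # 80000
print(estimate_thread_data_kb(10_000, reduced=True))   # 20120
```

With 10,000 threads, the reduced form is about 20,120 KB, which is consistent with the statement that the file can be more than 20 MB.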

Parquet scanner memory increase after IMPALA-2736

The initial release of CDH 5.8 / Impala 2.6 sometimes has a higher peak memory usage than in previous releases while reading Parquet files.

CDH 5.8 / Impala 2.6 addresses the issue IMPALA-2736, which improves the efficiency of Parquet scans by up to 2x. The faster scans may result in a higher peak memory consumption compared to earlier versions of Impala due to the new column-wise row materialization strategy. You are likely to experience higher memory consumption in any of the following scenarios:
  • Very wide rows due to projecting many columns in a scan.

  • Very large rows due to big column values, for example, long strings or nested collections with many items.

  • Producer/consumer speed imbalances, leading to more rows being buffered between a scan (producer) and downstream (consumer) plan nodes.

Bug: IMPALA-3662

Severity: High

Workaround: The following query options might help to reduce memory consumption in the Parquet scanner:
  • Reduce the number of scanner threads, for example: set num_scanner_threads=30
  • Reduce the batch size, for example: set batch_size=512
  • Increase the memory limit, for example: set mem_limit=64g

Resolution: Fixed in CDH 5.10 / Impala 2.8.

Process mem limit does not account for the JVM's memory usage

Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the impalad daemon.

Bug: IMPALA-691

Workaround: To monitor overall memory usage, use the top command, or add the memory figures in the Impala web UI /memz tab to JVM memory usage shown on the /metrics tab.

Fix issues with the legacy join and agg nodes using --enable_partitioned_hash_join=false and --enable_partitioned_aggregation=false

Bug: IMPALA-2375

Workaround: Transition away from the "old-style" join and aggregation mechanism if practical.

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Impala Known Issues: Correctness

These issues can cause incorrect or unexpected results from queries. They typically only arise in very specific circumstances.

DECIMAL AVG() can return incorrect results in Impala

When both an AVG() over a DECIMAL value and a DISTINCT aggregate (for example, COUNT(DISTINCT ...)) appear in the same SELECT statement, the DECIMAL AVG() can return incorrect results.

Bug: IMPALA-5251 (now resolved)

Severity: High

Releases affected: CDH 5.11.0 only

Immediate action required:

  • If you use Impala and are not yet on CDH 5.11.0, wait for CDH 5.11.1 to upgrade.

  • If you use Impala and are in process of upgrading to CDH 5.11.0, or must upgrade before CDH 5.11.1 becomes available, contact Cloudera Support to get a patch to avoid this issue.

  • If you are on CDH 5.11.0 and use Impala, contact Cloudera Support to get a patch to avoid this issue.

  • If you do not use Impala, you are not affected.

Incorrect results with a correlated scalar subquery inside a NULL-checking conditional function

Impala may generate an incorrect plan, and therefore incorrect results, for queries that have a correlated scalar subquery as a parameter to a NULL-checking conditional function such as isnull().

Bug: IMPALA-4373

Severity: High

Resolution:

Workaround:

Cannot execute IR UDF when single node execution is enabled

A UDF compiled into an LLVM IR bitcode module (.bc) would have undefined effects when native code generation was turned off, for example when Impala applied the single-node optimization for small queries.

Bug: IMPALA-4432

Severity: High

Resolution: In CDH 5.10 / Impala 2.8 and higher, Impala returns an error if the UDF cannot run because of this issue.

Workaround: Turn native code generation back on with the query option setting DISABLE_CODEGEN=0.

ABS(n) where n is the lowest bound for the int types returns negative values

If the abs() function evaluates a number that is right at the lower bound for an integer data type, the positive result cannot be represented in the same type, and the result is returned as a negative number. For example, abs(-128) returns -128 because the argument is interpreted as a TINYINT and the return value is also a TINYINT.

Bug: IMPALA-4513

Severity: High

Workaround: Cast the integer value to a larger type. For example, rewrite abs(tinyint_col) as abs(cast(tinyint_col as smallint)).
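The wraparound can be reproduced outside Impala with fixed-width integers. The sketch below uses the Python ctypes module to store the result of abs() in 8-bit and 16-bit signed integers. It models the arithmetic only and is not Impala code; the function names are hypothetical.

```python
import ctypes


def abs_as_tinyint(value: int) -> int:
    """Store abs(value) in an 8-bit signed integer; out-of-range results wrap."""
    return ctypes.c_int8(abs(value)).value


def abs_as_smallint(value: int) -> int:
    """Store abs(value) in a 16-bit signed integer, as the cast workaround does."""
    return ctypes.c_int16(abs(value)).value


print(abs_as_tinyint(-128))   # -128, because 128 does not fit in 8 signed bits
print(abs_as_smallint(-128))  # 128
print(abs_as_tinyint(-127))   # 127
```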

Java udf expression returning string in group by can give incorrect results.

If the GROUP BY clause included a call to a Java UDF that returned a string value, the UDF could return an incorrect result.

Bug: IMPALA-4266

Severity: High

Resolution: Fixed in CDH 5.10 / Impala 2.8 and higher.

Workaround: Rewrite the expression to concatenate the result of the Java UDF with an empty string. For example, rewrite my_hive_udf() as concat(my_hive_udf(), '').

Incorrect assignment of NULL checking predicate through an outer join of a nested collection.

A query could return wrong results (too many or too few NULL values) if it referenced an outer-joined nested collection and also contained a null-checking predicate (IS NULL, IS NOT NULL, or the <=> operator) in the WHERE clause.

Bug: IMPALA-3084

Severity: High

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0.

Incorrect result due to constant evaluation in query with outer join

An OUTER JOIN query could omit some expected result rows due to a constant such as FALSE in another join clause. For example:

explain SELECT 1 FROM alltypestiny a1
  INNER JOIN alltypesagg a2 ON a1.smallint_col = a2.year AND false
  RIGHT JOIN alltypes a3 ON a1.year = a1.bigint_col;
+---------------------------------------------------------+
| Explain String                                          |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.00KB VCores=1 |
|                                                         |
| 00:EMPTYSET                                             |
+---------------------------------------------------------+

Bug: IMPALA-3094

Severity: High

Resolution:

Workaround:

Incorrect assignment of an inner join On-clause predicate through an outer join.

Impala may return incorrect results for queries that have the following properties:

  • There is an INNER JOIN following a series of OUTER JOINs.

  • The INNER JOIN has an On-clause with a predicate that references at least two tables that are on the nullable side of the preceding OUTER JOINs.

The following query demonstrates the issue:

select 1 from functional.alltypes a left outer join
  functional.alltypes b on a.id = b.id left outer join
  functional.alltypes c on b.id = c.id right outer join
  functional.alltypes d on c.id = d.id inner join functional.alltypes e
on b.int_col = c.int_col;

The following listing shows the incorrect EXPLAIN plan:

+-----------------------------------------------------------+
| Explain String                                            |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=480.04MB VCores=4 |
|                                                           |
| 14:EXCHANGE [UNPARTITIONED]                               |
| |                                                         |
| 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]               |
| |                                                         |
| |--13:EXCHANGE [BROADCAST]                                |
| |  |                                                      |
| |  04:SCAN HDFS [functional.alltypes e]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
| |  hash predicates: c.id = d.id                           |
| |  runtime filters: RF000 <- d.id                         |
| |                                                         |
| |--12:EXCHANGE [HASH(d.id)]                               |
| |  |                                                      |
| |  03:SCAN HDFS [functional.alltypes d]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]               |
| |  hash predicates: b.id = c.id                           |
| |  other predicates: b.int_col = c.int_col     <--- incorrect placement; should be at node 07 or 08
| |  runtime filters: RF001 <- c.int_col                    |
| |                                                         |
| |--11:EXCHANGE [HASH(c.id)]                               |
| |  |                                                      |
| |  02:SCAN HDFS [functional.alltypes c]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |     runtime filters: RF000 -> c.id                      |
| |                                                         |
| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
| |  hash predicates: b.id = a.id                           |
| |  runtime filters: RF002 <- a.id                         |
| |                                                         |
| |--10:EXCHANGE [HASH(a.id)]                               |
| |  |                                                      |
| |  00:SCAN HDFS [functional.alltypes a]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 09:EXCHANGE [HASH(b.id)]                                  |
| |                                                         |
| 01:SCAN HDFS [functional.alltypes b]                      |
|    partitions=24/24 files=24 size=478.45KB                |
|    runtime filters: RF001 -> b.int_col, RF002 -> b.id     |
+-----------------------------------------------------------+

Bug: IMPALA-3126

Severity: High

Workaround: For some queries, this problem can be worked around by placing the problematic ON clause predicate in the WHERE clause instead, or changing the preceding OUTER JOINs to INNER JOINs (if the ON clause predicate would discard NULLs). For example, to fix the problematic query above:

select 1 from functional.alltypes a
  left outer join functional.alltypes b
    on a.id = b.id
  left outer join functional.alltypes c
    on b.id = c.id
  right outer join functional.alltypes d
    on c.id = d.id
  inner join functional.alltypes e
where b.int_col = c.int_col

+-----------------------------------------------------------+
| Explain String                                            |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=480.04MB VCores=4 |
|                                                           |
| 14:EXCHANGE [UNPARTITIONED]                               |
| |                                                         |
| 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]               |
| |                                                         |
| |--13:EXCHANGE [BROADCAST]                                |
| |  |                                                      |
| |  04:SCAN HDFS [functional.alltypes e]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
| |  hash predicates: c.id = d.id                           |
| |  other predicates: b.int_col = c.int_col          <-- correct assignment
| |  runtime filters: RF000 <- d.id                         |
| |                                                         |
| |--12:EXCHANGE [HASH(d.id)]                               |
| |  |                                                      |
| |  03:SCAN HDFS [functional.alltypes d]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]               |
| |  hash predicates: b.id = c.id                           |
| |                                                         |
| |--11:EXCHANGE [HASH(c.id)]                               |
| |  |                                                      |
| |  02:SCAN HDFS [functional.alltypes c]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |     runtime filters: RF000 -> c.id                      |
| |                                                         |
| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
| |  hash predicates: b.id = a.id                           |
| |  runtime filters: RF001 <- a.id                         |
| |                                                         |
| |--10:EXCHANGE [HASH(a.id)]                               |
| |  |                                                      |
| |  00:SCAN HDFS [functional.alltypes a]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 09:EXCHANGE [HASH(b.id)]                                  |
| |                                                         |
| 01:SCAN HDFS [functional.alltypes b]                      |
|    partitions=24/24 files=24 size=478.45KB                |
|    runtime filters: RF001 -> b.id                         |
+-----------------------------------------------------------+

Impala may use incorrect bit order with BIT_PACKED encoding

Parquet BIT_PACKED encoding as implemented by Impala is LSB first. The Parquet standard specifies MSB first.

Bug: IMPALA-3006

Severity: High, but rare in practice because BIT_PACKED is infrequently used, is not written by Impala, and is deprecated in Parquet 2.0.

BST between 1972 and 1995

The calculation of start and end times for the BST (British Summer Time) time zone could be incorrect between 1972 and 1995. Between 1972 and 1995, BST began and ended at 02:00 GMT on the third Sunday in March (or second Sunday when Easter fell on the third) and fourth Sunday in October. For example, both function calls should return 13, but actually return 12, in a query such as:

select
  extract(from_utc_timestamp(cast('1970-01-01 12:00:00' as timestamp), 'Europe/London'), "hour") summer70start,
  extract(from_utc_timestamp(cast('1970-12-31 12:00:00' as timestamp), 'Europe/London'), "hour") summer70end;

Bug: IMPALA-3082

Severity: High

parse_url() returns incorrect result if @ character in URL

If a URL contains an @ character, the parse_url() function could return an incorrect value for the hostname field.

Bug: IMPALA-1170

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.

% escaping does not work correctly when occurs at the end in a LIKE clause

If the final character in the RHS argument of a LIKE operator is an escaped \% character, it does not match a % final character of the LHS argument.

Bug: IMPALA-2422

ORDER BY rand() does not work.

Because the value for rand() is computed early in a query, using an ORDER BY expression involving a call to rand() does not actually randomize the results.

Bug: IMPALA-397

Duplicated column in inline view causes dropping null slots during scan

If the same column is queried twice within a view, NULL values for that column are omitted. For example, the result of COUNT(*) on the view could be less than expected.

Bug: IMPALA-2643

Workaround: Avoid selecting the same column twice within an inline view.

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.10 / Impala 2.2.10.

Incorrect assignment of predicates through an outer join in an inline view.

A query involving an OUTER JOIN clause where one of the table references is an inline view might apply predicates from the ON clause incorrectly.

Bug: IMPALA-1459

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.

Crash: impala::Coordinator::ValidateCollectionSlots

A query could encounter a serious error if it includes multiple nested levels of INNER JOIN clauses involving subqueries.

Bug: IMPALA-2603

Incorrect assignment of On-clause predicate inside inline view with an outer join.

A query might return incorrect results due to wrong predicate assignment in the following scenario:

  1. There is an inline view that contains an outer join
  2. That inline view is joined with another table in the enclosing query block
  3. That join has an On-clause containing a predicate that only references columns originating from the outer-joined tables inside the inline view

Bug: IMPALA-2665

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.

Wrong assignment of having clause predicate across outer join

In an OUTER JOIN query with a HAVING clause, the comparison from the HAVING clause might be applied at the wrong stage of query processing, leading to incorrect results.

Bug: IMPALA-2144

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate

A NOT IN operator with a subquery that calls an aggregate function, such as NOT IN (SELECT SUM(...)), could return incorrect results.

Bug: IMPALA-2093

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.

Impala Known Issues: Metadata

These issues affect how Impala interacts with metadata. They cover areas such as the metastore database, the COMPUTE STATS statement, and the Impala catalogd daemon.

Catalogd may crash when loading metadata for tables with many partitions, many columns and with incremental stats

Incremental stats use about 400 bytes per partition for each column. For example, for a table with 20K partitions and 100 columns, the memory overhead from incremental statistics is about 800 MB. When serialized for transmission across the network, this metadata can exceed the 2 GB Java array size limit and lead to a catalogd crash.

Bugs: IMPALA-2647, IMPALA-2648, IMPALA-2649

Workaround: If feasible, compute full stats periodically and avoid computing incremental stats for that table. The scalability of incremental stats computation is a continuing work item.
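
To judge whether a particular table is at risk, the estimate above can be reproduced with simple arithmetic. The following Python sketch assumes only the approximate figure of 400 bytes per partition per column quoted above; the function name is hypothetical.

```python
# Approximate per-partition, per-column overhead quoted in this section.
BYTES_PER_PARTITION_PER_COLUMN = 400


def incremental_stats_bytes(num_partitions: int, num_columns: int) -> int:
    """Estimate the memory overhead of incremental stats for one table."""
    return num_partitions * num_columns * BYTES_PER_PARTITION_PER_COLUMN


# The example from this section: 20K partitions and 100 columns.
estimate = incremental_stats_bytes(20_000, 100)
print(f"{estimate / 1e6:.0f} MB")  # 800 MB
```

The result is a rough planning number only. The actual overhead depends on the table and on how the metadata is serialized.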

Can't update stats manually via alter table after upgrading to CDH 5.2

Bug: IMPALA-1420

Workaround: On CDH 5.2, when adjusting table statistics manually by setting the numRows property, you must also enable the Boolean property STATS_GENERATED_VIA_STATS_TASK. For example, use a statement like the following to set both properties with a single ALTER TABLE statement:

ALTER TABLE table_name SET TBLPROPERTIES('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK' = 'true');

Resolution: The underlying cause is the issue HIVE-8648 that affects the metastore in Hive 0.13. The workaround is only needed until the fix for this issue is incorporated into a CDH release.

Impala Known Issues: Interoperability

These issues affect the ability to interchange data between Impala and other systems. They cover areas such as data types and file formats.

CREATE TABLE AS SELECT (CTAS) fails to write to HDFS

This issue can occur on clusters where HDFS NameNode high availability is enabled. The CREATE TABLE AS SELECT statement fails to open the HDFS file for writing. When this condition occurs, an error that starts with "Failed to open HDFS file for writing:" appears in the Hue UI and in the logs.

Severity: High

Workaround: To write data to HDFS on clusters where HDFS NameNode high availability is enabled, manually create and populate the table using the CREATE TABLE statement followed by INSERT.

DESCRIBE FORMATTED gives error on Avro table

This issue can occur either on old Avro tables (created prior to Hive 1.1 / CDH 5.4) or when changing the Avro schema file by adding or removing columns. Columns added to the schema file will not show up in the output of the DESCRIBE FORMATTED command. Removing columns from the schema file will trigger a NullPointerException.

As a workaround, you can use the output of SHOW CREATE TABLE to drop and recreate the table. This will populate the Hive metastore database with the correct column definitions.

Severity: High

Deviation from Hive behavior: Impala does not perform implicit casts between string and numeric or boolean types.

Anticipated Resolution: None

Workaround: Use explicit casts.

Deviation from Hive behavior: Out of range float/double values are returned as maximum allowed value of type (Hive returns NULL)

Impala behavior differs from Hive with respect to out of range float/double values. Impala returns out of range values as the maximum allowed value of the type, while Hive returns NULL.

Workaround: None

Configuration needed for Flume to be compatible with Impala

For compatibility with Impala, the value for the Flume HDFS Sink hdfs.writeFormat must be set to Text, rather than its default value of Writable. The hdfs.writeFormat setting must be changed to Text before creating data files with Flume; otherwise, those files cannot be read by either Impala or Hive.

Resolution: A request has been made to add this information to the upstream Flume documentation.

Avro Scanner fails to parse some schemas

Querying certain Avro tables could cause a crash or return no rows, even though Impala could DESCRIBE the table.

Bug: IMPALA-635

Workaround: Swap the order of the fields in the schema specification. For example, ["null", "string"] instead of ["string", "null"].

Resolution: Not allowing this syntax agrees with the Avro specification, so it may still cause an error even when the crashing issue is resolved.
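
For schemas with many fields, the swap described in the workaround can be scripted. The following is a minimal sketch in Python that moves "null" to the front of each union in a parsed schema. The function name null_first is hypothetical, and the sketch only visits record fields, array items, and map values. It does not adjust field default values. The Avro specification ties a union field's default to the first branch of the union, so review any field that has a non-null default before applying the change.

```python
import json


def null_first(schema):
    """Return a copy of an Avro schema with "null" first in every union."""
    if isinstance(schema, list):  # a JSON array in a type position is a union
        branches = [null_first(branch) for branch in schema]
        nulls = [b for b in branches if b == "null"]
        others = [b for b in branches if b != "null"]
        return nulls + others
    if isinstance(schema, dict):
        result = dict(schema)
        if "fields" in result:  # record: rewrite the type of each field
            result["fields"] = [
                dict(field, type=null_first(field["type"]))
                for field in result["fields"]
            ]
        if "items" in result:  # array element type
            result["items"] = null_first(result["items"])
        if "values" in result:  # map value type
            result["values"] = null_first(result["values"])
        return result
    return schema  # primitive type name such as "string"


# Hypothetical record schema with one nullable string field.
original = json.loads(
    '{"type": "record", "name": "r", "fields": ['
    '{"name": "a", "type": ["string", "null"]},'
    '{"name": "b", "type": "int"}]}'
)
print(json.dumps(null_first(original)))
```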

Impala BE cannot parse Avro schema that contains a trailing semi-colon

If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is queried.

Bug: IMPALA-1024

Workaround: Remove the trailing semicolon from the Avro schema.
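
The trailing semicolon can be removed by hand or with a small script. The following Python sketch assumes the schema definition is available as a text string; the function name is hypothetical. It strips one trailing semicolon and then confirms that the remaining text parses as JSON.

```python
import json


def strip_trailing_semicolon(schema_text: str) -> str:
    """Remove a single trailing semicolon from an Avro schema definition."""
    cleaned = schema_text.rstrip()
    if cleaned.endswith(";"):
        cleaned = cleaned[:-1].rstrip()
    # Raises ValueError if the cleaned text is still not valid JSON.
    json.loads(cleaned)
    return cleaned


print(strip_trailing_semicolon('{"type": "string"};\n'))
```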

Fix decompressor to allow parsing gzips with multiple streams

Currently, Impala can only read gzipped files containing a single stream. If a gzipped file contains multiple concatenated streams, the Impala query only processes the data from the first stream.

Bug: IMPALA-2154

Workaround: Use a different gzip tool to compress the file into a single-stream file.

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.
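
On releases without the fix, it can help to check whether a file contains more than one stream before loading it. The following Python sketch uses the standard gzip and zlib modules to count the streams and rewrite the data as a single stream. The function names are hypothetical, and the sketch holds the whole file in memory, so it suits small files only.

```python
import gzip
import zlib


def count_gzip_streams(data: bytes) -> int:
    """Count the concatenated gzip streams in a byte string."""
    count = 0
    while data:
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header.
        decompressor = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated gzip stream")
        count += 1
        # Bytes after the end of this stream belong to the next one.
        data = decompressor.unused_data
    return count


def to_single_stream(data: bytes) -> bytes:
    """Recompress possibly multi-stream gzip data as one stream."""
    # gzip.decompress() reads every concatenated stream, so no rows are lost.
    return gzip.compress(gzip.decompress(data))


# Two streams concatenated, as produced by appending one .gz file to another.
multi = gzip.compress(b"a,1\n") + gzip.compress(b"b,2\n")
single = to_single_stream(multi)
print(count_gzip_streams(multi), count_gzip_streams(single))
```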

Impala incorrectly handles text data when the new line character \n\r is split between different HDFS blocks

If a carriage return / newline pair of characters in a text table is split between HDFS data blocks, Impala incorrectly processes the row following the \n\r pair twice.

Bug: IMPALA-1578

Workaround: Use the Parquet format for large volumes of data where practical.

Resolution: Fixed in CDH 5.8.0 / Impala 2.6.0.

Invalid bool value not reported as a scanner error

In some cases, an invalid BOOLEAN value read from a table does not produce a warning message about the bad value. The result is still NULL as expected. Therefore, this is not a query correctness issue, but it could lead to overlooking the presence of invalid data.

Bug: IMPALA-1862

Incorrect results with basic predicate on CHAR typed column.

When comparing a CHAR column value to a string literal, the literal value is not blank-padded and so the comparison might fail when it should match.

Bug: IMPALA-1652

Workaround: Use the RPAD() function to blank-pad literals compared with CHAR columns to the expected length.

Impala Known Issues: Limitations

These issues are current limitations of Impala that require evaluation as you plan how to integrate Impala into your data management workflow.

Set limits on size of expression trees

Very deeply nested expressions within queries can exceed internal Impala limits, leading to excessive memory usage.

Bug: IMPALA-4551

Severity: High

Workaround: Avoid queries with extremely large expression trees. Setting the query option disable_codegen=true may reduce the impact, at a cost of longer query runtime.

Impala does not support running on clusters with federated namespaces

Impala does not support running on clusters with federated namespaces. The impalad process will not start on a node running such a filesystem based on the org.apache.hadoop.fs.viewfs.ViewFs class.

Bug: IMPALA-77

Anticipated Resolution: Limitation

Workaround: Use standard HDFS on all Impala nodes.

Impala Known Issues: Miscellaneous / Older Issues

These issues do not fall into one of the above categories or have not been categorized yet.

A failed CTAS does not drop the table if the insert fails.

If a CREATE TABLE AS SELECT operation successfully creates the target table but an error occurs while querying the source table or copying the data, the new table is left behind rather than being dropped.

Bug: IMPALA-2005

Workaround: Drop the new table manually after a failed CREATE TABLE AS SELECT.

Casting scenarios with invalid/inconsistent results

Using a CAST() function to convert large literal values to smaller types, or to convert special values such as NaN or Inf, produces values not consistent with other database systems. This could lead to unexpected results from queries.

Bug: IMPALA-1821

Support individual memory allocations larger than 1 GB

The largest single block of memory that Impala can allocate during a query is 1 GiB. Therefore, a query could fail or Impala could crash if a compressed text file resulted in more than 1 GiB of data in uncompressed form, or if a string function such as group_concat() returned a value greater than 1 GiB.

Bug: IMPALA-1619

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.3 / Impala 2.6.3.

Impala Parser issue when using fully qualified table names that start with a number.

A fully qualified table name starting with a number could cause a parsing error. In a name such as db.571_market, the decimal point followed by digits is interpreted as a floating-point number.

Bug: IMPALA-941

Workaround: Surround each part of the fully qualified name with backticks (``).
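
When table names are assembled by a script, the quoting can be applied automatically. The following Python sketch wraps each dot-separated part of a name in backticks. The function name is hypothetical, and the sketch assumes that no part of the name itself contains a dot or a backtick.

```python
def quote_qualified_name(name: str) -> str:
    """Wrap each dot-separated part of a table name in backticks."""
    return ".".join(f"`{part}`" for part in name.split("."))


# The example name from this section.
print(quote_qualified_name("db.571_market"))
```

For the name db.571_market, this produces `db`.`571_market`.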

Impala should tolerate bad locale settings

If the LC_* environment variables specify an unsupported locale, Impala does not start.

Bug: IMPALA-532

Workaround: Add LC_ALL="C" to the environment settings for both the Impala daemon and the Statestore daemon. See Modifying Impala Startup Options for details about modifying these environment settings.

Resolution: Fixing this issue would require an upgrade to Boost 1.47 in the Impala distribution.

Log Level 3 Not Recommended for Impala

The extensive logging produced by log level 3 can cause serious performance overhead and capacity issues.

Workaround: Reduce the log level to its default value of 1, that is, GLOG_v=1. See Setting Logging Levels for details about the effects of setting different logging levels.