Apache Impala Known Issues

The following sections describe known issues and workarounds in Impala, as of the current production release. This page summarizes the most serious or frequently encountered issues in the current release, to help you make planning decisions about installing and upgrading. Any workarounds are listed here. The bug links take you to the Impala issues site, where you can see the diagnosis and whether a fix is in the pipeline.

For issues fixed in various Impala releases, see Fixed Issues in Apache Impala.

Impala/Sentry security roles mismatch after Catalog Server restart

This issue occurs when Impala’s Catalog Server is restarted without also restarting all the Impala Daemons.

Impala uses generated numeric identifiers for roles. These identifiers are regenerated when catalogd restarts, so the same role can receive a different identifier, possibly one used by a different role before the restart. An impalad's metadata cache can still contain the old id-to-role pairs, so when the cache is updated with privileges carrying new role ids from the catalog, a privilege can be added to the wrong role: the one that previously had the same role id.

Products affected: Apache Impala

Releases affected:
  • CDH 5.14.4 and all prior releases

Users affected: Impala users with authorization enabled.

Date/time of detection: 5th October, 2018

Severity (Low/Medium/High): 3.8 "Low"; CVSS:3.0/AV:N/AC:H/PR:H/UI:R/S:U/C:L/I:L/A:L

Impact: Users may get privileges of unrelated users.

CVE: CVE-2019-16381

Immediate action required: Update to a version of CDH containing the fix.

Addressed in release/refresh/patch: CDH 5.15.0

Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2020-348: Impala/Sentry security roles mismatch after Catalog Server restart

Timestamp type-casted to varchar in a binary predicate can produce incorrect result

In an Impala query, a TIMESTAMP value can be cast to a VARCHAR of smaller length to convert it to a date string. However, if such a cast is used in a binary comparison against a string literal, it can produce incorrect results because of a bug in the expression rewriting code. The following is an example:
> select * from (select cast('2018-12-11 09:59:37' as timestamp) as ts) tbl where cast(ts as varchar(10)) = '2018-12-11';
The output has 0 rows, even though the predicate should match the row.

Products affected: Apache Impala

Releases affected:
  • CDH 5.15.0, 5.15.1, 5.15.2, 5.16.0, 5.16.1
  • CDH 6.0.0, 6.0.1, 6.1.0, 6.1.1

Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2019-358: Timestamp type-casted to varchar in a binary predicate can produce incorrect result

XSS Cloudera Manager

Malicious Impala queries can result in Cross Site Scripting (XSS) when viewed in Cloudera Manager.

Products affected: Apache Impala

Releases affected:
  • Cloudera Manager 5.13.x, 5.14.x, 5.15.1, 5.15.2, 5.16.1
  • Cloudera Manager 6.0.0, 6.0.1, 6.1.0

Users affected: All Cloudera Manager Users

Date/time of detection: November 2018

Severity (Low/Medium/High): High

Impact: When a malicious user generates a piece of JavaScript in the impala-shell and then goes to the Queries tab of the Impala service in Cloudera Manager, that piece of JavaScript code gets evaluated, resulting in an XSS.

CVE: CVE-2019-14449

Immediate action required: There is no workaround, upgrade to the latest available maintenance release.

Addressed in release/refresh/patch:
  • Cloudera Manager 5.16.2
  • Cloudera Manager 6.0.2, 6.1.1, 6.2.0, 6.3.0

Impala Known Issues: Startup

These issues can prevent one or more Impala-related daemons from starting properly.

Problem retrieving FQDN causes startup problem on kerberized clusters

The method Impala uses to retrieve the host name while constructing the Kerberos principal is the gethostname() system call. This function might not always return the fully qualified domain name, depending on the network configuration. If the daemons cannot determine the FQDN, Impala does not start on a kerberized cluster.

This problem might occur immediately after an upgrade of a CDH cluster, due to changes in how Cloudera Manager automatically supplies the --hostname flag to the Impala-related daemons. (See the issue "hostname parameter is not passed to Impala catalog role" at the Cloudera Manager Known Issues page.)

Bugs: IMPALA-4978, IMPALA-5253

Severity: High

Resolution: The issue is expected to occur less frequently on systems with fixes for IMPALA-4978, IMPALA-5253, or both. Even on systems with fixes for both of these issues, the workaround might still be required in some cases.

Workaround: Test if a host is affected by checking whether the output of the hostname command includes the FQDN. On hosts where hostname only returns the short name, pass the command-line flag --hostname=fully_qualified_domain_name in the startup options of all Impala-related daemons.
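For example, a quick check might look like the following (host names are illustrative):

$ hostname
host1
$ hostname --fqdn
host1.example.com

On this host, hostname returns only the short name, so you would add --hostname=host1.example.com to the startup options of each Impala-related daemon.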

Impala Known Issues: Crashes and Hangs

These issues can cause Impala to quit or become unresponsive.

Unable to view large catalog objects in catalogd Web UI

In the catalogd Web UI, you can list metadata objects and view their details. These details are accessed via a link and printed as a string formatted using Thrift's DebugProtocol. Printing large objects (larger than 1 GB) in the Web UI can crash catalogd.

Bug: IMPALA-6841

Crash when querying tables with "\0" as a row delimiter

When querying a textfile-based Impala table that uses \0 as a new line separator, Impala crashes.

The following sequence causes impalad to crash:

create table tab_separated(id bigint, s string, n int, t timestamp, b boolean)
  row format delimited
  fields terminated by '\t' escaped by '\\' lines terminated by '\000'
  stored as textfile;
select * from tab_separated; -- Done. 0 results.
insert into tab_separated (id, s) values (100, ''); -- Success.
select * from tab_separated; -- 20 second delay before getting "Cancelled due to unreachable impalad(s): xxxx:22000"

Bug: IMPALA-6389

Workaround: Use an alternative delimiter, e.g. \001.
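For example, a minimal sketch of the table from the sequence above, redefined with \001 as the line terminator:

create table tab_separated(id bigint, s string, n int, t timestamp, b boolean)
  row format delimited
  fields terminated by '\t' escaped by '\\' lines terminated by '\001'
  stored as textfile;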

Resolution: Fixed in Impala 2.13 and higher.

Altering Kudu table schema outside of Impala may result in crash on read

Creating a table in Impala, changing the column schema outside of Impala, and then reading again in Impala may result in a crash. Neither Impala nor the Kudu client validates the schema immediately before reading, so Impala may attempt to dereference pointers that aren't there. This happens if a string column is dropped and then a new, non-string column is added with the old string column's name.

Bug: IMPALA-4828

Severity: High

Workaround: Run the statement REFRESH table_name after any occasion when the table structure, such as the number, names, or data types of columns, is modified outside of Impala using the Kudu API.
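For example, assuming a Kudu table named kudu_tbl whose columns were altered through the Kudu API:

REFRESH kudu_tbl;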

Resolution: Fixed in CDH 5.12 / Impala 2.9 and higher.

Queries that take a long time to plan can cause webserver to block other queries

Trying to get the details of a query through the debug web page while the query is planning will block new queries that had not started when the web page was requested. The web UI becomes unresponsive until the planning phase is finished.

Bug: IMPALA-1972

Severity: High

Resolution: Fixed in CDH 5.12 / Impala 2.9 and higher.

Linking IR UDF module to main module crashes Impala

A UDF compiled as an LLVM module (.ll) could cause a crash when executed.

Bug: IMPALA-4595

Severity: High

Resolution: Fixed in CDH 5.10 / Impala 2.8 and higher.

Workaround: Compile the external UDFs to a .so library instead of a .ll IR module.

Secure Impala-Kudu clusters need to be restarted frequently

Due to KUDU-2264, secure, long-running Impala clusters on CDH 5.13.x and later will be unable to interact with Kudu tables after some time because the Kudu client used by Impala will no longer be able to authenticate with Kudu. The problem starts after the client's ticket can no longer be renewed, which is typically after 7 days. Restarting the Impala service will allow Impala to work with Kudu again.

Impala will begin failing to read or write Kudu tables with errors in the Impala daemon logs resembling:
W1121 22:47:47.231425 50416 ConnectToCluster.java:302] Error receiving response from
cdh-master-4db933ef.cdh-cluster.internal:7051
Java exception follows:
org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, but this client is not
authenticated (kinit)

at org.apache.kudu.client.Negotiator.evaluateChallenge(Negotiator.java:705)
at org.apache.kudu.client.Negotiator.sendSaslInitiate(Negotiator.java:581)
at org.apache.kudu.client.Negotiator.startAuthentication(Negotiator.java:545)
at org.apache.kudu.client.Negotiator.handleTlsMessage(Negotiator.java:499)
at org.apache.kudu.client.Negotiator.handleResponse(Negotiator.java:264)
at org.apache.kudu.client.Negotiator.messageReceived(Negotiator.java:231)
at ...
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.kudu.client.Negotiator$1.run(Negotiator.java:691)
at org.apache.kudu.client.Negotiator$1.run(Negotiator.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.kudu.client.Negotiator.evaluateChallenge(Negotiator.java:687)
... 35 more

Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any
Kerberos tgt)

at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
... 40 more
Alternatively, you may see errors like the following:
I0102 11:58:36.688992 3587 jni-util.cc:196]
org.apache.impala.common.ImpalaRuntimeException: Unable to initialize the Kudu scan node

at org.apache.impala.planner.KuduScanNode.init(KuduScanNode.java:133)
at org.apache.impala.planner.SingleNodePlanner.createScanNode(SingleNodePlanner.java:1312)
at org.apache.impala.planner.SingleNodePlanner.createTableRefNode(SingleNodePlanner.java:1519)
at org.apache.impala.planner.SingleNodePlanner.createTableRefsPlan(SingleNodePlanner.java:779)
at org.apache.impala.planner.SingleNodePlanner.createSelectPlan(SingleNodePlanner.java:617)
at org.apache.impala.planner.SingleNodePlanner.createQueryPlan(SingleNodePlanner.java:260)
at org.apache.impala.planner.SingleNodePlanner.createSingleNodePlan(SingleNodePlanner.java:150)
at org.apache.impala.planner.Planner.createPlan(Planner.java:98)
at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1004)
at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1100)
at org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:156)

Caused by: org.apache.kudu.client.NonRecoverableException: cannot re-acquire authentication
token after 5 attempts

at org.apache.kudu.client.AuthnTokenReacquirer$1NewAuthnTokenErrB.failQueuedRpcs(AuthnTokenReacquirer.java:167)
at org.apache.kudu.client.AuthnTokenReacquirer$1NewAuthnTokenErrB.call(AuthnTokenReacquirer.java:159)
at org.apache.kudu.client.AuthnTokenReacquirer$1NewAuthnTokenErrB.call(AuthnTokenReacquirer.java:142)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1262)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1241)
at com.stumbleupon.async.Deferred.callback(Deferred.java:989)
at org.apache.kudu.client.ConnectToCluster.incrementCountAndCheckExhausted(ConnectToCluster.java:223)
at org.apache.kudu.client.ConnectToCluster.access$000(ConnectToCluster.java:48)
at org.apache.kudu.client.ConnectToCluster$ConnectToMasterErrCB.call(ConnectToCluster.java:304)
at org.apache.kudu.client.ConnectToCluster$ConnectToMasterErrCB.call(ConnectToCluster.java:293)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1262)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1241)
at com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1297)

Products Affected: Impala

Releases Affected: CDH 5.13.0, CDH 5.13.1, and CDH 5.14.0

Users Affected: Users running a secure cluster with Impala and Kudu.

Severity: Medium

Impact: Impala will not be able to query or write to Kudu tables until the service is restarted. The Impala service will need to be restarted periodically in order to continue being able to work with Kudu tables.

Immediate action required: To work around the issue, users can restart the Impala service at least once per ticket renewal period. This period is typically 7 days.

Resolution: Addressed in CDH 5.13.2, CDH 5.13.3, and CDH 5.14.2

Setting BATCH_SIZE query option too large can cause a crash

Using a value in the millions for the BATCH_SIZE query option, together with wide rows or large string values in columns, could cause a memory allocation of more than 2 GB resulting in a crash.
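For example, to revert to the default batch size in impala-shell (a value of 0 means use the built-in default):

set BATCH_SIZE=0;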

Bug: IMPALA-3069

Severity: High

Resolution: Fixed in CDH 5.9 / Impala 2.7 and higher.

Impala should not crash for invalid avro serialized data

Malformed Avro data, such as out-of-bounds integers or values in the wrong format, could cause a crash when queried.

Bug: IMPALA-3441

Severity: High

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.2 / Impala 2.6.2.

Queries may hang on server-to-server exchange errors

The DataStreamSender::Channel::CloseInternal() does not close the channel on an error. This causes the node on the other side of the channel to wait indefinitely, causing a hang.

Bug: IMPALA-2592

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Impalad is crashing if udf jar is not available in hdfs location for first time

If the JAR file corresponding to a Java UDF is removed from HDFS after the Impala CREATE FUNCTION statement is issued, the impalad daemon crashes.

Bug: IMPALA-2365

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Impala Known Issues: Performance

These issues involve the performance of operations such as queries or DDL statements.

Metadata operations block read-only operations on unrelated tables

Metadata operations that change the state of a table, such as COMPUTE STATS or ALTER TABLE RECOVER PARTITIONS, can delay the metadata loading of unrelated, not-yet-loaded tables triggered by statements such as DESCRIBE or SELECT.

Bug: IMPALA-6671

Profile timers not updated during long-running sort

If you have a query plan with a long-running sort operation, e.g. minutes, the profile timers are not updated to reflect the time spent in the sort until the sort starts returning rows.

Bug: IMPALA-5200

Workaround: Slow sorts can be identified by looking at "Peak Mem" in the summary or "PeakMemoryUsage" in the profile. If a sort is consuming multiple GB of memory per host, it will likely spend a significant amount of time sorting the data.

Slow queries for Parquet tables with convert_legacy_hive_parquet_utc_timestamps=true

The configuration setting convert_legacy_hive_parquet_utc_timestamps=true uses an underlying function that can be a bottleneck on high volume, highly concurrent queries due to the use of a global lock while loading time zone information. This bottleneck can cause slowness when querying Parquet tables, up to 30x for scan-heavy queries. The amount of slowdown depends on factors such as the number of cores and number of threads involved in the query.

Bug: IMPALA-3316

Severity: High

Workaround: Store the TIMESTAMP values as strings in one of the following formats:
  • yyyy-MM-dd
  • yyyy-MM-dd HH:mm:ss
  • yyyy-MM-dd HH:mm:ss.SSSSSSSSS

    The fractional second part can have 1 to 9 digits.

Impala implicitly converts such string values to TIMESTAMP in calls to date/time functions.
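For example, a string literal in one of these formats is converted implicitly when passed to a date/time function:

select hour('2018-12-11 09:59:37'); -- the string is implicitly cast to TIMESTAMP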

Slow DDL statements for tables with large number of partitions

DDL statements for tables with a large number of partitions might be slow.

Bug: IMPALA-1480

Workaround: Run the DDL statement in Hive if the slowness is an issue.

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Interaction of File Handle Cache with HDFS Appends and Short-Circuit Reads

If a data file used by Impala is being continuously appended or overwritten in place by an HDFS mechanism, such as hdfs dfs -appendToFile, interaction with the file handle caching feature in CDH 5.13 / Impala 2.10 and higher could cause short-circuit reads to sometimes be disabled on some DataNodes. When a mismatch is detected between the cached file handle and a data block that was rewritten because of an append, short-circuit reads are turned off on the affected host for a 10-minute period.

The possibility of encountering such an issue is the reason why the file handle caching feature is currently turned off by default. See Scalability Considerations for NameNode Traffic with File Handle Caching for information about this feature and how to enable it.

Bug: HDFS-12528

Severity: High

Workaround: Verify whether your ETL process is susceptible to this issue before enabling the file handle caching feature. You can set the impalad configuration option unused_file_handle_timeout_sec to a time period that is shorter than the HDFS setting dfs.client.read.shortcircuit.streams.cache.expiry.ms. (Keep in mind that the HDFS setting is in milliseconds while the Impala setting is in seconds.)
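For example, assuming the HDFS default expiry of 300000 milliseconds (5 minutes), you might set the Impala timeout slightly below it in the impalad startup options:

--unused_file_handle_timeout_sec=270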

Resolution: Fixed in HDFS 2.10 and higher. Use the new HDFS parameter dfs.domain.socket.disable.interval.seconds to specify the amount of time that short circuit reads are disabled on encountering an error. The default value is 10 minutes (600 seconds). It is recommended that you set dfs.domain.socket.disable.interval.seconds to a small value, such as 1 second, when using the file handle cache. Setting dfs.domain.socket.disable.interval.seconds to 0 is not recommended as a non-zero interval protects the system if there is a persistent problem with short circuit reads.

Impala Known Issues: Usability

These issues affect the convenience of interacting directly with Impala, typically through the Impala shell or Hue.

Impala shell tarball is not usable on systems with setuptools versions where '0.7' is a substring of the full version string

For example, this issue could occur on a system using setuptools version 20.7.0.

Bug: IMPALA-4570

Severity: High

Resolution: Fixed in CDH 5.10 / Impala 2.8 and higher.

Workaround: Change to a setuptools version that does not have 0.7 as a substring.

Unexpected privileges in show output

Due to a timing condition in updating cached policy data from Sentry, the SHOW statements for Sentry roles could sometimes display out-of-date role settings. Because Impala rechecks authorization for each SQL statement, this discrepancy does not represent a security issue for other statements.

Bug: IMPALA-3133

Severity: High

Resolution: Fixed in CDH 5.8.0 / Impala 2.6.0 and CDH 5.7.1 / Impala 2.5.1.

Less than 100% progress on completed simple SELECT queries

Simple SELECT queries show less than 100% progress even though they are already completed.

Bug: IMPALA-1776

Unexpected column overflow behavior with INT datatypes

Impala does not return column overflows as NULL, which would let users distinguish between NULL data and overflow conditions as they can with traditional database systems. Instead, Impala returns the largest or smallest value in the range for the type. For example, valid values for a tinyint range from -128 to 127. In Impala, a tinyint with a value of -200 returns -128 rather than NULL. A tinyint with a value of 200 returns 127.

Bug: IMPALA-3123

Resolution: Fixed in CDH 5.8.0 and higher / Impala 2.6.0

Impala Known Issues: JDBC and ODBC Drivers

These issues affect applications that use the JDBC or ODBC APIs, such as business intelligence tools or custom-written applications in languages such as Java or C++.

ImpalaODBC: Can not get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column)

If the ODBC SQLGetData is called on a series of columns, the function calls must follow the same order as the columns. For example, if data is fetched from column 2 then column 1, the SQLGetData call for column 1 returns NULL.

Bug: IMPALA-1792

Workaround: Fetch columns in the same order they are defined in the table.

Impala Known Issues: Security

These issues are related to security features, such as Kerberos authentication, Sentry authorization, encryption, auditing, and redaction.

Missing authorization in Apache Impala may allow data injection

A malicious user who is authenticated with Kerberos may have unauthorized access to internal services used by Impala to transfer intermediate data during query execution. If details of a running query (e.g. query ID, query plan) are available, a user can craft some RPC requests with custom software to inject data into a running query or end query execution prematurely, leading to wrong results of the query.

Products affected: Apache Impala

Releases affected: CDH 5.15.0, CDH 5.15.1

Users affected: Any users of Impala who have configured Kerberos security

Date/time of detection: Aug 21, 2018

Detected by: Cloudera, Inc.

Severity (Low/Medium/High): 4.5 "Medium"; CVSS:3.0/AV:A/AC:H/PR:L/UI:R/S:U/C:N/I:H/A:N/E:P/RL:T/RC:C/IR:H/MAV:A/MAC:H/MPR:L/MUI:R

Impact: Data injection may lead to wrong results of queries.

CVE: CVE-2018-11785

Immediate action required: Upgrade to a version which contains the fix or as a workaround, disable KRPC by setting --use_krpc=false in the Impala Command Line Argument Advanced Configuration Snippet (Safety Valve). The workaround will disable some improvements in stability and performance implemented in CDH 5.15.0 for highly concurrent workloads.

Addressed in release/refresh/patch: CDH 5.15.2, CDH 5.16.1 and higher

Impala does not support Heimdal Kerberos

Heimdal Kerberos is not supported in Impala.

Bug: IMPALA-7072

Affected Versions: All versions of Impala

Transient kerberos authentication error during table loading

A transient Kerberos error can cause a table to get into a bad state with an error: Failed to load metadata for table.

Bug: IMPALA-4712

Severity: High

Workaround: Resolve the Kerberos authentication problem and run INVALIDATE METADATA on the affected table.
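For example, assuming a hypothetical affected table sales_db.orders:

INVALIDATE METADATA sales_db.orders;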

In Impala with Sentry enabled, REVOKE ALL ON SERVER does not remove the privileges granted with the GRANT option

If you grant a role the ALL privilege at the SERVER scope with the WITH GRANT OPTION clause, you cannot revoke the privilege. Although the SHOW GRANT ROLE command will show that the privilege has been revoked immediately after you run the command, the ALL privilege will reappear when you run the SHOW GRANT ROLE command after Sentry refreshes.

Immediate Action Required: Once the privilege has been granted, the only way to remove it is to delete the role.

Affected Versions: CDH 6.0.0, CDH 6.0.1, CDH 5.15.0, CDH 5.15.1, CDH 5.14.x and all prior releases

Fixed Versions: CDH 5.16.1, CDH 5.15.2

Cloudera Issue: TSB-341

Malicious user can gain unauthorized access to Kudu table data via Impala

A malicious user with ALTER permissions on an Impala table can access any other Kudu table data by altering the table properties to make it "external" and then changing the underlying table mapping to point to other Kudu tables. This violates and works around the authorization requirement that creating a Kudu external table via Impala requires an ALL privilege at the server scope. This privilege requirement for CREATE commands is enforced to precisely avoid this scenario where a malicious user can change the underlying Kudu table mapping. The fix is to enforce the same privilege requirement for ALTER commands that would make existing non-external Kudu tables external.

Bug: IMPALA-5638

Severity: High

Workaround: A temporary workaround is to revoke ALTER permissions on Impala tables.

Resolution: Fixed in CDH 5.11.2 / Impala 2.10 and higher

Catalog server's kerberos ticket gets deleted after 'ticket_lifetime' on SLES11

On SLES11, after the configured ticket_lifetime elapses, the Kerberos ticket is deleted by the Java krb5 library.

Bug: IMPALA-6726

Severity: High

Workaround: On Impala 2.11.0, set --use_kudu_kinit=false in the Impala startup flags.

On Impala 2.12.0, set both --use_kudu_kinit=false and --use_krpc=false in the Impala startup flags.

Kerberos tickets must be renewable

In a Kerberos environment, the impalad daemon might not start if Kerberos tickets are not renewable.

Workaround: Configure your KDC to allow tickets to be renewed, and configure krb5.conf to request renewable tickets.
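A minimal sketch of the relevant krb5.conf settings (the lifetime values are illustrative):

[libdefaults]
  ticket_lifetime = 24h
  renew_lifetime = 7d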

Impala does not allow the use of insecure clusters with public IPs

Starting in CDH 5.15 / Impala 2.12, Impala, by default, will only allow unencrypted or unauthenticated connections from trusted subnets: 127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16. Unencrypted or unauthenticated connections from publicly routable IPs will be rejected.

The trusted subnets can be configured using the --trusted_subnets flag. Set it to '0.0.0.0/0' to allow unauthenticated connections from all remote IP addresses. However, if network access is not otherwise restricted by a firewall, malicious users may be able to gain unauthorized access.

Impala user not added to /etc/passwd when LDAP is enabled

When using Impala with LDAP enabled, a user may hit the following:

Not authorized: Client connection negotiation failed: client connection to 127.0.0.1:27000: SASL(-1): generic failure: All-whitespace username.

The following sequence can lead to the impala user not being created in /etc/passwd on some machines in the cluster:
  • Time 1: The impala user is not in LDAP. Impala was installed on machine 1, and the user impala is created in /etc/passwd.
  • Time 2: The impala user is added to LDAP.
  • Time 3: A new machine is added to the cluster. When adding Impala service to this new machine, adding the impala user will fail as it already exists in LDAP.

The consequence is that the impala user doesn't exist in /etc/passwd on the new machine, leading to the error above.

Workaround: Manually edit /etc/passwd to add the impala user.

Bug: IMPALA-7585

Affected Versions: CDH 5.15

Fixed Version: CDH 5.15.2, CDH 6.1

Kerberos authentication fails with the reverse DNS lookup disabled

Kerberos authentication does not function correctly if rdns = false is configured in krb5.conf. With rdns = false, principal matching fails because Kerberos receives an SPN (Service Principal Name) containing an IP address, while Impala expects a principal containing an FQDN.

Bug: IMPALA-7298

Affected Versions: CDH 5.15

Workaround: Set the following flags in krb5.conf:
  • dns_canonicalize_hostname = true
  • rdns = true

Fixed Version: CDH 5.15.1

System-wide auth-to-local mapping not applied correctly to Kudu service account

Due to system auth_to_local mapping, the principal may be mapped to some local name.

When running with Kerberos enabled, you may hit the following error message where <random-string> is some random string which doesn't match the primary in the Kerberos principal.

WARNINGS: TransmitData() to X.X.X.X:27000 failed: Remote error: Not authorized: {username='<random-string>', principal='impala/redacted'} is not allowed to access DataStreamService

Bug: KUDU-2198

Affected Versions: CDH 5.15 and higher

Workaround: Start Impala with the --use_system_auth_to_local=false flag to ignore the system-wide auth_to_local mappings configured in /etc/krb5.conf.

Kudu client in Impala FE doesn't renew/reacquire Kerberos tickets

When using Kudu with Impala on a secure cluster, the Kerberos ticket used by Kudu can expire even though the ticket used by Impala is renewed. Impala is then unable to access Kudu until the impalad daemon is restarted.

Cloudera Bug: CDH-63934

Affected versions: CDH 5.13.0, CDH 5.13.1

Severity: High

Workaround: Set a very long Kerberos ticket lifetime, for example 1 year.

Resolution: Upgrade to CDH 5.13.2 or higher.

Authorization error for SHOW CREATE VIEW

The SHOW CREATE VIEW statement returns an authorization error when the view references built-in functions.

Bug: IMPALA-7325

Affected Versions: CDH 5.15.1 and lower

Workaround: Execute the following statement from Impala as an admin user, where role is the role of the user who needs to run the SHOW CREATE VIEW statement:
GRANT SELECT ON DATABASE `_impala_builtins` TO role;

Resolution: Fixed in CDH 5.16.1 and higher.

Impala Known Issues: Resources

These issues involve memory or disk usage, including out-of-memory conditions, the spill-to-disk feature, and resource management features.

Handling large rows during upgrade to CDH 5.13 / Impala 2.10 or higher

After an upgrade to CDH 5.13 / Impala 2.10 or higher, users who process very large column values (long strings), or have increased the --read_size configuration setting from its default of 8 MB, might encounter capacity errors for some queries that previously worked.

Bug: IMPALA-6028

Severity: High

Resolution: After the upgrade, follow the instructions in the CDH 5.13 / Impala 2.10 documentation to check whether your queries are affected by these changes and to modify your configuration settings if so.

Configuration to prevent crashes caused by thread resource limits

Impala could encounter a serious error due to resource usage under very high concurrency. The error message is similar to:

F0629 08:20:02.956413 29088 llvm-codegen.cc:111] LLVM hit fatal error: Unable to allocate section memory!
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::thread_resource_error> >'

Bug: IMPALA-5605

Severity: High

Workaround:

In CDH 5.14 and lower versions, configure each host running an impalad daemon with the following settings:

echo 2000000 > /proc/sys/kernel/threads-max
echo 2000000 > /proc/sys/kernel/pid_max
echo 8000000 > /proc/sys/vm/max_map_count

In CDH 5.15 and higher versions, it is unlikely that you will hit the thread resource limit. Configure each host running an impalad daemon with the following setting:

echo 8000000 > /proc/sys/vm/max_map_count

To make the above settings durable, refer to your OS documentation. For example, on RHEL 6.x:
  1. Add the following line to /etc/sysctl.conf:
    vm.max_map_count=8000000
  2. Run the following command:
    sysctl -p

Memory usage when compact_catalog_topic flag enabled

The efficiency improvement from IMPALA-4029 can cause an increase in size of the updates to Impala catalog metadata that are broadcast to the impalad daemons by the statestored daemon. The increase in catalog update topic size results in higher CPU and network utilization. By default, the increase in topic size is about 5-7%. If the compact_catalog_topic flag is used, the size increase is more substantial, with a topic size approximately twice as large as in previous versions.

Bug: IMPALA-5500

Severity: Medium

Workaround: Consider setting the compact_catalog_topic configuration setting to false until this issue is resolved.

Resolution: Fixed in CDH 5.12.1 and CDH 5.13.0 and higher / Impala 2.10

Kerberos initialization errors due to high memory usage

On a kerberized cluster with high memory utilization, kinit commands executed after every 'kerberos_reinit_interval' may cause out-of-memory errors, because executing the command involves a fork of the Impala process. The error looks similar to the following:
Failed to obtain Kerberos ticket for principal: principal_details
Failed to execute shell cmd: 'kinit -k -t keytab_details',
error was: Error(12): Cannot allocate memory

Bug: IMPALA-2294

Severity: High

Workaround:

The following command changes the vm.overcommit_memory setting immediately on a running host. However, this setting is reset when the host is restarted.
echo 1 > /proc/sys/vm/overcommit_memory

To change the setting in a persistent way, add the following line to the /etc/sysctl.conf file:
vm.overcommit_memory=1

Then run sysctl -p. No reboot is needed.

Resolution: Fixed in CDH 5.14.0 and higher / Impala 2.11.0

DROP TABLE PURGE on S3A table may not delete externally written files

A DROP TABLE PURGE statement against an S3 table could leave the data files behind, if the table directory and the data files were created with a combination of hadoop fs and aws s3 commands.

Bug: IMPALA-3558

Severity: High

Resolution: The underlying issue with the S3A connector depends on the resolution of HADOOP-13230.

Impala catalogd heap issues when upgrading to 5.7

The default heap size for Impala catalogd has changed in CDH 5.7 / Impala 2.5 and higher:

  • Before 5.7, by default catalogd was using the JVM's default heap size, which is the smaller of 1/4th of the physical memory or 32 GB.

  • Starting with CDH 5.7.0, the default catalogd heap size is 4 GB.

For example, on a host with 128 GB of physical memory, the catalogd heap decreases from 32 GB to 4 GB. This can result in out-of-memory errors in catalogd and lead to query failures.

Severity: High

Workaround: Increase the catalogd memory limit as follows.

For schemas with large numbers of tables, partitions, and data files, the catalogd daemon might encounter an out-of-memory error. To increase the memory limit for the catalogd daemon:
  1. Check current memory usage for the catalogd daemon by running the following commands on the host where that daemon runs on your cluster:

      jcmd catalogd_pid VM.flags
      jmap -heap catalogd_pid
      
  2. Decide on a large enough value for the catalogd heap.
    • On systems managed by Cloudera Manager, include this value in the configuration field Java Heap Size of Catalog Server in Bytes (Cloudera Manager 5.7 and higher), or Impala Catalog Server Environment Advanced Configuration Snippet (Safety Valve) (prior to Cloudera Manager 5.7). Then restart the Impala service.

    • On systems not managed by Cloudera Manager, put the JAVA_TOOL_OPTIONS environment variable setting into the startup script for the catalogd daemon, then restart the catalogd daemon.

      For example, the following environment variable setting specifies the maximum heap size of 8 GB.

        JAVA_TOOL_OPTIONS="-Xmx8g"
        
  3. Use the same jcmd and jmap commands as earlier to verify that the new settings are in effect.

Breakpad minidumps can be very large when the thread count is high

The size of the breakpad minidump files grows linearly with the number of threads. By default, each thread adds 8 KB to the minidump size. Minidump files could consume significant disk space when the daemons have a high number of threads.

Bug: IMPALA-3509

Severity: High

Workaround: Add --minidump_size_limit_hint_kb=size to set a soft upper limit on the size of each minidump file. If the minidump file would exceed that limit, Impala reduces the amount of information for each thread from 8 KB to 2 KB. (Full thread information is captured for the first 20 threads, then 2 KB per thread after that.) The minidump file can still grow larger than the "hinted" size. For example, if you have 10,000 threads, the minidump file can be more than 20 MB.
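For example, to hint a 20 MB cap per minidump file (the value is illustrative):

--minidump_size_limit_hint_kb=20480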

Parquet scanner memory increase after IMPALA-2736

The initial release of CDH 5.8 / Impala 2.6 sometimes has a higher peak memory usage than in previous releases while reading Parquet files.

CDH 5.8 / Impala 2.6 addresses the issue IMPALA-2736, which improves the efficiency of Parquet scans by up to 2x. The faster scans may result in a higher peak memory consumption compared to earlier versions of Impala due to the new column-wise row materialization strategy. You are likely to experience higher memory consumption in any of the following scenarios:
  • Very wide rows due to projecting many columns in a scan.

  • Very large rows due to big column values, for example, long strings or nested collections with many items.

  • Producer/consumer speed imbalances, leading to more rows being buffered between a scan (producer) and downstream (consumer) plan nodes.

Bug: IMPALA-3662

Severity: High

Workaround: The following query options might help to reduce memory consumption in the Parquet scanner:
  • Reduce the number of scanner threads, for example: set num_scanner_threads=30
  • Reduce the batch size, for example: set batch_size=512
  • Increase the memory limit, for example: set mem_limit=64g

Resolution: Fixed in CDH 5.10 / Impala 2.8.

Process mem limit does not account for the JVM's memory usage

Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the impalad daemon.

Bug: IMPALA-691

Workaround: To monitor overall memory usage, use the top command, or add the memory figures in the Impala web UI /memz tab to JVM memory usage shown on the /metrics tab.

Fix issues with the legacy join and agg nodes using --enable_partitioned_hash_join=false and --enable_partitioned_aggregation=false

Bug: IMPALA-2375

Workaround: Transition away from the "old-style" join and aggregation mechanism if practical.

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Impala Known Issues: Correctness

These issues can cause incorrect or unexpected results from queries. They typically only arise in very specific circumstances.

DECIMAL AVG() can return incorrect results in Impala

When both an AVG() over a DECIMAL value and a DISTINCT aggregate (for example, COUNT(DISTINCT ...)) appear in the same SELECT statement, the DECIMAL AVG() can return incorrect results.

Bug: IMPALA-5251

Severity: High

Releases affected: CDH 5.11.0 only

Immediate action required:

  • If you use Impala and are not yet on CDH 5.11.0, wait for CDH 5.11.1 to upgrade.

  • If you use Impala and are in process of upgrading to CDH 5.11.0, or must upgrade before CDH 5.11.1 becomes available, contact Cloudera Support to get a patch to avoid this issue.

  • If you are on CDH 5.11.0 and use Impala, contact Cloudera Support to get a patch to avoid this issue.

  • If you do not use Impala, you are not affected.

Resolution: Fixed in CDH 5.12.0 and higher / Impala 2.9.0

Parquet scanner memory bug: I/O buffer is attached to output batch while scratch batch rows still reference it

Impala queries may return incorrect results when scanning plain-encoded string columns in uncompressed Parquet files. I/O buffers holding the string data are prematurely freed, leading to invalid memory reads and possibly non-deterministic results. This does not affect Parquet files that use a compression codec such as Snappy; Snappy is strongly recommended in general and is the default for Parquet files written by Impala.

How to determine whether a query might be affected:

  • The query must reference STRING columns from a Parquet table.
  • A selective filter on the Parquet table makes this issue more likely.
  • Identify any uncompressed Parquet files processed by the query. Examine the HDFS_SCAN_NODE portion of a query profile that scans the suspected table. Use a query that performs a full table scan, and materializes the column values. (For example, SELECT MIN(colname) FROM tablename.) Look for "File Formats". A value containing PARQUET/NONE means uncompressed Parquet.
  • Identify any plain-encoded string columns in the associated table. Pay special attention to tables containing Parquet files generated through Hive, Spark, or other mechanisms outside of Impala, because Impala uses Snappy compression by default for Parquet files. Use parquet-tools to dump the file metadata. Note that a column could have several encodings within the same file (the column data is stored in several column chunks). Look for VLE:PLAIN in the output of parquet-tools, which means the values are plain encoded.

Bug: IMPALA-4539

Severity: High

Resolution: Fixed in CDH 5.10.0 / Impala 2.8 and higher

Workaround: Use Snappy or another compression codec for Parquet files.

ABS(n) where n is the lowest bound for the int types returns negative values

If the abs() function evaluates a number that is right at the lower bound for an integer data type, the positive result cannot be represented in the same type, and the result is returned as a negative number. For example, abs(-128) returns -128 because the argument is interpreted as a TINYINT and the return value is also a TINYINT.

Bug: IMPALA-4513

Severity: High

Resolution: Fixed in CDH 5.14.0 and higher / Impala 2.11.0

Workaround: Cast the integer value to a larger type. For example, rewrite abs(tinyint_col) as abs(cast(tinyint_col as smallint)).

Java UDF expression returning string in GROUP BY can give incorrect results.

If the GROUP BY clause included a call to a Java UDF that returned a string value, the UDF could return an incorrect result.

Bug: IMPALA-4266

Severity: High

Resolution: Fixed in CDH 5.10 / Impala 2.8 and higher.

Workaround: Rewrite the expression to concatenate the results of the Java UDF with an empty string call. For example, rewrite my_hive_udf() as concat(my_hive_udf(), '').

Incorrect assignment of NULL checking predicate through an outer join of a nested collection.

A query could return wrong results (too many or too few NULL values) if it referenced an outer-joined nested collection and also contained a null-checking predicate (IS NULL, IS NOT NULL, or the <=> operator) in the WHERE clause.

Bug: IMPALA-3084

Severity: High

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0.

Incorrect result due to constant evaluation in query with outer join

An OUTER JOIN query could omit some expected result rows due to a constant such as FALSE in another join clause. For example:

explain SELECT 1 FROM alltypestiny a1
  INNER JOIN alltypesagg a2 ON a1.smallint_col = a2.year AND false
  RIGHT JOIN alltypes a3 ON a1.year = a1.bigint_col;
+---------------------------------------------------------+
| Explain String                                          |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.00KB VCores=1 |
|                                                         |
| 00:EMPTYSET                                             |
+---------------------------------------------------------+

Bug: IMPALA-3094

Severity: High

Workaround: None

Incorrect assignment of an inner join On-clause predicate through an outer join.

Impala might return incorrect results for queries that have the following properties:

  • There is an INNER JOIN following a series of OUTER JOINs.

  • The INNER JOIN has an On-clause with a predicate that references at least two tables that are on the nullable side of the preceding OUTER JOINs.

The following query demonstrates the issue:

select 1 from functional.alltypes a left outer join
  functional.alltypes b on a.id = b.id left outer join
  functional.alltypes c on b.id = c.id right outer join
  functional.alltypes d on c.id = d.id inner join functional.alltypes e
on b.int_col = c.int_col;

The following listing shows the incorrect EXPLAIN plan:

+-----------------------------------------------------------+
| Explain String                                            |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=480.04MB VCores=4 |
|                                                           |
| 14:EXCHANGE [UNPARTITIONED]                               |
| |                                                         |
| 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]               |
| |                                                         |
| |--13:EXCHANGE [BROADCAST]                                |
| |  |                                                      |
| |  04:SCAN HDFS [functional.alltypes e]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
| |  hash predicates: c.id = d.id                           |
| |  runtime filters: RF000 <- d.id                         |
| |                                                         |
| |--12:EXCHANGE [HASH(d.id)]                               |
| |  |                                                      |
| |  03:SCAN HDFS [functional.alltypes d]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]               |
| |  hash predicates: b.id = c.id                           |
| |  other predicates: b.int_col = c.int_col     <--- incorrect placement; should be at node 07 or 08
| |  runtime filters: RF001 <- c.int_col                    |
| |                                                         |
| |--11:EXCHANGE [HASH(c.id)]                               |
| |  |                                                      |
| |  02:SCAN HDFS [functional.alltypes c]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |     runtime filters: RF000 -> c.id                      |
| |                                                         |
| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
| |  hash predicates: b.id = a.id                           |
| |  runtime filters: RF002 <- a.id                         |
| |                                                         |
| |--10:EXCHANGE [HASH(a.id)]                               |
| |  |                                                      |
| |  00:SCAN HDFS [functional.alltypes a]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 09:EXCHANGE [HASH(b.id)]                                  |
| |                                                         |
| 01:SCAN HDFS [functional.alltypes b]                      |
|    partitions=24/24 files=24 size=478.45KB                |
|    runtime filters: RF001 -> b.int_col, RF002 -> b.id     |
+-----------------------------------------------------------+

Bug: IMPALA-3126

Severity: High

Resolution: Fixed in CDH 5.10.0 / Impala 2.8.0 and higher

Workaround: For some queries, this problem can be worked around by placing the problematic ON clause predicate in the WHERE clause instead, or by changing the preceding OUTER JOINs to INNER JOINs (if the ON clause predicate would discard NULLs). For example, to fix the problematic query above:

select 1 from functional.alltypes a
  left outer join functional.alltypes b
    on a.id = b.id
  left outer join functional.alltypes c
    on b.id = c.id
  right outer join functional.alltypes d
    on c.id = d.id
  inner join functional.alltypes e
where b.int_col = c.int_col

+-----------------------------------------------------------+
| Explain String                                            |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=480.04MB VCores=4 |
|                                                           |
| 14:EXCHANGE [UNPARTITIONED]                               |
| |                                                         |
| 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]               |
| |                                                         |
| |--13:EXCHANGE [BROADCAST]                                |
| |  |                                                      |
| |  04:SCAN HDFS [functional.alltypes e]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
| |  hash predicates: c.id = d.id                           |
| |  other predicates: b.int_col = c.int_col          <-- correct assignment
| |  runtime filters: RF000 <- d.id                         |
| |                                                         |
| |--12:EXCHANGE [HASH(d.id)]                               |
| |  |                                                      |
| |  03:SCAN HDFS [functional.alltypes d]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]               |
| |  hash predicates: b.id = c.id                           |
| |                                                         |
| |--11:EXCHANGE [HASH(c.id)]                               |
| |  |                                                      |
| |  02:SCAN HDFS [functional.alltypes c]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |     runtime filters: RF000 -> c.id                      |
| |                                                         |
| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]              |
| |  hash predicates: b.id = a.id                           |
| |  runtime filters: RF001 <- a.id                         |
| |                                                         |
| |--10:EXCHANGE [HASH(a.id)]                               |
| |  |                                                      |
| |  00:SCAN HDFS [functional.alltypes a]                   |
| |     partitions=24/24 files=24 size=478.45KB             |
| |                                                         |
| 09:EXCHANGE [HASH(b.id)]                                  |
| |                                                         |
| 01:SCAN HDFS [functional.alltypes b]                      |
|    partitions=24/24 files=24 size=478.45KB                |
|    runtime filters: RF001 -> b.id                         |
+-----------------------------------------------------------+

Impala may use incorrect bit order with BIT_PACKED encoding

Parquet BIT_PACKED encoding as implemented by Impala is LSB first, while the Parquet standard specifies MSB first.

Bug: IMPALA-3006

Severity: High, but rare in practice because BIT_PACKED is infrequently used, is not written by Impala, and is deprecated in Parquet 2.0.

BST between 1972 and 1995

The calculation of start and end times for the BST (British Summer Time) time zone could be incorrect between 1972 and 1995. Between 1972 and 1995, BST began and ended at 02:00 GMT on the third Sunday in March (or second Sunday when Easter fell on the third) and fourth Sunday in October. For example, both function calls should return 13, but actually return 12, in a query such as:

select
  extract(from_utc_timestamp(cast('1970-01-01 12:00:00' as timestamp), 'Europe/London'), "hour") summer70start,
  extract(from_utc_timestamp(cast('1970-12-31 12:00:00' as timestamp), 'Europe/London'), "hour") summer70end;

Bug: IMPALA-3082

Severity: High

parse_url() returns incorrect result if @ character in URL

If a URL contains an @ character, the parse_url() function could return an incorrect value for the hostname field.

Bug: IMPALA-1170

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.

% escaping does not work correctly when it occurs at the end of a LIKE clause

If the final character in the RHS argument of a LIKE operator is an escaped \% character, it does not match a % final character of the LHS argument.
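A minimal illustration of the described behavior (the literal values are ours):

select '100%' like '100\%'; -- should return true; affected versions fail to match the final %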

Bug: IMPALA-2422

ORDER BY rand() does not work.

Because the value for rand() is computed early in a query, using an ORDER BY expression involving a call to rand() does not actually randomize the results.

Bug: IMPALA-397

Resolution: Fixed in CDH 5.12.0 / Impala 2.9.0

Wrong results with correlated WHERE clause subquery inside a NULL-checking conditional function

Impala may generate an incorrect plan, and therefore incorrect results, for queries that have a correlated scalar subquery as a parameter to a NULL-checking conditional function such as isnull().

Bug: IMPALA-4373

Severity: High

Workaround: None

Cannot execute IR UDF when single node execution is enabled

A UDF compiled into an LLVM IR bitcode module (.bc) would have undefined effects when native code generation was turned off, for example when Impala applied the single-node optimization for small queries.

Bug: IMPALA-4432

Severity: High

Resolution: In CDH 5.10 / Impala 2.8 and higher, Impala returns an error if the UDF cannot run because of this issue.

Workaround: Turn native code generation back on with the query option setting DISABLE_CODEGEN=0.

Duplicated column in inline view causes dropping null slots during scan

If the same column is queried twice within a view, NULL values for that column are omitted. For example, the result of COUNT(*) on the view could be less than expected.
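A sketch of the pattern to avoid (table and column names are hypothetical):

select count(*) from (select c1, c1 from t) v; -- rows where c1 is NULL may be dropped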

Bug: IMPALA-2643

Workaround: Avoid selecting the same column twice within an inline view.

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.10 / Impala 2.2.10.

Incorrect assignment of predicates through an outer join in an inline view.

A query involving an OUTER JOIN clause where one of the table references is an inline view might apply predicates from the ON clause incorrectly.

Bug: IMPALA-1459

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.

Crash: impala::Coordinator::ValidateCollectionSlots

A query could encounter a serious error if it includes multiple nested levels of INNER JOIN clauses involving subqueries.

Bug: IMPALA-2603

Incorrect assignment of On-clause predicate inside inline view with an outer join.

A query might return incorrect results due to wrong predicate assignment in the following scenario:

  1. There is an inline view that contains an outer join
  2. That inline view is joined with another table in the enclosing query block
  3. That join has an On-clause containing a predicate that only references columns originating from the outer-joined tables inside the inline view

Bug: IMPALA-2665

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.

Wrong assignment of having clause predicate across outer join

In an OUTER JOIN query with a HAVING clause, the comparison from the HAVING clause might be applied at the wrong stage of query processing, leading to incorrect results.

Bug: IMPALA-2144

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate

A NOT IN operator with a subquery that calls an aggregate function, such as NOT IN (SELECT SUM(...)), could return incorrect results.

Bug: IMPALA-2093

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.

Impala Known Issues: Metadata

These issues affect how Impala interacts with metadata. They cover areas such as the metastore database, the COMPUTE STATS statement, and the Impala catalogd daemon.

Concurrent catalog operations with heavy DDL workloads can cause queries with SYNC_DDL to fail fast

When the Catalog Server is under heavy load from concurrent, long-running DDL operations, queries run with the SYNC_DDL query option can fail with the following message:
ERROR: CatalogException: Couldn't retrieve the catalog topic
version for the SYNC_DDL operation after 3 attempts. The operation has
been successfully executed but its effects may have not been
broadcast to all the coordinators.

The catalog operation is actually successful, as the change has been committed to the HMS and the Catalog Server cache, but when the Catalog Server sees that broadcasting the changes is taking longer than expected, it fails fast.

The coordinator daemons eventually sync up in the background.

Affected Versions: CDH versions from 5.15 to 6.1

Bug: IMPALA-7961

Resolution: Fixed in CDH 6.2.0 / Impala 3.2.

Catalogd may crash when loading metadata for tables with many partitions, many columns, and with incremental stats

Incremental stats use up about 400 bytes per partition for each column. For example, for a table with 20K partitions and 100 columns, the memory overhead from incremental statistics is about 800 MB. When serialized for transmission across the network, this metadata exceeds the 2 GB Java array size limit and leads to a catalogd crash.

Bugs: IMPALA-2647, IMPALA-2648, IMPALA-2649

Workaround: If feasible, compute full stats periodically and avoid computing incremental stats for that table. The scalability of incremental stats computation is a continuing work item.
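For example, for a hypothetical wide, heavily partitioned table:

COMPUTE STATS big_partitioned_table;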

Can't update stats manually via alter table after upgrading to CDH 5.2

Bug: IMPALA-1420

Workaround: On CDH 5.2, when adjusting table statistics manually by setting the numRows, you must also enable the Boolean property STATS_GENERATED_VIA_STATS_TASK. For example, use a statement like the following to set both properties with a single ALTER TABLE statement:

ALTER TABLE table_name SET TBLPROPERTIES('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK' = 'true');

Resolution: The underlying cause is the issue HIVE-8648 that affects the metastore in Hive 0.13. The workaround is only needed until the fix for this issue is incorporated into a CDH release.

Impala Known Issues: Interoperability

These issues affect the ability to interchange data between Impala and other systems. They cover areas such as data types and file formats.

Queries Stuck on Failed HDFS Calls and not Timing out

In CDH 6.2 / Impala 3.2 and higher, if the following error appears multiple times within a short duration while running a query, it means that the connection between the impalad and the HDFS NameNode is in a bad state, and the impalad must be restarted:

"hdfsOpenFile() for <filename> at backend <hostname:port> failed to finish before the <hdfs_operation_timeout_sec> second timeout"

In CDH 6.1 / Impala 3.1 and lower, the same issue causes Impala to wait for a long time or hang, without showing the above error message.

Bug: HADOOP-15720

Affected Versions: All versions of Impala

Workaround: Restart the impalad in the bad state.

CREATE TABLE AS SELECT (CTAS) fails to write to HDFS

This issue can occur on clusters where HDFS NameNode high availability is enabled. The CREATE TABLE AS SELECT statement fails to open the HDFS file for writing. When this condition occurs, an error that starts with "Failed to open HDFS file for writing:" appears in the Hue UI and logs.

Severity: High

Workaround: To write data to HDFS on clusters where HDFS NameNode high availability is enabled, manually create and populate the table using the CREATE TABLE statement followed by INSERT.
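For example (table and column names are illustrative), instead of a single CREATE TABLE AS SELECT you might run:

CREATE TABLE new_tbl (id INT, val STRING);
INSERT INTO new_tbl SELECT id, val FROM source_tbl;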

DESCRIBE FORMATTED gives error on Avro table

This issue can occur either on old Avro tables (created prior to Hive 1.1 / CDH 5.4) or when changing the Avro schema file by adding or removing columns. Columns added to the schema file will not show up in the output of the DESCRIBE FORMATTED command. Removing columns from the schema file will trigger a NullPointerException.

Workaround: Use the output of SHOW CREATE TABLE to drop and recreate the table. This populates the Hive metastore database with the correct column definitions.
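A sketch of that workaround, assuming an illustrative table name avro_tbl (with an EXTERNAL table, the underlying data files are preserved across the drop):

SHOW CREATE TABLE avro_tbl;
-- copy the CREATE TABLE statement from the output, then:
DROP TABLE avro_tbl;
-- re-run the copied CREATE TABLE statement to recreate the table with
-- column definitions that match the current Avro schema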

Severity: High

Resolution: Fixed in CDH 5.8.2 and CDH 5.9.0 / Impala 2.6.2 and Impala 2.7.0

Avro Scanner fails to parse some schemas

The default value in an Avro schema must match the first type in the union. For example, if the default value is null, then the first type in the union must be "null".

Bug: IMPALA-635

Workaround: Swap the order of the fields in the schema specification. For example, use ["null", "string"] instead of ["string", "null"]. Note that the files written with the problematic schema must be rewritten with the new schema because Avro files have embedded schemas.
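For example, a minimal Avro table definition with the corrected union order (table name and schema are illustrative) could look like:

CREATE TABLE avro_demo
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "demo",
  "fields": [
    {"name": "s", "type": ["null", "string"], "default": null}
  ]
}');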

Impala BE cannot parse Avro schema that contains a trailing semi-colon

If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is queried.

Bug: IMPALA-1024

Workaround: Remove the trailing semicolon from the Avro schema.

Fix decompressor to allow parsing gzips with multiple streams

Currently, Impala can only read gzipped files containing a single stream. If a gzipped file contains multiple concatenated streams, the Impala query only processes the data from the first stream.

Bug: IMPALA-2154

Workaround: Use a different gzip tool to compress the file as a single stream.

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

Impala incorrectly handles text data when the newline sequence \r\n is split between different HDFS blocks

If a carriage return / newline (\r\n) pair of characters in a text table is split between HDFS data blocks, Impala incorrectly processes the row following the pair twice.

Bug: IMPALA-1578

Workaround: Use the Parquet format for large volumes of data where practical.

Resolution: Fixed in CDH 5.8.0 / Impala 2.6.0.

Invalid bool value not reported as a scanner error

In some cases, an invalid BOOLEAN value read from a table does not produce a warning message about the bad value. The result is still NULL as expected. Therefore, this is not a query correctness issue, but it could lead to overlooking the presence of invalid data.

Bug: IMPALA-1862

Resolution: Fixed in CDH 5.10.0 and higher / Impala 2.8.0

Incorrect results with basic predicate on CHAR typed column.

When comparing a CHAR column value to a string literal, the literal value is not blank-padded and so the comparison might fail when it should match.

Bug: IMPALA-1652

Workaround: Use the RPAD() function to blank-pad literals compared with CHAR columns to the expected length.
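For example, with an illustrative CHAR(10) column named c10:

-- blank-pad the literal to the declared CHAR length before comparing
SELECT * FROM t WHERE c10 = RPAD('abc', 10, ' ');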

Impala Known Issues: Limitations

These issues are current limitations of Impala that require evaluation as you plan how to integrate Impala into your data management workflow.

Set limits on size of expression trees

Very deeply nested expressions within queries can exceed internal Impala limits, leading to excessive memory usage.

Bug: IMPALA-4551

Severity: High

Workaround: Avoid queries with extremely large expression trees. Setting the query option disable_codegen=true may reduce the impact, at a cost of longer query runtime.
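The option can be toggled per session in impala-shell, for example:

SET DISABLE_CODEGEN=true;
-- run the query with the large expression tree, then restore the default:
SET DISABLE_CODEGEN=false;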

Hue and BDR require separate parameters for Impala Load Balancer

Cloudera Manager supports a single parameter for specifying the Impala Daemon Load Balancer. However, because BDR and Hue need to use different ports when connecting to the load balancer, it is not possible to configure the load balancer value so that BDR and Hue will work correctly in the same cluster.

Workaround: To configure BDR with Impala, use the load balancer configuration either without a port specification or with the Beeswax port.

To configure Hue, use the Hue Server Advanced Configuration Snippet (Safety Valve) for impalad_flags to specify the load balancer address with the HiveServer2 port.

Affected Versions: CDH versions from 5.11 to 6.0

Bug: OPSAPS-46641

Impala Known Issues: Miscellaneous / Older Issues

These issues do not fall into one of the above categories or have not been categorized yet.

A failed CTAS does not drop the table if the insert fails.

If a CREATE TABLE AS SELECT operation successfully creates the target table but an error occurs while querying the source table or copying the data, the new table is left behind rather than being dropped.

Bug: IMPALA-2005

Workaround: Drop the new table manually after a failed CREATE TABLE AS SELECT.
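For example, if a statement such as CREATE TABLE new_tbl AS SELECT ... failed partway through (new_tbl is an illustrative name):

DROP TABLE IF EXISTS new_tbl;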

Casting scenarios with invalid/inconsistent results

Using a CAST() function to convert large literal values to smaller types, or to convert special values such as NaN or Inf, produces values not consistent with other database systems. This could lead to unexpected results from queries.
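For example, casts of roughly these shapes can behave surprisingly; exact results vary by Impala version, so the comments below are indicative only:

SELECT CAST(1000 AS TINYINT);  -- value exceeds the TINYINT range; result may not match other systems
SELECT CAST('nan' AS DOUBLE);  -- yields NaN rather than an error
SELECT CAST('inf' AS DOUBLE);  -- yields Infinity rather than an error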

Bug: IMPALA-1821

Support individual memory allocations larger than 1 GB

The largest single block of memory that Impala can allocate during a query is 1 GiB. Therefore, a query could fail or Impala could crash if a compressed text file resulted in more than 1 GiB of data in uncompressed form, or if a string function such as group_concat() returned a value greater than 1 GiB.

Bug: IMPALA-1619

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.3 / Impala 2.6.3.

Impala Parser issue when using fully qualified table names that start with a number.

A fully qualified table name starting with a number could cause a parsing error. In a name such as db.571_market, the decimal point followed by digits is interpreted as a floating-point number.

Bug: IMPALA-941

Workaround: Surround each part of the fully qualified name with backticks (``).
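For example:

-- parse error: ".571" after the database name is read as part of a floating-point number
SELECT * FROM db.571_market;
-- workaround: quote each part of the qualified name
SELECT * FROM `db`.`571_market`;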

Impala should tolerate bad locale settings

If the LC_* environment variables specify an unsupported locale, Impala does not start.

Bug: IMPALA-532

Workaround: Add LC_ALL="C" to the environment settings for both the Impala daemon and the Statestore daemon. See Modifying Impala Startup Options for details about modifying these environment settings.

Resolution: Fixing this issue would require an upgrade to Boost 1.47 in the Impala distribution.

Log Level 3 Not Recommended for Impala

The extensive logging produced by log level 3 can cause serious performance overhead and capacity issues.

Workaround: Reduce the log level to its default value of 1, that is, GLOG_v=1. See Setting Logging Levels for details about the effects of setting different logging levels.