Known Issues and Workarounds in Impala

Known Issues in the Current Release

The impala-shell command in Impala 1.0.1 does not work with Python 2.4, which is the default on Red Hat 5.

For the impala-shell command in Impala 1.0, the -o option (pipe output to a file) does not work with Python 2.4.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-396

Severity: High

Workaround: On Linux systems with Python 2.4, use the impala-shell package from Impala 1.0, and avoid using the -o option.

Resolution: To be fixed in a future release
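
Before choosing which impala-shell package to install on a host, it can help to confirm which Python interpreter is the default. The following is a minimal sketch, not part of Impala: the helper name is_affected_python is hypothetical, and it flags only Python 2.4, the version named in this issue.

```python
import sys

def is_affected_python(version_info):
    """Return True for Python 2.4.x, the version named in this issue.

    Hypothetical helper for illustration; it makes no claim about other versions.
    """
    return tuple(version_info[:2]) == (2, 4)

if __name__ == "__main__":
    # Run this with the interpreter that impala-shell would use on the host.
    if is_affected_python(sys.version_info):
        print("Python 2.4 detected: use the Impala 1.0 impala-shell and avoid -o")
    else:
        print("Python %d.%d detected: not the version named in this issue"
              % tuple(sys.version_info[:2]))
```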

10-20% perf regression for most queries across all table formats

This issue is due to a performance tradeoff between systems running many queries concurrently, and systems running a single query. Systems running only a single query could experience lower performance than in early beta releases. Systems running many queries simultaneously should experience higher performance than in the beta releases.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-335

Severity: High

Resolution: To be fixed in a future release

Impala does not support running on clusters with federated namespaces

Impala does not support running on clusters with federated namespaces. The impalad process will not start on a node running such a filesystem based on the org.apache.hadoop.fs.viewfs.ViewFs class.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-77

Severity: Undetermined

Anticipated Resolution: Limitation

Workaround: Use standard HDFS on all Impala nodes.

Order of table references in FROM clause is critical for optimal performance

Impala does not currently optimize the join order of queries; instead, it joins tables in the order in which they are listed in the FROM clause. Queries that contain one or more large tables on the right-hand side of joins (either an explicit join expressed with the JOIN keyword or a join implicit in the list of table references in the FROM clause) may run slowly or crash Impala due to out-of-memory errors. For example:

SELECT ... FROM small_table JOIN large_table

Severity: Medium

Anticipated Resolution: To be fixed in a future release

Workaround: Modify query, if possible, to join the largest table first. For example:

SELECT ... FROM small_table JOIN large_table

should be modified to:

SELECT ... FROM large_table JOIN small_table
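
If queries are generated by a script, the workaround above can be applied when the FROM clause is built. The sketch below assumes that approximate row counts are available from somewhere; the row_counts dictionary and both function names are hypothetical. Join conditions are omitted, as in the examples above, and the reordering is intended for inner joins only.

```python
def order_largest_first(row_counts):
    """Return table names sorted by descending (approximate) row count."""
    return sorted(row_counts, key=lambda name: row_counts[name], reverse=True)

def from_clause(row_counts):
    """Build a FROM clause that lists the largest table first.

    Join conditions are left out on purpose; this only illustrates ordering.
    """
    return "FROM " + " JOIN ".join(order_largest_first(row_counts))

# Hypothetical row counts, for illustration only.
example_counts = {"small_table": 1000, "large_table": 50000000}
print("SELECT ... " + from_clause(example_counts))
```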

Impala INSERT OVERWRITE ... SELECT behavior differs from Hive in that partitions are only deleted/re-written if the SELECT statement returns data.

Impala INSERT OVERWRITE ... SELECT behavior differs from Hive in that the partitions are only deleted or rewritten if the SELECT statement returns data. Hive always deletes the data.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-89

Severity: Medium

Workaround: None
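
The difference matters most when the SELECT returns no rows. The following is a simplified model of the two behaviors described above, not actual Impala or Hive code; a partition is represented as a plain list of rows, and both function names are hypothetical.

```python
def impala_insert_overwrite(partition_rows, selected_rows):
    """Model of the Impala behavior: overwrite only when the SELECT returns data."""
    if selected_rows:
        return list(selected_rows)
    return partition_rows  # empty SELECT result: existing data is left in place

def hive_insert_overwrite(partition_rows, selected_rows):
    """Model of the Hive behavior: existing data is always deleted."""
    return list(selected_rows)

existing = ["row1", "row2"]
print(impala_insert_overwrite(existing, []))  # existing rows survive
print(hive_insert_overwrite(existing, []))    # partition ends up empty
```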

Deviation from Hive behavior: Out-of-range float/double values are returned as the maximum allowed value of the type (Hive returns NULL)

Impala behavior differs from Hive with respect to out-of-range float/double values. Impala returns an out-of-range value as the maximum allowed value of the type, whereas Hive returns NULL.

Severity: Low

Workaround: None

Deviation from Hive behavior: Impala does not do implicit casts between string and numeric and boolean types.

Severity: Low

Anticipated Resolution: None

Workaround: Use explicit casts.

If Hue and Impala are installed on the same host, and if you configure Hue Beeswax in CDH 4.1 to execute Impala queries, Beeswax cannot list Hive tables and shows an error on Beeswax startup.

Hue requires Beeswaxd to be running in order to list the Hive tables. Because of a port conflict bug in Hue in CDH4.1 when Hue and Impala are installed on the same host, an error page is displayed when you start the Beeswax application, and when you open the Tables page in Beeswax.

Severity: High

Anticipated Resolution: Fixed in an upcoming CDH4 release

Workarounds: Choose exactly one of the following workarounds:

  • Install Hue and Impala on different hosts.
  • Upgrade to CDH4.1.2 and add the following property in the beeswax section of the /etc/hue/hue.ini configuration file:
    beeswax_meta_server_only=9004

  • If you are using CDH4.1.1 and you want to install Hue and Impala on the same host, change the code in this file:
    /usr/share/hue/apps/beeswax/src/beeswax/management/commands/beeswax_server.py

    Replace line 66:

    str(beeswax.conf.BEESWAX_SERVER_PORT.get()),

    With this line:

    '8004',

    Beeswaxd will then use port 8004.

    Note: If you used Cloudera Manager to install Impala, refer to the Cloudera Manager release notes for information about using an equivalent workaround by specifying the beeswax_meta_server_only=9004 configuration value in the Hue Service Configuration Safety Valve.

Known Issues Fixed in the 1.0.1 Release

This section lists the most significant issues fixed in Impala 1.0.1. For the full list of fixed issues, see this report in the JIRA system.

Impala Parquet scanner cannot read all data files generated by other frameworks

Impala might issue an erroneous error message when processing a Parquet data file produced by a non-Impala Hadoop component.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-333

Severity: High

Resolution: Fixed

Impala is unable to query RCFile tables which describe fewer columns than the file's header.

If an RCFile table definition had fewer columns than the fields actually in the data files, queries would fail.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-293

Severity: High

Resolution: Fixed

Impala does not correctly substitute _HOST with hostname in --principal

The _HOST placeholder in the --principal startup option was not substituted with the correct hostname, potentially leading to a startup error in setups using Kerberos authentication.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-351

Severity: High

Resolution: Fixed

HBase query missed the last region

A query for an HBase table could omit data from the last region.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-

Severity: High

Resolution: Fixed

HBase region changes are not handled correctly

After a region in an HBase table was split or moved, an Impala query might return incomplete or out-of-date results.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-300

Severity: High

Resolution: Fixed

Query state for successful create table is EXCEPTION

After a successful CREATE TABLE statement, the corresponding query state would be incorrectly reported as EXCEPTION.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-349

Severity: High

Resolution: Fixed

Double check release of JNI-allocated byte-strings

Operations involving calls to the Java JNI subsystem (for example, queries on HBase tables) could allocate memory but not release it.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-358

Severity: High

Resolution: Fixed

Impala returns 0 for bad time values in UNIX_TIMESTAMP, Hive returns NULL

Impala returns 0 for bad time values in UNIX_TIMESTAMP, whereas Hive returns NULL.

Impala:

impala> select UNIX_TIMESTAMP('10:02:01') ;
impala> 0

Hive:

hive> select UNIX_TIMESTAMP('10:02:01') FROM tmp;
hive> NULL

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-16

Severity: Low

Anticipated Resolution: Fixed

INSERT INTO TABLE SELECT <constant> does not work.

INSERT INTO TABLE SELECT <constant> will not insert any data and may return an error.

Severity: Low

Anticipated Resolution: Fixed

Distributed outer join returns wrong result

When you execute an outer join query in distributed mode, Impala will return more rows than expected (that is, the wrong result).

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-90

Severity: Medium

Anticipated Resolution: To be fixed in a future release

Workaround: None

Known Issues Fixed in the 1.0 GA Release

Here are the major user-visible issues fixed in Impala 1.0. For a full list of fixed issues, see this report in the public issue tracker.

Impala does not properly modify DATETIME and TIMESTAMP partition keys via DML or DDL statements

Impala does not properly modify DATETIME and TIMESTAMP partition keys via DML or DDL statements. Modifying partition columns of this datatype may result in failed queries and/or table metadata loading problems.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-238

Severity: High

Resolution: Fixed

Nondeterministically receive "ERROR: unknown row batch destination..." and "ERROR: Invalid query handle" from impala-shell when running a UNION query

A query containing both UNION and LIMIT clauses could intermittently cause the impalad process to halt with a segmentation fault.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-183

Severity: High

Resolution: Fixed

Insert with NULL partition keys results in SIGSEGV.

An INSERT statement specifying a NULL value for one of the partitioning columns could cause the impalad process to halt with a segmentation fault.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-190

Severity: High

Resolution: Fixed

INSERT queries don't show completed profiles on the debug webpage

In the Impala web user interface, the profile page for an INSERT statement showed obsolete information for the statement once it was complete.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-217

Severity: High

Resolution: Fixed

Impala HBase scan is very slow

Queries involving an HBase table could be slower than expected, due to excessive memory usage on the Impala nodes.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-231

Severity: High

Resolution: Fixed

Add some library version validation logic to impalad when loading impala-lzo shared library

No validation was done to check that the impala-lzo shared library was compatible with the version of Impala, possibly leading to a crash when using LZO-compressed text files.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-234

Severity: High

Resolution: Fixed

Workaround: Always upgrade the impala-lzo library at the same time as you upgrade Impala itself.

Problems inserting into tables with TIMESTAMP partition columns leading to table metadata loading failures and failed dchecks

INSERT statements for tables partitioned on columns involving datetime types could appear to succeed, but cause errors for subsequent queries on those tables. The problem was especially serious if an improperly formatted timestamp value was specified for the partition key.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-238

Severity: Critical

Resolution: Fixed

Ctrl-C sometimes interrupts shell in system call, rather than cancelling query

Pressing Ctrl-C in the impala-shell interpreter could sometimes display an error and return control to the shell, making it impossible to cancel the query.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-243

Severity: Critical

Resolution: Fixed

Empty string partition value causes metastore update failure

Specifying an empty string or NULL for a partition key in an INSERT statement would fail.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-252

Severity: High

Resolution: Fixed. The behavior for empty partition keys was made more compatible with the corresponding Hive behavior.

Round() does not output the right precision

The round() function did not always return the correct number of significant digits.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-266

Severity: High

Resolution: Fixed

Cannot cast string literal to string

Casting from a string literal back to the same type would cause an "invalid type cast" error rather than leaving the original value unchanged.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-267

Severity: High

Resolution: Fixed

Excessive mem usage for certain queries which are very selective

Some queries that returned very few rows experienced unnecessary memory usage.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-288

Severity: High

Resolution: Fixed

HdfsScanNode crashes in UpdateCounters

A serious error could occur for relatively small and inexpensive queries.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-289

Severity: High

Resolution: Fixed

Parquet performance issues on large dataset

Certain aggregation queries against Parquet tables were inefficient due to lower than required thread utilization.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-292

Severity: High

Resolution: Fixed

Impala not populating Hive metadata correctly for CREATE TABLE

The Impala CREATE TABLE command did not fill in the owner and tbl_type columns in the Hive metastore database.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-295

Severity: High

Resolution: Fixed. The metadata was made more Hive-compatible.

Impala daemons die if the statestore goes down

The impalad instances in a cluster could halt when the statestored process became unavailable.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-312

Severity: High

Resolution: Fixed

Constant SELECT clauses do not work in subqueries

A subquery would fail if the SELECT statement inside it returned a constant value rather than querying a table.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-67

Severity: High

Resolution: Fixed

Right outer Join includes NULLs as well and hence wrong result count

The result set from a right outer join query could include erroneous rows containing NULL values.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-90

Severity: High

Resolution: Fixed

Parquet scanner hangs for some queries

The Parquet scanner non-deterministically hangs when executing some queries.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-204

Severity: Medium

Resolution: Fixed

Known Issues Fixed in Version 0.7 of the Beta Release

Impala does not gracefully handle unsupported Hive table types (INDEX and VIEW tables)

When attempting to load metadata from an unsupported Hive table type (INDEX and VIEW tables), Impala fails with an unclear error message.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-167

Severity: Low

Resolution: Fixed in 0.7

DDL statements (CREATE/ALTER/DROP TABLE) are not supported in the Impala Beta Release

Severity: Medium

Resolution: Fixed in 0.7

Avro is not supported in the Impala Beta Release

Severity: Medium

Resolution: Fixed in 0.7

Workaround: None

Impala does not currently allow limiting the memory consumption of a single query

It is currently not possible to limit the memory consumption of a single query. All tables on the right hand side of JOIN statements need to be able to fit in memory. If they do not, Impala may crash due to out of memory errors.

Severity: High

Resolution: Fixed in 0.7

Aggregate of a subquery result set returns wrong results if the subquery contains a 'limit' and data is distributed across multiple nodes

An aggregate over a subquery result set returns wrong results if the subquery contains a LIMIT clause and the data is distributed across multiple nodes. The query plan suggests that the results from each node are simply summed.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-20

Severity: Low

Resolution: Fixed in 0.7
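
To see why summing per-node results gives the wrong answer, consider a COUNT(*) over a subquery with LIMIT 2 on a three-node cluster. The sketch below is a simplified model of one plausible reading of the faulty plan, not Impala's planner code; the function names and sample data are hypothetical.

```python
def count_with_per_node_limit(node_rows, limit):
    """Faulty model: each node applies LIMIT locally, then the counts are summed."""
    return sum(min(len(rows), limit) for rows in node_rows)

def count_with_global_limit(node_rows, limit):
    """Expected semantics: LIMIT applies once to the combined subquery result."""
    total = sum(len(rows) for rows in node_rows)
    return min(total, limit)

# Three nodes, each holding three rows of the same table (illustrative data).
nodes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(count_with_per_node_limit(nodes, 2))  # larger than the limit
print(count_with_global_limit(nodes, 2))    # bounded by the limit
```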

Partition pruning for arbitrary predicates that are fully bound by a particular partition column

Impala cannot currently use a predicate such as "country_code in ('DE', 'FR', 'US')" for partition pruning, because pruning requires an equality predicate or a binary comparison.

We should create a superclass of planner.ValueRange, ValueSet, that can be constructed with an arbitrary predicate, and whose isInRange(analyzer, valueExpr) constructs a literal predicate by substitution of the valueExpr into the predicate.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-144

Severity: Medium

Resolution: Fixed in 0.7
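
The idea behind pruning on an arbitrary predicate is to evaluate the predicate once per partition key value and skip partitions for which it is false. The following is a minimal sketch of that idea only; it does not reproduce the planner.ValueRange class or any Impala code, and the function name and partition values are hypothetical.

```python
def prune_partitions(partition_values, predicate):
    """Keep only the partitions whose key value satisfies the predicate."""
    return [value for value in partition_values if predicate(value)]

# Hypothetical partition key values for a table partitioned by country_code.
partitions = ["DE", "FR", "JP", "UK", "US"]

# Stand-in for the predicate country_code in ('DE', 'FR', 'US').
wanted = ("DE", "FR", "US")
kept = prune_partitions(partitions, lambda country_code: country_code in wanted)
print(kept)
```

Because the predicate is passed in as a callable, the same routine covers equality tests, comparisons, and IN lists without special cases.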

Known Issues Fixed in Version 0.6 of the Beta Release

Impala reads the NameNode address and port as command line parameters

Impala reads the NameNode address and port as command line parameters rather than reading them from core-site.xml. Updating the NameNode address in the core-site.xml file does not propagate to Impala.

Severity: Low

Resolution: Fixed in 0.6 - Impala reads the namenode location and port from the Hadoop configuration files, though setting -nn and -nn_port overrides this. Users are advised not to set -nn or -nn_port.
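
The precedence described in the resolution (flags override the configuration file) can be pictured as follows. This is an illustrative sketch, not Impala's startup code: it assumes the NameNode URI is stored in the Hadoop fs.defaultFS property of core-site.xml, and the function names, host names, and sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

def namenode_from_core_site(xml_text, prop="fs.defaultFS"):
    """Return (host, port) from a core-site.xml document, or (None, None)."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("property"):
        if entry.findtext("name") == prop:
            uri = urlparse(entry.findtext("value") or "")
            return uri.hostname, uri.port
    return None, None

def resolve_namenode(xml_text, nn=None, nn_port=None):
    """Flags, when given, take precedence over the configuration file."""
    host, port = namenode_from_core_site(xml_text)
    return (nn if nn is not None else host,
            nn_port if nn_port is not None else port)

# Sample configuration with a made-up host name.
CORE_SITE = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://nn1.example.com:8020</value></property>
</configuration>"""

print(resolve_namenode(CORE_SITE))
print(resolve_namenode(CORE_SITE, nn="old-nn.example.com"))
```

The second call shows why leaving -nn and -nn_port unset is advisable: a stale flag value silently wins over an updated configuration file.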

Queries may fail on secure environment due to impalad Kerberos ticket expiration

Queries may fail in a secure environment because the impalad Kerberos tickets expire. This can happen if the Impala -kerberos_reinit_interval flag is set to a value of ten minutes or less, which may lead to an impalad requesting a ticket with a lifetime that is less than the time until the next ticket renewal.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-64

Severity: Medium

Resolution: Fixed in 0.6

Concurrent queries may fail when Impala uses Thrift to communicate with the Hive Metastore

Concurrent queries may fail when Impala is using Thrift to communicate with part of the Hive Metastore such as the Hive Metastore Service. In such a case, the error "get_fields failed: out of sequence response" may occur because Impala shared a single Hive Metastore Client connection across threads. With Impala 0.6, a separate connection is used for each metadata request.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-48

Severity: Low

Resolution: Fixed in 0.6

impalad fails to start if unable to connect to the Hive Metastore

Impala fails to start if it is unable to establish a connection with the Hive Metastore. This behavior was fixed, allowing Impala to start, even when no Metastore is available.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-58

Severity: Low

Resolution: Fixed in 0.6

Impala treats database names as case-sensitive in some contexts

In some queries (including "USE database" statements), database names are treated as case-sensitive. This may lead queries to fail with an IllegalStateException.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-44

Severity: Medium

Resolution: Fixed in 0.6

Impala does not ignore hidden HDFS files

Impala does not ignore hidden HDFS files, meaning those files prefixed with a period '.' or underscore '_'. This diverges from Hive/MapReduce, which skips these files.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-18

Severity: Low

Resolution: Fixed in 0.6
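
The naming rule for hidden files is simple enough to express as a filter. The following is a minimal sketch of the rule as stated above, not code taken from Impala, Hive, or MapReduce; the function names and sample paths are hypothetical.

```python
import posixpath

def is_hidden(path):
    """Return True if the final path component starts with '.' or '_'."""
    name = posixpath.basename(path.rstrip("/"))
    return name.startswith((".", "_"))

def visible_files(paths):
    """Drop hidden files from a directory listing."""
    return [p for p in paths if not is_hidden(p)]

# Illustrative listing of a table directory.
listing = ["/warehouse/t/part-00000", "/warehouse/t/_SUCCESS", "/warehouse/t/.part-00001.crc"]
print(visible_files(listing))
```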

Known Issues Fixed in Version 0.5 of the Beta Release

Impala may have reduced performance on tables that contain a large number of partitions

Impala may have reduced performance on tables that contain a large number of partitions. This is due to extra overhead reading/parsing the partition metadata.

Severity: High

Resolution: Fixed in 0.5

Backend client connections not getting cached causes an observable latency in secure clusters

Backend impalads do not cache connections to the coordinator. On a secure cluster, this introduces a latency proportional to the number of backend clients involved in query execution, as the cost of establishing a secure connection is much higher than in the non-secure case.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-38

Severity: Medium

Resolution: Fixed in 0.5

Concurrent queries may fail with error: "Table object has not been been initialised : `PARTITIONS`"

Concurrent queries may fail with error: "Table object has not been been initialised : `PARTITIONS`". This was due to a lack of locking in the Impala table/database metadata cache.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-30

Severity: Medium

Resolution: Fixed in 0.5

UNIX_TIMESTAMP format behaviour deviates from Hive when format matches a prefix of the time value

The Impala UNIX_TIMESTAMP(val, format) operation compares the length of format and val and returns NULL if they do not match. Hive instead effectively truncates val to the length of the format parameter.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-15

Severity: Medium

Resolution: Fixed in 0.5
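
The two behaviors can be contrasted with a small model. This sketch is not the Impala or Hive implementation: it supports only a handful of format letters (yyyy, MM, dd, HH, mm, ss), interprets times as UTC, and all function names are hypothetical.

```python
import calendar
from datetime import datetime

# Minimal mapping from the format letters used here to strptime directives.
_FORMAT_MAP = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"),
               ("HH", "%H"), ("mm", "%M"), ("ss", "%S")]

def _parse(val, fmt):
    """Parse val with fmt and return seconds since the epoch (UTC), or None."""
    for letters, directive in _FORMAT_MAP:
        fmt = fmt.replace(letters, directive)
    try:
        return calendar.timegm(datetime.strptime(val, fmt).timetuple())
    except ValueError:
        return None

def strict_length(val, fmt):
    """Model of the pre-0.5 Impala behavior: lengths must match exactly."""
    if len(val) != len(fmt):
        return None
    return _parse(val, fmt)

def prefix_truncating(val, fmt):
    """Model of the Hive behavior: val is truncated to the length of fmt."""
    return _parse(val[:len(fmt)], fmt)

print(strict_length("2012-01-01 10:02:01", "yyyy-MM-dd"))
print(prefix_truncating("2012-01-01 10:02:01", "yyyy-MM-dd"))
```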

Known Issues Fixed in Version 0.4 of the Beta Release

Impala fails to refresh the Hive metastore if a Hive temporary configuration file is removed

Impala is impacted by Hive bug HIVE-3596 which may cause metastore refreshes to fail if a Hive temporary configuration file is deleted (normally located at /tmp/hive-<user>-<tmp_number>.xml). Additionally, the impala-shell will incorrectly report that the failed metadata refresh completed successfully.

Severity: Medium

Anticipated Resolution: To be fixed in a future release

Workaround: Restart the impalad service. Use the impalad log to check for metadata refresh errors.

lpad/rpad builtin functions are not correct.

The lpad/rpad builtin functions generate the wrong results.

Severity: Mild

Resolution: Fixed in 0.4

Files with .gz extension reported as 'not supported'

Compressed files with the .gz extension incorrectly generate an exception.

Cloudera Bug: https://issues.cloudera.org/browse/IMPALA-14

Severity: High

Resolution: Fixed in 0.4

Queries with large limits would hang.

Some queries with large limits were hanging.

Severity: High

Resolution: Fixed in 0.4

Order by on a string column produces incorrect results if there are empty strings

Severity: Low

Resolution: Fixed in 0.4

Known Issues Fixed in Version 0.3 of the Beta Release

All table loading errors show as unknown table

If Impala is unable to load the metadata for a table for any reason, a subsequent query referring to that table will return an unknown table error message, even if the table is known.

Severity: Mild

Resolution: Fixed in 0.3

A table that cannot be loaded will disappear from SHOW TABLES

After failing to load metadata for a table, Impala removes that table from the list of known tables returned in SHOW TABLES. Subsequent attempts to query the table return 'unknown table', even if the metadata for that table is fixed.

Severity: Mild

Resolution: Fixed in 0.3

Impala cannot read from HBase tables that are not created as external tables in the Hive metastore.

Attempting to select from these tables fails.

Severity: Medium

Resolution: Fixed in 0.3

Certain queries that contain OUTER JOINs may return incorrect results

Queries that contain OUTER JOINs may not return the correct results if there are predicates referencing any of the joined tables in the WHERE clause.

Severity: Medium

Resolution: Fixed in 0.3.

Known Issues Fixed in Version 0.2 of the Beta Release

Subqueries which contain aggregates cannot be joined with other tables or Impala may crash

Subqueries which contain an aggregate cannot be joined with another table or Impala may crash. For example: SELECT * FROM (SELECT sum(col1) FROM some_table GROUP BY col1) t1 JOIN other_table ON (...).

Severity: Medium

Resolution: Fixed in 0.2

An insert with a limit that runs as more than one query fragment inserts more rows than the limit.

For example:

INSERT OVERWRITE TABLE test SELECT * FROM test2 LIMIT 1;

Severity: Medium

Resolution: Fixed in 0.2

Query with limit clause might fail.

For example:

SELECT * FROM test2 LIMIT 1;

Severity: Medium

Resolution: Fixed in 0.2

Files in unsupported compression formats are read as plain text.

Attempting to read such files does not generate a diagnostic.

Severity: Medium

Resolution: Fixed in 0.2

Impala server raises a null pointer exception when running an HBase query.

When querying an HBase table whose row-key is string type, the Impala server may raise a null pointer exception.

Severity: Medium

Resolution: Fixed in 0.2