This is the documentation for Cloudera Impala 1.2.4.
Documentation for other versions is available at Cloudera Documentation.

Query Options for the SET Command

You can specify the following options within an impala-shell session, and those settings affect all queries issued from that session.

Some query options are useful in day-to-day operations for improving usability, performance, or flexibility.

Other query options control special-purpose aspects of Impala operation and are intended primarily for advanced debugging or troubleshooting.

Continue reading:

ABORT_ON_DEFAULT_LIMIT_EXCEEDED

Used in conjunction with DEFAULT_ORDER_BY_LIMIT to make sure results of ORDER BY queries are not truncated by accident. If the result set of an ORDER BY query with no LIMIT clause exceeds the value of the DEFAULT_ORDER_BY_LIMIT option, the query is cancelled rather than returning the incomplete result set.

Type: Boolean

Default: false (shown as 0 in output of SET command)

Query Options Affecting ORDER BY and LIMIT Clauses

When the LIMIT clause is specified, DEFAULT_ORDER_BY_LIMIT and ABORT_ON_DEFAULT_LIMIT_EXCEEDED have no effect:

[localhost:21000] > select x from three_rows order by x limit 5;
Query: select x from three_rows order by x limit 5
Query finished, fetching results ...
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.27s

When only DEFAULT_ORDER_BY_LIMIT is specified, the result set could be truncated:

[localhost:21000] > set default_order_by_limit=5;
DEFAULT_ORDER_BY_LIMIT set to 5
[localhost:21000] > select x from ten_rows order by x;
Query: select x from ten_rows order by x
Query finished, fetching results ...
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+---+
Returned 5 row(s) in 0.30s

When ABORT_ON_DEFAULT_LIMIT_EXCEEDED is specified, the query is cancelled rather than returning a truncated result set:

[localhost:21000] > set abort_on_default_limit_exceeded=true;
ABORT_ON_DEFAULT_LIMIT_EXCEEDED set to true
[localhost:21000] > select x from ten_rows order by x;
Query: select x from ten_rows order by x
Query aborted, unable to fetch data

Backend 0:DEFAULT_ORDER_BY_LIMIT has been exceeded.

ABORT_ON_ERROR

When this option is enabled, Impala cancels a query immediately when any of the nodes encounters an error, rather than continuing and possibly returning incomplete results. This option is enabled by default to hep you gather maximum diagnostic information when an errors occurs, for example, whether the same problem occurred on all nodes or only a single node. Currently, the errors that Impala can skip over involve data corruption, such as a column that contains a string value when expected to contain an integer value.

To control how much logging Impala does for non-fatal errors when ABORT_ON_ERROR is turned off, use the MAX_ERRORS option.

Type: BOOLEAN

Default: false (shown as 0 in output of SET command)

ALLOW_UNSUPPORTED_FORMATS

An obsolete query option from early work on support for file formats. Do not use. Might be removed in the future.

Type: BOOLEAN

Default: false (shown as 0 in output of SET command)

BATCH_SIZE

Number of rows evaluated at a time by SQL operators. Unspecified or a size of 0 uses a predefined default size. Primarily for Cloudera testing.

Default: 0 (meaning 1024)

DEBUG_ACTION

Introduces artificial problem conditions within queries. For internal Cloudera debugging and troubleshooting.

Type: STRING

Default: empty string

DEFAULT_ORDER_BY_LIMIT

Impala queries that use the ORDER BY clause must also include a LIMIT clause, to avoid accidentally producing huge result sets that must be sorted. (Sorting a huge result set is a memory-intensive operation, and no results are returned until the sort finishes, making the query seem less responsive.)

To avoid changing your source code to add a LIMIT clause to every query that uses ORDER BY, you can set the DEFAULT_ORDER_BY_LIMIT query option to the largest number of rows you would ever want or expect to be returned by an ORDER BY query. For example, you might set DEFAULT_ORDER_BY_LIMIT=10 immediately before issuing a query where you only care about the top 10 results. Or you might set DEFAULT_ORDER_BY_LIMIT=1000000 as a sanity check, to make sure any ORDER BY queries never return more than a million rows by accident; the cap would have no effect on queries that return less than a million rows.

The default value of -1 signifies no upper limit on the size of the result set, in which case each ORDER BY query must have a LIMIT clause in the SQL statement rather than enforced through this query option.

Default: -1 (no default limit)

Query Options Affecting ORDER BY and LIMIT Clauses

When the LIMIT clause is specified, DEFAULT_ORDER_BY_LIMIT and ABORT_ON_DEFAULT_LIMIT_EXCEEDED have no effect:

[localhost:21000] > select x from three_rows order by x limit 5;
Query: select x from three_rows order by x limit 5
Query finished, fetching results ...
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.27s

When only DEFAULT_ORDER_BY_LIMIT is specified, the result set could be truncated:

[localhost:21000] > set default_order_by_limit=5;
DEFAULT_ORDER_BY_LIMIT set to 5
[localhost:21000] > select x from ten_rows order by x;
Query: select x from ten_rows order by x
Query finished, fetching results ...
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+---+
Returned 5 row(s) in 0.30s

When ABORT_ON_DEFAULT_LIMIT_EXCEEDED is specified, the query is cancelled rather than returning a truncated result set:

[localhost:21000] > set abort_on_default_limit_exceeded=true;
ABORT_ON_DEFAULT_LIMIT_EXCEEDED set to true
[localhost:21000] > select x from ten_rows order by x;
Query: select x from ten_rows order by x
Query aborted, unable to fetch data

Backend 0:DEFAULT_ORDER_BY_LIMIT has been exceeded.

DISABLE_CODEGEN

This is a debug option, intended for diagnosing and working around issues that cause crashes. If a query fails with an "illegal instruction" or other hardware-specific message, try setting DISABLE_CODEGEN=true and running the query again. If the query succeeds only when the DISABLE_CODEGEN option is turned on, submit the problem to Cloudera support and include that detail in the problem report. Do not otherwise run with this setting turned on, because it results in lower overall performance.

Because the code generation phase adds a small amount of overhead for each query, you might turn on the DISABLE_CODEGEN option to achieve maximum throughput when running many short-lived queries against small tables.

Type: BOOLEAN

Default: false (shown as 0 in output of SET command)

EXPLAIN_LEVEL

Setting this option to verbose or 1 enables extra information in the output of the EXPLAIN command. Setting the option to normal or 0 suppresses the extra information. The extended information is especially useful during performance tuning, when you need to confirm if table and column statistics are available for a query. The extended information also helps to check estimated resource usage when you use the resource management feature in CDH 5. See EXPLAIN Statement for details about the extended information and how to use it.

HBASE_CACHE_BLOCKS

Setting this option is equivalent to calling the setCacheBlocks method of the class org.apache.hadoop.hbase.client.Scan, in an HBase Java application. Helps to control the memory pressure on the HBase region server, in conjunction with the HBASE_CACHING query option. See HBASE_CACHING for details.

Type: BOOLEAN

Default: false (shown as 0 in output of SET command)

HBASE_CACHING

Setting this option is equivalent to calling the setCaching method of the class org.apache.hadoop.hbase.client.Scan, in an HBase Java application. Helps to control the memory pressure on the HBase region server, in conjunction with the HBASE_CACHE_BLOCKS query option. See HBASE_CACHE_BLOCKS for details.

Type: BOOLEAN

Default: 0

MAX_ERRORS

Maximum number of non-fatal errors for any particular query that are recorded in the Impala log file. For example, if a billion-row table had a non-fatal data error in every row, you could diagnose the problem without all billion errors being logged. Unspecified or 0 indicates the built-in default value of 1000.

This option only controls how many errors are reported. To specify whether Impala continues or halts when it encounters such errors, use the ABORT_ON_ERROR option.

Default: 0 (meaning 1000 errors)

MAX_IO_BUFFERS

Deprecated query option. Currently has no effect.

Default: 0

MAX_SCAN_RANGE_LENGTH

Maximum length of the scan range. Interacts with the number of HDFS blocks in the table to determine how many CPU cores across the cluster are involved with the processing for a query. (Each core processes one scan range.)

Lowering the value can sometimes increase parallelism if you have unused CPU capacity, but a too-small value can limit query performance because each scan range involves extra overhead.

Only applicable to HDFS tables. Has no effect on Parquet tables. Unspecified or 0 indicates backend default, which is the same as the HDFS block size for each table, typically several megabytes for most file formats, or 1GB for Parquet tables.

Although the scan range can be arbitrarily long, Impala internally uses an 8MB read buffer so that it can query tables with huge block sizes without allocating equivalent blocks of memory.

Default: 0

MEM_LIMIT

When resource management is not enabled, defines the maximum amount of memory a query can allocate on each node. If query processing exceeds the specified memory limit on any node, Impala cancels the query automatically. Memory limits are checked periodically during query processing, so the actual memory in use might briefly exceed the limit without the query being cancelled.

When resource management is enabled in CDH 5, the mechanism for this option changes. If set, it overrides the automatic memory estimate from Impala. Impala requests this amount of memory from YARN on each node, and the query does not proceed until that much memory is available. The actual memory used by the query could be lower, since some queries use much less memory than others. With resource management, the MEM_LIMIT setting acts both as a hard limit on the amount of memory a query can use on any node (enforced by YARN and a guarantee that that much memory will be available on each node while the query is being executed. When resource management is enabled but no MEM_LIMIT setting is specified, Impala estimates the amount of memory needed on each node for each query, requests that much memory from YARN before starting the query, and then internally sets the MEM_LIMIT on each node to the requested amount of memory during the query. Thus, if the query takes more memory than was originally estimated, Impala detects that the MEM_LIMIT is exceeded and cancels the query itself.

Default: 0

NUM_NODES

Used during debugging to limit the number of nodes that process a query. Only accepts the values 0 (meaning all nodes) or 1 (meaning all work is done on the coordinator node). If you are diagnosing a problem that you suspect is due to a timing issue due to distributed query processing, you can set NUM_NODES=1 to verify if the problem still occurs when all the work is done on a single node.

Default: 0

NUM_SCANNER_THREADS

Maximum number of scanner threads (across the whole cluster) used for each query. By default, Impala uses as many cores as are available (one thread per core). You might lower this value if queries are using excessive resources on a busy cluster. Impala imposes a maximum value automatically, so a high value has no practical effect.

Default: 0

PARQUET_COMPRESSION_CODEC

When Impala writes Parquet data files using the INSERT statement, the underlying compression is controlled by the PARQUET_COMPRESSION_CODEC query option. The allowed values for this query option are snappy (the default), gzip, and none. The option value is not case-sensitive. See Snappy and GZip Compression for Parquet Data Files for details and examples.

If the option is set to an unrecognized value, all kinds of queries will fail due to the invalid option setting, not just queries involving Parquet tables.

Default: SNAPPY

PARQUET_FILE_SIZE

Specifies the maximum size of each Parquet data file produced by Impala INSERT statements. For small or partitioned tables where the default Parquet block size of 1GB is much larger than needed for each data file, you can increase parallelism by specifying a smaller size, resulting in more HDFS blocks that can be processed by different nodes. Reducing the file size also reduces the memory required to buffer each block before writing it to disk.

Default: 0 (produces files with a maximum size of 1 gigabyte)

RESERVATION_REQUEST_TIMEOUT (CDH 5 Only)

Maximum number of milliseconds Impala will wait for a reservation to be completely granted or denied. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher with CDH 5.

Default: 300000 (5 minutes)

SUPPORT_START_OVER

Leave this setting false.

Default: false

SYNC_DDL

When enabled, causes any DDL operation such as CREATE TABLE or ALTER TABLE to return only when the changes have been propagated to all other Impala nodes in the cluster by the Impala catalog service. That way, if you issue a subsequent CONNECT statement in impala-shell to connect to a different node in the cluster, you can be sure that other node will already recognize any added or changed tables. (The catalog service automatically broadcasts the DDL changes to all nodes automatically, but without this option there could be a period of inconsistency if you quickly switched to another node.)

Although INSERT is classified as a DML statement, when the SYNC_DDL option is enabled, INSERT statements also delay their completion until all the underlying data and metadata changes are propagated to all Impala nodes. Internally, Impala inserts have similarities with DDL statements in traditional database systems, because they create metadata needed to track HDFS block locations for new files and they potentially add new partitions to partitioned tables.

  Note: Because this option can introduce a delay after each write operation, if you are running a sequence of CREATE DATABASE, CREATE TABLE, ALTER TABLE, INSERT, and similar statements within a setup script, to minimize the overall delay you can enable the SYNC_DDL query option only near the end, before the final DDL statement.

Default: false

V_CPU_CORES (CDH 5 Only)

The number of per-host virtual CPU cores to request from YARN. If set, the query option overrides the automatic estimate from Impala. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5.

Default: 0 (use automatic estimates)

YARN_POOL (CDH 5 Only)

The YARN pool/queue name that queries should be submitted to. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5. Specifies the name of the pool used by resource requests from Impala to the YARN resource management framework.

Default: empty (use the user-to-pool mapping defined by an impalad startup option in the Impala configuration file)