This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Query Options for the SET Command

You can specify the following options within an impala-shell session, and those settings affect all queries issued from that session.

Some query options are useful in day-to-day operations for improving usability, performance, or flexibility.

Other query options control special-purpose aspects of Impala operation and are intended primarily for advanced debugging or troubleshooting.

  Note: Currently, there is no way to set query options directly through the JDBC and ODBC interfaces. For JDBC and ODBC applications, you can execute queries that need specific query options by invoking impala-shell to run a script that starts with SET commands, or by defining query options globally through the impalad startup flag --default_query_options.
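For example, a JDBC or ODBC application could delegate option-sensitive queries to a script run through impala-shell. The file name, option value, and table name below are illustrative only:

$ cat query_with_options.sql
set mem_limit=2000000000;
select count(*) from sample_table;
$ impala-shell -f query_with_options.sql

Alternatively, an option can be applied to all queries cluster-wide at daemon startup (again with an illustrative value):

impalad --default_query_options=mem_limit=2000000000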

Continue reading:

ABORT_ON_DEFAULT_LIMIT_EXCEEDED

Used in conjunction with DEFAULT_ORDER_BY_LIMIT to make sure results of ORDER BY queries are not truncated by accident. If the result set of an ORDER BY query with no LIMIT clause exceeds the value of the DEFAULT_ORDER_BY_LIMIT option, the query is cancelled rather than returning the incomplete result set.

Type: Boolean

Default: false (shown as 0 in output of SET command)

Query Options Affecting ORDER BY and LIMIT Clauses

When the LIMIT clause is specified, DEFAULT_ORDER_BY_LIMIT and ABORT_ON_DEFAULT_LIMIT_EXCEEDED have no effect:

[localhost:21000] > select x from three_rows order by x limit 5;
Query: select x from three_rows order by x limit 5
Query finished, fetching results ...
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Returned 3 row(s) in 0.27s

When only DEFAULT_ORDER_BY_LIMIT is specified, the result set could be truncated:

[localhost:21000] > set default_order_by_limit=5;
DEFAULT_ORDER_BY_LIMIT set to 5
[localhost:21000] > select x from ten_rows order by x;
Query: select x from ten_rows order by x
Query finished, fetching results ...
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+---+
Returned 5 row(s) in 0.30s

When ABORT_ON_DEFAULT_LIMIT_EXCEEDED is specified, the query is cancelled rather than returning a truncated result set:

[localhost:21000] > set abort_on_default_limit_exceeded=true;
ABORT_ON_DEFAULT_LIMIT_EXCEEDED set to true
[localhost:21000] > select x from ten_rows order by x;
Query: select x from ten_rows order by x
Query aborted, unable to fetch data

Backend 0:DEFAULT_ORDER_BY_LIMIT has been exceeded.

ABORT_ON_ERROR

When this option is enabled, Impala cancels a query immediately when any of the nodes encounters an error, rather than continuing and possibly returning incomplete results. This option is disabled by default, to help you gather maximum diagnostic information when an error occurs, for example, whether the same problem occurred on all nodes or only a single node. Currently, the errors that Impala can skip over involve data corruption, such as a column that contains a string value when it is expected to contain an integer value.

To control how much logging Impala does for non-fatal errors when ABORT_ON_ERROR is turned off, use the MAX_ERRORS option.

Type: Boolean

Default: false (shown as 0 in output of SET command)
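For example, to scan past corrupt rows while still capping how many non-fatal errors are logged, you might leave ABORT_ON_ERROR off and lower MAX_ERRORS (table name and value illustrative):

[localhost:21000] > set abort_on_error=false;
ABORT_ON_ERROR set to false
[localhost:21000] > set max_errors=100;
MAX_ERRORS set to 100
[localhost:21000] > select count(*) from possibly_corrupt_table;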

ALLOW_UNSUPPORTED_FORMATS

An obsolete query option from early work on support for file formats. Do not use. Might be removed in the future.

Type: Boolean

Default: false (shown as 0 in output of SET command)

BATCH_SIZE

Number of rows evaluated at a time by SQL operators. Unspecified or a size of 0 uses a predefined default size. Primarily for Cloudera testing.

Default: 0 (meaning 1024)

DEBUG_ACTION

Introduces artificial problem conditions within queries. For internal Cloudera debugging and troubleshooting.

Type: STRING

Default: empty string

DEFAULT_ORDER_BY_LIMIT

Impala queries that use the ORDER BY clause must also include a LIMIT clause, to avoid accidentally producing huge result sets that must be sorted. (Sorting a huge result set is a memory-intensive operation, and no results are returned until the sort finishes, making the query seem less responsive.)

To avoid changing your source code to add a LIMIT clause to every query that uses ORDER BY, you can set the DEFAULT_ORDER_BY_LIMIT query option to the largest number of rows you would ever want or expect to be returned by an ORDER BY query. For example, you might set DEFAULT_ORDER_BY_LIMIT=10 immediately before issuing a query where you only care about the top 10 results. Or you might set DEFAULT_ORDER_BY_LIMIT=1000000 as a sanity check, to make sure any ORDER BY queries never return more than a million rows by accident; the cap would have no effect on queries that return less than a million rows.

The default value of -1 signifies no upper limit on the size of the result set; in that case, each ORDER BY query must include a LIMIT clause in the SQL statement itself rather than relying on this query option.

Default: -1 (no default limit)

Query Options Affecting ORDER BY and LIMIT Clauses

For examples showing how DEFAULT_ORDER_BY_LIMIT and ABORT_ON_DEFAULT_LIMIT_EXCEEDED interact with the LIMIT clause, see the examples under ABORT_ON_DEFAULT_LIMIT_EXCEEDED, earlier on this page.

DISABLE_CODEGEN

This is a debug option, intended for diagnosing and working around issues that cause crashes. If a query fails with an "illegal instruction" or other hardware-specific message, try setting DISABLE_CODEGEN=true and running the query again. If the query succeeds only when the DISABLE_CODEGEN option is turned on, submit the problem to Cloudera support and include that detail in the problem report. Do not otherwise run with this setting turned on, because it results in lower overall performance.

Because the code generation phase adds a small amount of overhead for each query, you might turn on the DISABLE_CODEGEN option to achieve maximum throughput when running many short-lived queries against small tables.

Type: Boolean

Default: false (shown as 0 in output of SET command)
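For example, if a query crashes with a hardware-specific error, you might retry it with code generation disabled (the query shown is illustrative):

[localhost:21000] > set disable_codegen=true;
DISABLE_CODEGEN set to true
[localhost:21000] > select x from t1 where x > 0;

If the query succeeds only with this setting, include that detail when reporting the problem to Cloudera support.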

EXPLAIN_LEVEL

Controls the amount of detail provided in the output of the EXPLAIN statement. The basic output can help you identify high-level performance issues such as scanning a higher volume of data or more partitions than you expect. The higher levels of detail show how intermediate results flow between nodes and how different SQL operations such as ORDER BY, GROUP BY, joins, and WHERE clauses are implemented within a distributed query.

Type: STRING or INT

Default: 1 (might be incorrectly reported as 0 in output of SET command)

Arguments:

The allowed range of numeric values for this option is 0 to 3:

  • 0 or MINIMAL: A barebones list, one line per operation. Primarily useful for checking the join order in very long queries where the regular EXPLAIN output is too long to read easily.
  • 1 or STANDARD: The default level of detail, showing the logical way that work is split up for the distributed query.
  • 2 or EXTENDED: Includes additional detail about how the query planner uses statistics in its decision-making process, to understand how a query could be tuned by gathering statistics, using query hints, adding or removing predicates, and so on.
  • 3 or VERBOSE: The maximum level of detail, showing how work is split up within each node into "query fragments" that are connected in a pipeline. This extra detail is primarily useful for low-level performance testing and tuning within Impala itself, rather than for rewriting the SQL code at the user level.
  Note: Prior to Impala 1.3, the allowed argument range for EXPLAIN_LEVEL was 0 to 1: level 0 had the mnemonic NORMAL, and level 1 was VERBOSE. In Impala 1.3 and higher, NORMAL is not a valid mnemonic value, and VERBOSE still applies to the highest level of detail but now corresponds to level 3. You might need to adjust the values if you have any older impala-shell script files that set the EXPLAIN_LEVEL query option.

Changing the value of this option controls the amount of detail in the output of the EXPLAIN statement. The extended information from level 2 or 3 is especially useful during performance tuning, when you need to confirm whether the work for the query is distributed the way you expect, particularly for the most resource-intensive operations such as join queries against large tables, queries against tables with large numbers of partitions, and insert operations for Parquet tables. The extended information also helps to check estimated resource usage when you use the admission control or resource management features explained in Impala Administration. See EXPLAIN Statement for the syntax of the EXPLAIN statement, and Using the EXPLAIN Plan for Performance Tuning for details about how to use the extended information.

Usage notes:

As always, read the EXPLAIN output from bottom to top. The lowest lines represent the initial work of the query (scanning data files), the lines in the middle represent calculations done on each node and how intermediate results are transmitted from one node to another, and the topmost lines represent the final results being sent back to the coordinator node.

The numbers in the left column are generated internally during the initial planning phase and do not represent the actual order of operations, so it is not significant if they appear out of order in the EXPLAIN output.

At all EXPLAIN levels, the plan contains a warning if any tables in the query are missing statistics. Use the COMPUTE STATS statement to gather statistics for each table and suppress this warning. See How Impala Uses Statistics for Query Optimization for details about how the statistics help query performance.

The PROFILE command in impala-shell always starts with an explain plan showing full detail, the same as with EXPLAIN_LEVEL=3.

Examples:

These examples use a trivial, empty table to illustrate how the essential aspects of query planning are shown in EXPLAIN output:

[localhost:21000] > create table t1 (x int, s string);
[localhost:21000] > set explain_level=1;
[localhost:21000] > explain select count(*) from t1;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1                           |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| explain_plan.t1                                                                    |
|                                                                                    |
| 03:AGGREGATE [MERGE FINALIZE]                                                      |
| |  output: sum(count(*))                                                           |
| |                                                                                  |
| 02:EXCHANGE [PARTITION=UNPARTITIONED]                                              |
| |                                                                                  |
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HDFS [explain_plan.t1]                                                     |
|    partitions=1/1 size=0B                                                          |
+------------------------------------------------------------------------------------+
[localhost:21000] > explain select * from t1;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0             |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| explain_plan.t1                                                                    |
|                                                                                    |
| 01:EXCHANGE [PARTITION=UNPARTITIONED]                                              |
| |                                                                                  |
| 00:SCAN HDFS [explain_plan.t1]                                                     |
|    partitions=1/1 size=0B                                                          |
+------------------------------------------------------------------------------------+
[localhost:21000] > set explain_level=2;
[localhost:21000] > explain select * from t1;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0             |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| explain_plan.t1                                                                    |
|                                                                                    |
| 01:EXCHANGE [PARTITION=UNPARTITIONED]                                              |
| |  hosts=0 per-host-mem=unavailable                                                |
| |  tuple-ids=0 row-size=19B cardinality=unavailable                                |
| |                                                                                  |
| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                                   |
|    partitions=1/1 size=0B                                                          |
|    table stats: unavailable                                                        |
|    column stats: unavailable                                                       |
|    hosts=0 per-host-mem=0B                                                         |
|    tuple-ids=0 row-size=19B cardinality=unavailable                                |
+------------------------------------------------------------------------------------+
[localhost:21000] > set explain_level=3;
[localhost:21000] > explain select * from t1;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0             |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| explain_plan.t1                                                                    |
|                                                                                    |
| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED]                                        |
|   01:EXCHANGE [PARTITION=UNPARTITIONED]                                            |
|      hosts=0 per-host-mem=unavailable                                              |
|      tuple-ids=0 row-size=19B cardinality=unavailable                              |
|                                                                                    |
| F00:PLAN FRAGMENT [PARTITION=RANDOM]                                               |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED]             |
|   00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                                 |
|      partitions=1/1 size=0B                                                        |
|      table stats: unavailable                                                      |
|      column stats: unavailable                                                     |
|      hosts=0 per-host-mem=0B                                                       |
|      tuple-ids=0 row-size=19B cardinality=unavailable                              |
+------------------------------------------------------------------------------------+

As the warning message demonstrates, most of the information needed for Impala to do efficient query planning, and for you to understand the performance characteristics of the query, requires running the COMPUTE STATS statement for the table:

[localhost:21000] > compute stats t1;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 2 column(s). |
+-----------------------------------------+
[localhost:21000] > explain select * from t1;
+------------------------------------------------------------------------+
| Explain String                                                         |
+------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
|                                                                        |
| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED]                            |
|   01:EXCHANGE [PARTITION=UNPARTITIONED]                                |
|      hosts=0 per-host-mem=unavailable                                  |
|      tuple-ids=0 row-size=20B cardinality=0                            |
|                                                                        |
| F00:PLAN FRAGMENT [PARTITION=RANDOM]                                   |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] |
|   00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM]                     |
|      partitions=1/1 size=0B                                            |
|      table stats: 0 rows total                                         |
|      column stats: all                                                 |
|      hosts=0 per-host-mem=0B                                           |
|      tuple-ids=0 row-size=20B cardinality=0                            |
+------------------------------------------------------------------------+

Joins and other complicated, multi-part queries are the ones where you most commonly need to examine the EXPLAIN output and customize the amount of detail in the output. This example shows the default EXPLAIN output for a three-way join query, then the equivalent output with a [SHUFFLE] hint to change the join mechanism between the first two tables from a broadcast join to a shuffle join.

[localhost:21000] > set explain_level=1;
[localhost:21000] > explain select one.*, two.*, three.* from t1 one, t1 two, t1 three where one.x = two.x and two.x = three.x;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3                            |
|                                                                                    |
| 07:EXCHANGE [PARTITION=UNPARTITIONED]                                              |
| |                                                                                  |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                                               |
| |  hash predicates: two.x = three.x                                                |
| |                                                                                  |
| |--06:EXCHANGE [BROADCAST]                                                         |
| |  |                                                                               |
| |  02:SCAN HDFS [explain_plan.t1 three]                                            |
| |     partitions=1/1 size=0B                                                       |
| |                                                                                  |
| 03:HASH JOIN [INNER JOIN, BROADCAST]                                               |
| |  hash predicates: one.x = two.x                                                  |
| |                                                                                  |
| |--05:EXCHANGE [BROADCAST]                                                         |
| |  |                                                                               |
| |  01:SCAN HDFS [explain_plan.t1 two]                                              |
| |     partitions=1/1 size=0B                                                       |
| |                                                                                  |
| 00:SCAN HDFS [explain_plan.t1 one]                                                 |
|    partitions=1/1 size=0B                                                          |
+------------------------------------------------------------------------------------+
[localhost:21000] > explain select one.*, two.*, three.* from t1 one join [shuffle] t1 two join t1 three where one.x = two.x and two.x = three.x;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3                            |
|                                                                                    |
| 08:EXCHANGE [PARTITION=UNPARTITIONED]                                              |
| |                                                                                  |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                                               |
| |  hash predicates: two.x = three.x                                                |
| |                                                                                  |
| |--07:EXCHANGE [BROADCAST]                                                         |
| |  |                                                                               |
| |  02:SCAN HDFS [explain_plan.t1 three]                                            |
| |     partitions=1/1 size=0B                                                       |
| |                                                                                  |
| 03:HASH JOIN [INNER JOIN, PARTITIONED]                                             |
| |  hash predicates: one.x = two.x                                                  |
| |                                                                                  |
| |--06:EXCHANGE [PARTITION=HASH(two.x)]                                             |
| |  |                                                                               |
| |  01:SCAN HDFS [explain_plan.t1 two]                                              |
| |     partitions=1/1 size=0B                                                       |
| |                                                                                  |
| 05:EXCHANGE [PARTITION=HASH(one.x)]                                                |
| |                                                                                  |
| 00:SCAN HDFS [explain_plan.t1 one]                                                 |
|    partitions=1/1 size=0B                                                          |
+------------------------------------------------------------------------------------+

For a join involving many different tables, the default EXPLAIN output might stretch over several pages, and the only details you care about might be the join order and the mechanism (broadcast or shuffle) for joining each pair of tables. In that case, you might set EXPLAIN_LEVEL to its lowest value of 0, to focus on just the join order and join mechanism for each stage. The following example shows how the rows from the first and second joined tables are hashed and divided among the nodes of the cluster for further filtering; then the entire contents of the third table are broadcast to all nodes for the final stage of join processing.

[localhost:21000] > set explain_level=0;
[localhost:21000] > explain select one.*, two.*, three.* from t1 one join [shuffle] t1 two join t1 three where one.x = two.x and two.x = three.x;
+---------------------------------------------------------+
| Explain String                                          |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
|                                                         |
| 08:EXCHANGE [PARTITION=UNPARTITIONED]                   |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                    |
| |--07:EXCHANGE [BROADCAST]                              |
| |  02:SCAN HDFS [explain_plan.t1 three]                 |
| 03:HASH JOIN [INNER JOIN, PARTITIONED]                  |
| |--06:EXCHANGE [PARTITION=HASH(two.x)]                  |
| |  01:SCAN HDFS [explain_plan.t1 two]                   |
| 05:EXCHANGE [PARTITION=HASH(one.x)]                     |
| 00:SCAN HDFS [explain_plan.t1 one]                      |
+---------------------------------------------------------+

HBASE_CACHE_BLOCKS

Setting this option is equivalent to calling the setCacheBlocks method of the class org.apache.hadoop.hbase.client.Scan, in an HBase Java application. Helps to control the memory pressure on the HBase region server, in conjunction with the HBASE_CACHING query option. See HBASE_CACHING for details.

Type: Boolean

Default: false (shown as 0 in output of SET command)

HBASE_CACHING

Setting this option is equivalent to calling the setCaching method of the class org.apache.hadoop.hbase.client.Scan, in an HBase Java application. Helps to control the memory pressure on the HBase region server, in conjunction with the HBASE_CACHE_BLOCKS query option. See HBASE_CACHE_BLOCKS for details.

Type: INT

Default: 0
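For example, before a large scan of an HBase table you might allow more rows per fetch while avoiding filling the region server block cache (table name and value illustrative):

[localhost:21000] > set hbase_cache_blocks=false;
HBASE_CACHE_BLOCKS set to false
[localhost:21000] > set hbase_caching=1000;
HBASE_CACHING set to 1000
[localhost:21000] > select count(*) from hbase_mapped_table;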

MAX_ERRORS

Maximum number of non-fatal errors for any particular query that are recorded in the Impala log file. For example, if a billion-row table had a non-fatal data error in every row, you could diagnose the problem without all billion errors being logged. Unspecified or 0 indicates the built-in default value of 1000.

This option only controls how many errors are reported. To specify whether Impala continues or halts when it encounters such errors, use the ABORT_ON_ERROR option.

Default: 0 (meaning 1000 errors)

MAX_IO_BUFFERS

Deprecated query option. Currently has no effect.

Default: 0

MAX_SCAN_RANGE_LENGTH

Maximum length of the scan range. Interacts with the number of HDFS blocks in the table to determine how many CPU cores across the cluster are involved with the processing for a query. (Each core processes one scan range.)

Lowering the value can sometimes increase parallelism if you have unused CPU capacity, but a too-small value can limit query performance because each scan range involves extra overhead.

Only applicable to HDFS tables. Has no effect on Parquet tables. Unspecified or 0 indicates the backend default, which is the same as the HDFS block size for each table: typically several megabytes for most file formats, or 1 GB for Parquet tables.

Although the scan range can be arbitrarily long, Impala internally uses an 8 MB read buffer so that it can query tables with huge block sizes without allocating equivalent blocks of memory.

Default: 0
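For example, to experiment with scan ranges smaller than the HDFS block size, you might specify the length in bytes (value and table name illustrative; 16 MB shown):

[localhost:21000] > set max_scan_range_length=16777216;
MAX_SCAN_RANGE_LENGTH set to 16777216
[localhost:21000] > select count(*) from large_hdfs_table;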

MEM_LIMIT

When resource management is not enabled, defines the maximum amount of memory a query can allocate on each node. If query processing exceeds the specified memory limit on any node, Impala cancels the query automatically. Memory limits are checked periodically during query processing, so the actual memory in use might briefly exceed the limit without the query being cancelled.

When resource management is enabled in CDH 5, the behavior of this option changes. If set, it overrides the automatic memory estimate from Impala. Impala requests this amount of memory from YARN on each node, and the query does not proceed until that much memory is available. The actual memory used by the query could be lower, since some queries use much less memory than others. With resource management, the MEM_LIMIT setting acts both as a hard limit on the amount of memory a query can use on any node (enforced by YARN) and as a guarantee that that much memory will be available on each node while the query is being executed. When resource management is enabled but no MEM_LIMIT setting is specified, Impala estimates the amount of memory needed on each node for each query, requests that much memory from YARN before starting the query, and then internally sets the MEM_LIMIT on each node to the requested amount of memory during the query. Thus, if the query takes more memory than was originally estimated, Impala detects that the MEM_LIMIT is exceeded and cancels the query itself.

Default: 0
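For example, to cap a memory-intensive query at roughly 2 GB per node, you might specify the limit in bytes (value and query illustrative):

[localhost:21000] > set mem_limit=2000000000;
MEM_LIMIT set to 2000000000
[localhost:21000] > select x, count(*) from big_table group by x;

If processing on any node exceeds the limit, the query is cancelled automatically.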

NUM_NODES

Used during debugging to limit the number of nodes that process a query. Only accepts the values 0 (meaning all nodes) or 1 (meaning all work is done on the coordinator node). If you are diagnosing a problem that you suspect is due to a timing issue in distributed query processing, you can set NUM_NODES=1 to verify whether the problem still occurs when all the work is done on a single node.

Default: 0

NUM_SCANNER_THREADS

Maximum number of scanner threads (on each node) used for each query. By default, Impala uses as many cores as are available (one thread per core). You might lower this value if queries are using excessive resources on a busy cluster. Impala imposes a maximum value automatically, so a high value has no practical effect.

Default: 0

  Note: Currently, a known issue (IMPALA-488) could cause excessive memory usage during a COMPUTE STATS operation on a Parquet table. As a workaround, issue the command SET NUM_SCANNER_THREADS=2 in impala-shell before issuing the COMPUTE STATS statement. Then issue UNSET NUM_SCANNER_THREADS before continuing with queries.

PARQUET_COMPRESSION_CODEC

When Impala writes Parquet data files using the INSERT statement, the underlying compression is controlled by the PARQUET_COMPRESSION_CODEC query option. The allowed values for this query option are snappy (the default), gzip, and none. The option value is not case-sensitive. See Snappy and GZip Compression for Parquet Data Files for details and examples.

If the option is set to an unrecognized value, all kinds of queries will fail due to the invalid option setting, not just queries involving Parquet tables.

Default: SNAPPY
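For example, to trade some CPU time for better compression when writing Parquet files (table names illustrative):

[localhost:21000] > set parquet_compression_codec=gzip;
PARQUET_COMPRESSION_CODEC set to gzip
[localhost:21000] > insert into parquet_table select * from text_table;

If subsequent INSERT statements should use the default codec, remember to set the option back to snappy afterward.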

PARQUET_FILE_SIZE

Specifies the maximum size of each Parquet data file produced by Impala INSERT statements. For small or partitioned tables where the default Parquet block size of 1 GB is much larger than needed for each data file, you can increase parallelism by specifying a smaller size, resulting in more HDFS blocks that can be processed by different nodes. Reducing the file size also reduces the memory required to buffer each block before writing it to disk.

Default: 0 (produces files with a maximum size of 1 gigabyte)
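For example, to produce files of roughly 256 MB instead of the default 1 GB, you might specify the size in bytes (value and table names illustrative):

[localhost:21000] > set parquet_file_size=268435456;
PARQUET_FILE_SIZE set to 268435456
[localhost:21000] > insert overwrite parquet_table select * from text_table;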

REQUEST_POOL

The pool or queue name that queries should be submitted to. Only applies when you enable the Impala admission control feature (CDH 4 or CDH 5; see Admission Control and Query Queuing), or the YARN resource management feature (CDH 5 only; see Using YARN Resource Management with Impala (CDH 5 Only)). Specifies the name of the pool used by requests from Impala to the resource manager.

Formerly known as YARN_POOL during the CDH 5 beta period. Renamed to reflect that it can be used both with YARN and with the lightweight admission control feature introduced in Impala 1.3.

Default: empty (use the user-to-pool mapping defined by an impalad startup option in the Impala configuration file)
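For example, to route subsequent queries from this session to a particular pool (pool and table names illustrative):

[localhost:21000] > set request_pool=reporting_pool;
REQUEST_POOL set to reporting_pool
[localhost:21000] > select * from sales_summary;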

RESERVATION_REQUEST_TIMEOUT (CDH 5 Only)

Maximum number of milliseconds Impala will wait for a reservation to be completely granted or denied. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher with CDH 5.

Default: 300000 (5 minutes)

SUPPORT_START_OVER

Leave this setting false.

Default: false

SYNC_DDL

When enabled, causes any DDL operation such as CREATE TABLE or ALTER TABLE to return only when the changes have been propagated to all other Impala nodes in the cluster by the Impala catalog service. That way, if you issue a subsequent CONNECT statement in impala-shell to connect to a different node in the cluster, you can be sure that the other node will already recognize any added or changed tables. (The catalog service broadcasts the DDL changes to all nodes automatically, but without this option there could be a brief period of inconsistency if you quickly switched to another node.)

Although INSERT is classified as a DML statement, when the SYNC_DDL option is enabled, INSERT statements also delay their completion until all the underlying data and metadata changes are propagated to all Impala nodes. Internally, Impala inserts have similarities with DDL statements in traditional database systems, because they create metadata needed to track HDFS block locations for new files and they potentially add new partitions to partitioned tables.

  Note: Because this option can introduce a delay after each write operation, if you are running a sequence of CREATE DATABASE, CREATE TABLE, ALTER TABLE, INSERT, and similar statements within a setup script, to minimize the overall delay you can enable the SYNC_DDL query option only near the end, before the final DDL statement.

Default: false
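For example, in a setup script you might enable SYNC_DDL only before the final statement, so the earlier statements run without the propagation delay (object names illustrative):

create database analytics;
create table analytics.t1 (x int);
set sync_ddl=true;
insert into analytics.t1 select x from staging_table;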

V_CPU_CORES (CDH 5 Only)

The number of per-host virtual CPU cores to request from YARN. If set, the query option overrides the automatic estimate from Impala. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5.

Default: 0 (use automatic estimates)

Page generated September 3, 2015.