EXEC_SINGLE_NODE_ROWS_THRESHOLD Query Option (CDH 5.3 or higher only)

This setting controls the cutoff point (in terms of number of rows scanned) below which Impala treats a query as a "small" query, turning off optimizations such as parallel execution and native code generation. The overhead for these optimizations is applicable for queries involving substantial amounts of data, but it makes sense to skip them for queries involving tiny amounts of data. Reducing the overhead for small queries allows Impala to complete them more quickly, keeping YARN resources, admission control slots, and so on available for data-intensive queries.

Syntax:

SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=number_of_rows

Type: numeric

Default: 100

Usage notes: Typically, you increase the default value to make this optimization apply to more queries. If incorrect or corrupted table and column statistics cause Impala to apply this optimization incorrectly to queries that actually involve substantial work, you might see the queries being slower as a result of remote reads. In that case, recompute statistics with the COMPUTE STATS or COMPUTE INCREMENTAL STATS statement. If there is a problem collecting accurate statistics, you can turn this feature off by setting the value to -1.

Internal details:

This setting applies to query fragments where the amount of data to scan can be accurately determined, either through table and column statistics, or by the presence of a LIMIT clause. If Impala cannot accurately estimate the size of the input data, this setting does not apply.

In CDH 5.5 / Impala 2.3 and higher, where Impala supports the complex data types STRUCT, ARRAY, and MAP, if a query refers to any column of those types, the small-query optimization is turned off for that query regardless of the EXEC_SINGLE_NODE_ROWS_THRESHOLD setting.

For a query that is determined to be "small", all work is performed on the coordinator node. This might result in some I/O being performed by remote reads. The savings from not distributing the query work and not generating native code are expected to outweigh any overhead from the remote reads.

Added in: CDH 5.13

Examples:

A common use case is to query just a few rows from a table to inspect typical data values. In this example, Impala does not parallelize the query or perform native code generation because the result set is guaranteed to be smaller than the threshold value from this query option:

SET EXEC_SINGLE_NODE_ROWS_THRESHOLD=500;
SELECT * FROM enormous_table LIMIT 300;