This is the documentation for Cloudera Impala 2.0.0.
Documentation for other versions is available at Cloudera.com.

EXPLAIN Statement

Returns the execution plan for a statement, showing the low-level mechanisms that Impala will use to read the data, divide the work among nodes in the cluster, and transmit intermediate and final results across the network. Use explain followed by a complete SELECT query. For example:

Syntax:

EXPLAIN { select_query | ctas_stmt | insert_stmt }

The select_query is a SELECT statement, optionally prefixed by a WITH clause. See SELECT Statement for details.

The insert_stmt is an INSERT statement that inserts into or overwrites an existing table. It can use either the INSERT ... SELECT or INSERT ... VALUES syntax. See INSERT Statement for details.

The ctas_stmt is a CREATE TABLE statement using the AS SELECT clause, typically abbreviated as a "CTAS" operation. See CREATE TABLE Statement for details.

Usage notes:

You can interpret the output to judge whether the query is performing efficiently, and adjust the query and/or the schema if not. For example, you might change the tests in the WHERE clause, add hints to make join operations more efficient, introduce subqueries, change the order of tables in a join, add or change partitioning for a table, collect column statistics and/or table statistics in Hive, or any other performance tuning steps.

The EXPLAIN output reminds you if table or column statistics are missing from any table involved in the query. These statistics are important for optimizing queries involving large tables or multi-table joins. See COMPUTE STATS Statement for how to gather statistics, and How Impala Uses Statistics for Query Optimization for how to use this information for query tuning.

Read the EXPLAIN plan from bottom to top:
  • The last part of the plan shows the low-level details such as the expected amount of data that will be read, where you can judge the effectiveness of your partitioning strategy and estimate how long it will take to scan a table based on total data size and the size of the cluster.
  • As you work your way up, next you see the operations that will be parallelized and performed on each Impala node.
  • At the higher levels, you see how data flows when intermediate result sets are combined and transmitted from one node to another.
  • See EXPLAIN_LEVEL for details about the EXPLAIN_LEVEL query option, which lets you customize how much detail to show in the EXPLAIN plan depending on whether you are doing high-level or low-level tuning, dealing with logical or physical aspects of the query.

If you come from a traditional database background and are not familiar with data warehousing, keep in mind that Impala is optimized for full table scans across very large tables. The structure and distribution of this data is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP environments. Seeing a query scan entirely through a large table is common, not necessarily an indication of an inefficient query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for example by using a query that affects only certain partitions within a partitioned table, then you might be able to optimize a query so that it executes in seconds rather than minutes.

For more information and examples to help you interpret EXPLAIN output, see Using the EXPLAIN Plan for Performance Tuning.

Extended EXPLAIN output:

For performance tuning of complex queries, and capacity planning (such as using the admission control and resource management features), you can enable more detailed and informative output for the EXPLAIN statement. In the impala-shell interpreter, issue the command SET EXPLAIN_LEVEL=level, where level is an integer from 0 to 3 or corresponding mnemonic values minimal, standard, extended, or verbose.

When extended EXPLAIN output is enabled, EXPLAIN statements print information about estimated memory requirements, minimum number of virtual cores, and so on that you can use to fine-tune the resource management options explained in impalad Startup Options for Resource Management. (The estimated memory requirements are intentionally on the high side, to allow a margin for error, to avoid cancelling a query unnecessarily if you set the MEM_LIMIT option to the estimated memory figure.)

See EXPLAIN_LEVEL for details and examples.

Examples:

This example shows how the standard EXPLAIN output moves from the lowest (physical) level to the higher (logical) levels. The query begins by scanning a certain amount of data; each node performs an aggregation operation (evaluating COUNT(*)) on some subset of data that is local to that node; the intermediate results are transmitted back to the coordinator node (labelled here as the EXCHANGE node); lastly, the intermediate results are summed to display the final result.

[impalad-host:21000] > explain select count(*) from customer_address;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
|                                                          |
| 03:AGGREGATE [MERGE FINALIZE]                            |
| |  output: sum(count(*))                                 |
| |                                                        |
| 02:EXCHANGE [PARTITION=UNPARTITIONED]                    |
| |                                                        |
| 01:AGGREGATE                                             |
| |  output: count(*)                                      |
| |                                                        |
| 00:SCAN HDFS [default.customer_address]                  |
|    partitions=1/1 size=5.25MB                            |
+----------------------------------------------------------+

These examples show how the extended EXPLAIN output becomes more accurate and informative as statistics are gathered by the COMPUTE STATS statement. Initially, much of the information about data size and distribution is marked "unavailable". Impala can determine the raw data size, but not the number of rows or number of distinct values for each column without additional analysis. The COMPUTE STATS statement performs this analysis, so a subsequent EXPLAIN statement has additional information to use in deciding how to optimize the distributed query.

[localhost:21000] > set explain_level=extended;
EXPLAIN_LEVEL set to extended
[localhost:21000] > explain select x from t1;
[localhost:21000] > explain select x from t1;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=1 |
|                                                          |
| 01:EXCHANGE [PARTITION=UNPARTITIONED]                    |
| |  hosts=1 per-host-mem=unavailable                      |
| |  tuple-ids=0 row-size=4B cardinality=unavailable       |
| |                                                        |
| 00:SCAN HDFS [default.t2, PARTITION=RANDOM]              |
|    partitions=1/1 size=36B                               |
|    table stats: unavailable                              |
|    column stats: unavailable                             |
|    hosts=1 per-host-mem=32.00MB                          |
|    tuple-ids=0 row-size=4B cardinality=unavailable       |
+----------------------------------------------------------+
[localhost:21000] > compute stats t1;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > explain select x from t1;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=64.00MB VCores=1 |
|                                                          |
| 01:EXCHANGE [PARTITION=UNPARTITIONED]                    |
| |  hosts=1 per-host-mem=unavailable                      |
| |  tuple-ids=0 row-size=4B cardinality=0                 |
| |                                                        |
| 00:SCAN HDFS [default.t1, PARTITION=RANDOM]              |
|    partitions=1/1 size=36B                               |
|    table stats: 0 rows total                             |
|    column stats: all                                     |
|    hosts=1 per-host-mem=64.00MB                          |
|    tuple-ids=0 row-size=4B cardinality=0                 |
+----------------------------------------------------------+

Cancellation: Cannot be cancelled.