This is the documentation for Cloudera Impala 2.0.0.
Documentation for other versions is available at Cloudera.com.

Hints

The Impala SQL dialect supports query hints, for fine-tuning the inner workings of queries. Specify hints as a temporary workaround for expensive queries, where missing statistics or other factors cause inefficient performance.

Syntax:

You can represent the hints as keywords surrounded by [] square brackets; include the brackets in the text of the SQL statement.

SELECT select_list FROM
join_left_hand_table
  JOIN [{BROADCAST|SHUFFLE}]
join_right_hand_table
remainder_of_query;

INSERT insert_clauses
  [{SHUFFLE|NOSHUFFLE}]
  SELECT remainder_of_query;

In Impala 2.0 and higher, or CDH 5.2 and higher, you can also specify the hints inside comments that use either the /* */ or -- notation. Specify a + symbol immediately before the hint name.

SELECT select_list FROM
join_left_hand_table
  JOIN /* +BROADCAST|SHUFFLE */
join_right_hand_table
remainder_of_query;

SELECT select_list FROM
join_left_hand_table
  JOIN -- +BROADCAST|SHUFFLE
join_right_hand_table
remainder_of_query;

INSERT insert_clauses
  /* +SHUFFLE|NOSHUFFLE */
  SELECT remainder_of_query;

INSERT insert_clauses
  -- +SHUFFLE|NOSHUFFLE
  SELECT remainder_of_query;

Hints for join queries:

The [BROADCAST] and [SHUFFLE] hints control the execution strategy for join queries. Specify one of the following constructs immediately after the JOIN keyword in a query:

  • [SHUFFLE] - Makes that join operation use the "partitioned" technique, which divides up corresponding rows from both tables using a hashing algorithm, sending subsets of the rows to other nodes for processing. (The keyword SHUFFLE is used to indicate a "partitioned join", because that type of join is not related to "partitioned tables".) Since the alternative "broadcast" join mechanism is the default when table and index statistics are unavailable, you might use this hint for queries where broadcast joins are unsuitable; typically, partitioned joins are more efficient for joins between large tables of similar size.
  • [BROADCAST] - Makes that join operation use the "broadcast" technique that sends the entire contents of the right-hand table to all nodes involved in processing the join. This is the default mode of operation when table and index statistics are unavailable, so you would typically only need it if stale metadata caused Impala to mistakenly choose a partitioned join operation. Typically, broadcast joins are more efficient in cases where one table is much smaller than the other. (Put the smaller table on the right side of the JOIN operator.)

Hints for INSERT ... SELECT queries:

When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in the INSERT statement to fine-tune the overall performance of the operation and its resource usage:
  • These hints are available in Impala 1.2.2 and higher.
  • You would only use these hints if an INSERT into a partitioned Parquet table was failing due to capacity limits, or if such an INSERT was succeeding but with less-than-optimal performance.
  • To use these hints, put the hint keyword [SHUFFLE] or [NOSHUFFLE] (including the square brackets) after the PARTITION clause, immediately before the SELECT keyword.
  • [SHUFFLE] selects an execution plan that minimizes the number of files being written simultaneously to HDFS, and the number of 1 GB memory buffers holding data for individual partitions. Thus it reduces overall resource usage for the INSERT operation, allowing some INSERT operations to succeed that otherwise would fail. It does involve some data transfer between the nodes so that the data files for a particular partition are all constructed on the same node.
  • [NOSHUFFLE] selects an execution plan that might be faster overall, but might also produce a larger number of small data files or exceed capacity limits, causing the INSERT operation to fail. Use [SHUFFLE] in cases where an INSERT statement fails or runs inefficiently due to all nodes attempting to construct data for all partitions.
  • Impala automatically uses the [SHUFFLE] method if any partition key column in the source table, mentioned in the INSERT ... SELECT query, does not have column statistics. In this case, only the [NOSHUFFLE] hint would have any effect.
  • If column statistics are available for all partition key columns in the source table mentioned in the INSERT ... SELECT query, Impala chooses whether to use the [SHUFFLE] or [NOSHUFFLE] technique based on the estimated number of distinct values in those columns and the number of nodes involved in the INSERT operation. In this case, you might need the [SHUFFLE] or the [NOSHUFFLE] hint to override the execution plan selected by Impala.

Usage notes:

To reduce the need to use hints, run the COMPUTE STATS statement against all tables involved in joins, or used as the source tables for INSERT ... SELECT operations where the destination is a partitioned Parquet table. Do this operation after loading data or making substantial changes to the data within each table. Having up-to-date statistics helps Impala choose more efficient query plans without the need for hinting.

To see which join strategy is used for a particular query, examine the EXPLAIN output for that query.

Restrictions:

  Note:

Because hints can prevent queries from taking advantage of new metadata or improvements in query planning, use them only when required to work around performance issues, and be prepared to remove them when they are no longer required, such as after a new Impala release or bug fix.

In particular, the [BROADCAST] and [SHUFFLE] hints are expected to be needed much less frequently in Impala 1.2.2 and higher, because the join order optimization feature in combination with the COMPUTE STATS statement now automatically choose join order and join mechanism without the need to rewrite the query and add hints. See Performance Considerations for Join Queries for details.

Compatibility:

The hints embedded within -- comments are compatible with Hive queries. The hints embedded within /* */ comments or [ ] square brackets are not recognized by or not compatible with Hive. For example, Hive raises an error for Impala hints within /* */ comments because it does not recognize the Impala hint names.

Considerations for views:

If you use a hint in the query that defines a view, the hint is preserved when you query the view. Impala internally rewrites all hints in views to use the -- comment notation, so that Hive can query such views without errors due to unrecognized hint names.

Examples:

For example, this query joins a large customer table with a small lookup table of less than 100 rows. The right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the [broadcast] hint to force a broadcast join strategy:

select customer.address, state_lookup.state_name
  from customer join [broadcast] state_lookup
  on customer.state_id = state_lookup.state_id;

This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of hints and find that it is more efficient to transmit portions of each table to other nodes for processing. Thus, you would use the [shuffle] hint to force a partitioned join strategy:

select weather.wind_velocity, geospatial.altitude
  from weather join [shuffle] geospatial
  on weather.lat = geospatial.lat and weather.long = geospatial.long;

For joins involving three or more tables, the hint applies to the tables on either side of that specific JOIN keyword. The joins are processed from left to right. For example, this query joins t1 and t2 using a partitioned join, then joins that result set to t3 using a broadcast join:

select t1.name, t2.id, t3.price
  from t1 join [shuffle] t2 join [broadcast] t3
  on t1.id = t2.id and t2.id = t3.id;

For more background information and performance considerations for join queries, see Joins.