This is the documentation for Cloudera Impala 2.1.x.
Documentation for other versions is available at Cloudera.com.

How Impala Uses Statistics for Query Optimization

Impala can do better optimization for complex or multi-table queries when it has access to statistics about the volume of data and how the values are distributed. Impala uses this information to help parallelize and distribute the work for a query. The following sections describe the categories of statistics Impala can work with, and how to produce them and keep them up to date.

Originally, Impala relied on the Hive mechanism for collecting statistics, through the Hive ANALYZE TABLE statement which initiates a MapReduce job. For better user-friendliness and reliability, Impala implements its own COMPUTE STATS statement in Impala 1.2.2 and higher, along with the DROP STATS, SHOW TABLE STATS, and SHOW COLUMN STATS statements.

Continue reading:

Overview of Table Statistics

The Impala query planner can make use of statistics about entire tables and partitions when that metadata is available in the metastore database. This metadata is used on its own for certain optimizations, and used in combination with column statistics for other optimizations.

To gather table statistics after loading data into a table or partition, use one of the following techniques:

  • Issue the statement COMPUTE STATS in Impala. This statement, available in Impala 1.2.2 and higher, is the preferred method because:
    • It gathers table statistics and statistics for all partitions and columns in a single operation.
    • It does not rely on any special Hive settings, metastore configuration, or separate database to hold the statistics.
    • If you need to adjust statistics incrementally for an existing table, such as after adding a partition or inserting new data, you can use an ALTER TABLE statement such as:
      alter table analysis_data set tblproperties('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK' = 'true');
      to update that one value numeric property rather than re-processing the whole table. (The requirement to include the STATS_GENERATED_VIA_STATS_TASK property is relatively new, as a result of the issue HIVE-8648 for the Hive metastore.)
  • Load the data through the INSERT OVERWRITE statement in Hive, while the Hive setting hive.stats.autogather is enabled.
  • Issue an ANALYZE TABLE statement in Hive, for the entire table or a specific partition.
    ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [NOSCAN];
    For example, to gather statistics for a non-partitioned table:
    ANALYZE TABLE customer COMPUTE STATISTICS;
    To gather statistics for a store table partitioned by state and city, and both of its partitions:
    ANALYZE TABLE store PARTITION(s_state, s_county) COMPUTE STATISTICS;
    To gather statistics for the store table and only the partitions for California:
    ANALYZE TABLE store PARTITION(s_state='CA', s_county) COMPUTE STATISTICS;

To check that table statistics are available for a table, and see the details of those statistics, use the statement SHOW TABLE STATS table_name. See SHOW Statement for details.

If you use the Hive-based methods of gathering statistics, see the Hive wiki for information about the required configuration on the Hive side. Cloudera recommends using the Impala COMPUTE STATS statement to avoid potential configuration and scalability issues with the statistics-gathering process.

Overview of Column Statistics

The Impala query planner can make use of statistics about individual columns when that metadata is available in the metastore database. This technique is most valuable for columns compared across tables in join queries, to help estimate how many rows the query will retrieve from each table. These statistics are also important for correlated subqueries using the EXISTS() or IN() operators, which are processed internally the same way as join queries.

  Note:

For column statistics to be effective in Impala, you also need to have table statistics for the applicable tables, as described in Overview of Table Statistics. When you use the Impala COMPUTE STATS statement, both table and column statistics are automatically gathered at the same time, for all columns in the table.

Currently, the COMPUTE STATS statement under CDH 4 does not store any statistics for DECIMAL columns. When Impala runs under CDH 5, which has better support for DECIMAL in the metastore database, COMPUTE STATS does collect statistics for DECIMAL columns and Impala uses the statistics to optimize query performance.

  Note: Prior to Impala 1.4.0, COMPUTE STATS counted the number of NULL values in each column and recorded that figure in the metastore database. Because Impala does not currently make use of the NULL count during query planning, Impala 1.4.0 and higher speeds up the COMPUTE STATS statement by skipping this NULL counting.

To check whether column statistics are available for a particular set of columns, use the SHOW COLUMN STATS table_name statement, or check the extended EXPLAIN output for a query against that table that refers to those columns. See SHOW Statement and EXPLAIN Statement for details.

Overview of Incremental Statistics

In Impala 2.1.0 and higher, you can use the syntax COMPUTE INCREMENTAL STATS and DROP INCREMENTAL STATS. The INCREMENTAL clauses work with incremental statistics, a specialized feature for partitioned tables that are large or frequently updated with new partitions.

When you compute incremental statistics for a partitioned table, by default Impala only processes those partitions that do not yet have incremental statistics. By processing only newly added partitions, you can keep statistics up to date for large partitioned tables, without incurring the overhead of reprocessing the entire table each time.

You can also compute or drop statistics for a single partition by including a PARTITION clause in the COMPUTE INCREMENTAL STATS or DROP INCREMENTAL STATS statement.

The metadata for incremental statistics is handled differently from the original style of statistics:

  • If you have an existing partitioned table for which you have already computed statistics, issuing COMPUTE INCREMENTAL STATS without a partition clause causes Impala to rescan the entire table. Once the incremental statistics are computed, any future COMPUTE INCREMENTAL STATS statements only scan any new partitions and any partitions where you performed DROP INCREMENTAL STATS.

  • The SHOW TABLE STATS and SHOW PARTITIONS statements now include an additional column showing whether incremental statistics are available for each column. A partition could already be covered by the original type of statistics based on a prior COMPUTE STATS statement, as indicated by a value other than -1 under the #Rows column. Impala query planning uses either kind of statistics when available.

  • COMPUTE INCREMENTAL STATS takes more time than COMPUTE STATS for the same volume of data. Therefore it is most suitable for tables with large data volume where new partitions are added frequently, making it impractical to run a full COMPUTE STATS operation for each new partition. For unpartitioned tables, or partitioned tables that are loaded once and not updated with new partitions, use the original COMPUTE STATS syntax.

  • COMPUTE INCREMENTAL STATS uses some memory in the catalogd process, proportional to the number of partitions and number of columns in the applicable table. The memory overhead is approximately 400 bytes for each column in each partition. This memory is reserved in the catalogd daemon, the statestored daemon, and in each instance of the impalad daemon.

  • In cases where new files are added to an existing partition, issue a REFRESH statement for the table, followed by a DROP INCREMENTAL STATS and COMPUTE INCREMENTAL STATS sequence for the changed partition.

  • The DROP INCREMENTAL STATS statement operates only on a single partition at a time. To remove statistics (whether incremental or not) from all partitions of a table, issue a DROP STATS statement with no INCREMENTAL or PARTITION clauses.

The following considerations apply to incremental statistics when the structure of an existing table is changed (known as schema evolution):

  • If you use an ALTER TABLE statement to drop a column, the existing statistics remain valid and COMPUTE INCREMENTAL STATS does not rescan any partitions.

  • If you use an ALTER TABLE statement to add a column, Impala rescans all partitions and fills in the appropriate column-level values the next time you run COMPUTE INCREMENTAL STATS.

  • If you use an ALTER TABLE statement to change the data type of a column, Impala rescans all partitions and fills in the appropriate column-level values the next time you run COMPUTE INCREMENTAL STATS.

  • If you use an ALTER TABLE statement to change the file format of a table, the existing statistics remain valid and a subsequent COMPUTE INCREMENTAL STATS does not rescan any partitions.

See COMPUTE STATS Statement and DROP STATS Statement for syntax details.

Detecting Missing Statistics

You can check whether a specific table has statistics using the SHOW TABLE STATS statement (for any table) or the SHOW PARTITIONS statement (for a partitioned table). Both statements display the same information. If a table or a partition does not have any statistics, the #Rows field contains -1. Once you compute statistics for the table or partition, the #Rows field changes to an accurate value.

The following example shows a table that initially does not have any statistics. The SHOW TABLE STATS statement displays different values for #Rows before and after the COMPUTE STATS operation.

[localhost:21000] > create table no_stats (x int);
[localhost:21000] > show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| -1    | 0      | 0B   | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > compute stats no_stats;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| 0     | 0      | 0B   | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+

The following example shows a similar progression with a partitioned table. Initially, #Rows is -1. After a COMPUTE STATS operation, #Rows changes to an accurate value. Any newly added partition starts with no statistics, meaning that you must collect statistics after adding a new partition.

[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint);
[localhost:21000] > show table stats no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1    | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1    | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > alter table no_stats_partitioned add partition (year=2013);
[localhost:21000] > compute stats no_stats_partitioned;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > alter table no_stats_partitioned add partition (year=2014);
[localhost:21000] > show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| 2013  | 0     | 0      | 0B   | NOT CACHED   | TEXT   | false             |
| 2014  | -1    | 0      | 0B   | NOT CACHED   | TEXT   | false             |
| Total | 0     | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
  Note: Because the default COMPUTE STATS statement creates and updates statistics for all partitions in a table, if you expect to frequently add new partitions, use the COMPUTE INCREMENTAL STATS syntax instead, which lets you compute stats for a single specified partition, or only for those partitions that do not already have incremental stats.

If checking each individual table is impractical, due to a large number of tables or views that hide the underlying base tables, you can also check for missing statistics for a particular query. Use the EXPLAIN statement to preview query efficiency before actually running the query. Use the query profile output available through the PROFILE command in impala-shell or the web UI to verify query execution and timing after running the query. Both the EXPLAIN plan and the PROFILE output display a warning if any tables or partitions involved in the query do not have statistics.

[localhost:21000] > create table no_stats (x int);
[localhost:21000] > explain select count(*) from no_stats;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1                           |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| incremental_stats.no_stats                                                         |
|                                                                                    |
| 03:AGGREGATE [FINALIZE]                                                            |
| |  output: count:merge(*)                                                          |
| |                                                                                  |
| 02:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HDFS [incremental_stats.no_stats]                                          |
|    partitions=1/1 files=0 size=0B                                                  |
+------------------------------------------------------------------------------------+

Because Impala uses the partition pruning technique when possible to only evaluate certain partitions, if you have a partitioned table with statistics for some partitions and not others, whether or not the EXPLAIN statement shows the warning depends on the actual partitions used by the query. For example, you might see warnings or not for different queries against the same table:

-- No warning because all the partitions for the year 2012 have stats.
EXPLAIN SELECT ... FROM t1 WHERE year = 2012;

-- Missing stats warning because one or more partitions in this range
-- do not have stats.
EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009;

To confirm if any partitions at all in the table are missing statistics, you might explain a query that scans the entire table, such as SELECT COUNT(*) FROM table_name.

Keeping Statistics Up to Date

When the contents of a table or partition change significantly, recompute the stats for the relevant table or partition. The degree of change that qualifies as "significant" varies, depending on the absolute and relative sizes of the tables. Typically, if you add more than 30% more data to a table, it is worthwhile to recompute stats, because the differences in number of rows and number of distinct values might cause Impala to choose a different join order when that table is used in join queries. This guideline is most important for the largest tables. For example, adding 30% new data to a table containing 1 TB has a greater effect on join order than adding 30% to a table containing only a few megabytes, and the larger table has a greater effect on query performance if Impala chooses a suboptimal join order as a result of outdated statistics.

If you reload a complete new set of data for a table, but the number of rows and number of distinct values for each column is relatively unchanged from before, you do not need to recompute stats for the table.

If the statistics for a table are out of date, and the table's large size makes it impractical to recompute new stats immediately, you can use the DROP STATS statement to remove the obsolete statistics, making it easier to identify tables that need a new COMPUTE STATS operation. Hints

For a large partitioned table, consider using the incremental stats feature available in Impala 2.1.0 and higher, as explained in Overview of Incremental Statistics. If you add a new partition to a table, it is worthwhile to recompute incremental stats, because the operation only scans the data for that one new partition.

Setting Statistics Manually through ALTER TABLE

The most crucial piece of data in all the statistics is the number of rows in the table (for an unpartitioned table) or for each partition (for a partitioned table). The COMPUTE STATS statement always gathers statistics about all columns, as well as overall table statistics. If it is not practical to do an entire COMPUTE STATS operation after adding a partition or inserting data, or if you can see that Impala would produce a more efficient plan if the number of rows was different, you can manually set the number of rows through an ALTER TABLE statement:

create table analysis_data stored as parquet as select * from raw_data;
Inserted 1000000000 rows in 181.98s
compute stats analysis_data;
insert into analysis_data select * from smaller_table_we_forgot_before;
Inserted 1000000 rows in 15.32s
-- Now there are 1001000000 rows. We can update this single data point in the stats.
alter table analysis_data set tblproperties('numRows'='1001000000');

For a partitioned table, update both the per-partition number of rows and the number of rows for the whole table:

-- If the table originally contained 1000000 rows, and we add another partition,
-- change the numRows property for the partition and the overall table.
alter table partitioned_data partition(year=2009, month=4) set tblproperties ('numRows'='30000');
alter table partitioned_data set tblproperties ('numRows'='1030000');

In practice, the COMPUTE STATS statement should be fast enough that this technique is not needed. It is most useful as a workaround for in case of performance issues where you might adjust the numRows value higher or lower to produce the ideal join order.

Examples of Using Table and Column Statistics with Impala

The following examples walk through a sequence of SHOW TABLE STATS, SHOW COLUMN STATS, ALTER TABLE, and SELECT and INSERT statements to illustrate various aspects of how Impala uses statistics to help optimize queries.

This example shows table and column statistics for the STORE column used in the TPC-DS benchmarks for decision support systems. It is a tiny table holding data for 12 stores. Initially, before any statistics are gathered by a COMPUTE STATS statement, most of the numeric fields show placeholder values of -1, indicating that the figures are unknown. The figures that are filled in are values that are easily countable or deducible at the physical level, such as the number of files, total data size of the files, and the maximum and average sizes for data types that have a constant size such as INT, FLOAT, and TIMESTAMP.

[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size   | Format |
+-------+--------+--------+--------+
| -1    | 1      | 3.08KB | TEXT   |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.03s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+----------+
| Column             | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+----------+
| s_store_sk         | INT       | -1               | -1     | 4        | 4        |
| s_store_id         | STRING    | -1               | -1     | -1       | -1       |
| s_rec_start_date   | TIMESTAMP | -1               | -1     | 16       | 16       |
| s_rec_end_date     | TIMESTAMP | -1               | -1     | 16       | 16       |
| s_closed_date_sk   | INT       | -1               | -1     | 4        | 4        |
| s_store_name       | STRING    | -1               | -1     | -1       | -1       |
| s_number_employees | INT       | -1               | -1     | 4        | 4        |
| s_floor_space      | INT       | -1               | -1     | 4        | 4        |
| s_hours            | STRING    | -1               | -1     | -1       | -1       |
| s_manager          | STRING    | -1               | -1     | -1       | -1       |
| s_market_id        | INT       | -1               | -1     | 4        | 4        |
| s_geography_class  | STRING    | -1               | -1     | -1       | -1       |
| s_market_desc      | STRING    | -1               | -1     | -1       | -1       |
| s_market_manager   | STRING    | -1               | -1     | -1       | -1       |
| s_division_id      | INT       | -1               | -1     | 4        | 4        |
| s_division_name    | STRING    | -1               | -1     | -1       | -1       |
| s_company_id       | INT       | -1               | -1     | 4        | 4        |
| s_company_name     | STRING    | -1               | -1     | -1       | -1       |
| s_street_number    | STRING    | -1               | -1     | -1       | -1       |
| s_street_name      | STRING    | -1               | -1     | -1       | -1       |
| s_street_type      | STRING    | -1               | -1     | -1       | -1       |
| s_suite_number     | STRING    | -1               | -1     | -1       | -1       |
| s_city             | STRING    | -1               | -1     | -1       | -1       |
| s_county           | STRING    | -1               | -1     | -1       | -1       |
| s_state            | STRING    | -1               | -1     | -1       | -1       |
| s_zip              | STRING    | -1               | -1     | -1       | -1       |
| s_country          | STRING    | -1               | -1     | -1       | -1       |
| s_gmt_offset       | FLOAT     | -1               | -1     | 4        | 4        |
| s_tax_precentage   | FLOAT     | -1               | -1     | 4        | 4        |
+--------------------+-----------+------------------+--------+----------+----------+
Returned 29 row(s) in 0.04s

With the Hive ANALYZE TABLE statement for column statistics, you had to specify each column for which to gather statistics. The Impala COMPUTE STATS statement automatically gathers statistics for all columns, because it reads through the entire table relatively quickly and can efficiently compute the values for all the columns. This example shows how after running the COMPUTE STATS statement, statistics are filled in for both the table and all its columns:

[localhost:21000] > compute stats store;
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 1 partition(s) and 29 column(s). |
+------------------------------------------+
Returned 1 row(s) in 1.88s
[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size   | Format |
+-------+--------+--------+--------+
| 12    | 1      | 3.08KB | TEXT   |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+-------------------+
| Column             | Type      | #Distinct Values | #Nulls | Max Size | Avg Size          |
+--------------------+-----------+------------------+--------+----------+-------------------+
| s_store_sk         | INT       | 12               | -1     | 4        | 4                 |
| s_store_id         | STRING    | 6                | -1     | 16       | 16                |
| s_rec_start_date   | TIMESTAMP | 4                | -1     | 16       | 16                |
| s_rec_end_date     | TIMESTAMP | 3                | -1     | 16       | 16                |
| s_closed_date_sk   | INT       | 3                | -1     | 4        | 4                 |
| s_store_name       | STRING    | 8                | -1     | 5        | 4.25              |
| s_number_employees | INT       | 9                | -1     | 4        | 4                 |
| s_floor_space      | INT       | 10               | -1     | 4        | 4                 |
| s_hours            | STRING    | 2                | -1     | 8        | 7.083300113677979 |
| s_manager          | STRING    | 7                | -1     | 15       | 12                |
| s_market_id        | INT       | 7                | -1     | 4        | 4                 |
| s_geography_class  | STRING    | 1                | -1     | 7        | 7                 |
| s_market_desc      | STRING    | 10               | -1     | 94       | 55.5              |
| s_market_manager   | STRING    | 7                | -1     | 16       | 14                |
| s_division_id      | INT       | 1                | -1     | 4        | 4                 |
| s_division_name    | STRING    | 1                | -1     | 7        | 7                 |
| s_company_id       | INT       | 1                | -1     | 4        | 4                 |
| s_company_name     | STRING    | 1                | -1     | 7        | 7                 |
| s_street_number    | STRING    | 9                | -1     | 3        | 2.833300113677979 |
| s_street_name      | STRING    | 12               | -1     | 11       | 6.583300113677979 |
| s_street_type      | STRING    | 8                | -1     | 9        | 4.833300113677979 |
| s_suite_number     | STRING    | 11               | -1     | 9        | 8.25              |
| s_city             | STRING    | 2                | -1     | 8        | 6.5               |
| s_county           | STRING    | 1                | -1     | 17       | 17                |
| s_state            | STRING    | 1                | -1     | 2        | 2                 |
| s_zip              | STRING    | 2                | -1     | 5        | 5                 |
| s_country          | STRING    | 1                | -1     | 13       | 13                |
| s_gmt_offset       | FLOAT     | 1                | -1     | 4        | 4                 |
| s_tax_precentage   | FLOAT     | 5                | -1     | 4        | 4                 |
+--------------------+-----------+------------------+--------+----------+-------------------+
Returned 29 row(s) in 0.04s

The following example shows how statistics are represented for a partitioned table. In this case, we have set up a table to hold the world's most trivial census data, a single STRING field, partitioned by a YEAR column. The table statistics include a separate entry for each partition, plus final totals for the numeric fields. The column statistics include some easily deducible facts for the partitioning column, such as the number of distinct values (the number of partition subdirectories).

localhost:21000] > describe census;
+------+----------+---------+
| name | type     | comment |
+------+----------+---------+
| name | string   |         |
| year | smallint |         |
+------+----------+---------+
Returned 2 row(s) in 0.02s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 0     | 1      | 22B  | TEXT    |
| 2012  | -1    | 1      | 22B  | TEXT    |
| 2013  | -1    | 1      | 231B | PARQUET |
| Total | 0     | 3      | 275B |         |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name   | STRING   | -1               | -1     | -1       | -1       |
| year   | SMALLINT | 7                | -1     | 2        | 2        |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s

The following example shows how the statistics are filled in by a COMPUTE STATS statement in Impala.

[localhost:21000] > compute stats census;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 2.16s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 4     | 1      | 22B  | TEXT    |
| 2012  | 4     | 1      | 22B  | TEXT    |
| 2013  | 1     | 1      | 231B | PARQUET |
| Total | 9     | 3      | 275B |         |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name   | STRING   | 4                | -1     | 5        | 4.5      |
| year   | SMALLINT | 7                | -1     | 2        | 2        |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s

For examples showing how some queries work differently when statistics are available, see Examples of Join Order Optimization. You can see how Impala executes a query differently in each case by observing the EXPLAIN output before and after collecting statistics. Measure the before and after query times, and examine the throughput numbers in before and after SUMMARY or PROFILEoutput, to verify how much the improved plan speeds up performance.