# COMPUTE STATS Statement

Gathers information about volume and distribution of data in a table and all associated columns and partitions. The information is stored in the metastore database, and used by Impala to help optimize queries. For example, if Impala can determine that a table is large or small, or has many or few distinct values it can organize parallelize the work appropriately for a join query or insert operation. For details about the kinds of information gathered by this statement, see Table Statistics.

**Note**

`COMPUTE STATS`counted the number of

`NULL`values in each column and recorded that figure in the metastore database. Because Impala does not currently make use of the

`NULL`count during query planning, Impala 1.4.0 and higher speeds up the

`COMPUTE STATS`statement by skipping this

`NULL`counting.

**Statement type:** DDL

**Usage notes:**

Originally, Impala relied on users to run the Hive `ANALYZE
TABLE` statement, but that method of gathering statistics proved unreliable and difficult to use. The Impala
`COMPUTE STATS` statement is built from the ground up to improve
the reliability and user-friendliness of this operation. `COMPUTE STATS` does not require any
setup steps or special configuration.
You only run a single Impala `COMPUTE STATS` statement
to gather both table and column statistics, rather than separate Hive
`ANALYZE TABLE` statements for each kind of statistics.

**Note**

`COMPUTE STATS`is an important step at the end of your ETL process. Run

`COMPUTE STATS`on all tables as your first step during performance tuning for slow queries, or troubleshooting for out-of-memory conditions:

- Accurate statistics help Impala construct an efficient query plan for join queries, improving performance and reducing memory usage.
- Accurate statistics help Impala distribute the work effectively for insert operations into Parquet tables, improving performance and reducing memory usage.
- Accurate statistics help Impala estimate the memory required for each query, which is important when you use resource management features, such as admission control and the YARN resource management framework. The statistics help Impala to achieve high concurrency, full utilization of available memory, and avoid contention with workloads from other Hadoop components.

For related information, see SHOW Statement, Table Statistics, and Column Statistics.

**HBase considerations:**

`COMPUTE STATS` works for HBase tables also. The statistics gathered for HBase tables
are somewhat different than for HDFS-backed tables, but that metadata is still used for optimization
when HBase tables are involved in join queries.

**Performance considerations:**
The statistics collected by `COMPUTE STATS` are used to optimize join queries
and resource-intensive `INSERT` operations.

**Examples:**

This example shows two tables, `T1` and `T2`,
with a small number distinct values linked by a parent-child relationship
between `T1.ID` and `T2.PARENT`.
`T1` is tiny, while `T2` has approximately 100K rows.
Initially, the statistics includes physical measurements
such as the number of files, the total size, and size measurements
for fixed-length columns such as with the `INT` type.
Unknown values are represented by -1.
After running `COMPUTE STATS` for each table,
much more information is available through the `SHOW STATS`
statements.
If you were running a join query involving both of these tables,
you would need statistics for both tables to get the most
effective optimization for the query.

[localhost:21000] > show table stats t1; Query: show table stats t1 +-------+--------+------+--------+ | #Rows | #Files | Size | Format | +-------+--------+------+--------+ | -1 | 1 | 33B | TEXT | +-------+--------+------+--------+ Returned 1 row(s) in 0.02s [localhost:21000] > show table stats t2; Query: show table stats t2 +-------+--------+----------+--------+ | #Rows | #Files | Size | Format | +-------+--------+----------+--------+ | -1 | 28 | 960.00KB | TEXT | +-------+--------+----------+--------+ Returned 1 row(s) in 0.01s [localhost:21000] > show column stats t1; Query: show column stats t1 +--------+--------+------------------+--------+----------+----------+ | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | +--------+--------+------------------+--------+----------+----------+ | id | INT | -1 | -1 | 4 | 4 | | s | STRING | -1 | -1 | -1 | -1 | +--------+--------+------------------+--------+----------+----------+ Returned 2 row(s) in 1.71s [localhost:21000] > show column stats t2; Query: show column stats t2 +--------+--------+------------------+--------+----------+----------+ | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | +--------+--------+------------------+--------+----------+----------+ | parent | INT | -1 | -1 | 4 | 4 | | s | STRING | -1 | -1 | -1 | -1 | +--------+--------+------------------+--------+----------+----------+ Returned 2 row(s) in 0.01s [localhost:21000] > compute stats t1; Query: compute stats t1 +-----------------------------------------+ | summary | +-----------------------------------------+ | Updated 1 partition(s) and 2 column(s). | +-----------------------------------------+ Returned 1 row(s) in 5.30s [localhost:21000] > show table stats t1; Query: show table stats t1 +-------+--------+------+--------+ | #Rows | #Files | Size | Format | +-------+--------+------+--------+ | 3 | 1 | 33B | TEXT | +-------+--------+------+--------+ Returned 1 row(s) in 0.01s [localhost:21000] > show column stats t1; Query: show column stats t1 +--------+--------+------------------+--------+----------+----------+ | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | +--------+--------+------------------+--------+----------+----------+ | id | INT | 3 | -1 | 4 | 4 | | s | STRING | 3 | -1 | -1 | -1 | +--------+--------+------------------+--------+----------+----------+ Returned 2 row(s) in 0.02s [localhost:21000] > compute stats t2; Query: compute stats t2 +-----------------------------------------+ | summary | +-----------------------------------------+ | Updated 1 partition(s) and 2 column(s). | +-----------------------------------------+ Returned 1 row(s) in 5.70s [localhost:21000] > show table stats t2; Query: show table stats t2 +-------+--------+----------+--------+ | #Rows | #Files | Size | Format | +-------+--------+----------+--------+ | 98304 | 1 | 960.00KB | TEXT | +-------+--------+----------+--------+ Returned 1 row(s) in 0.03s [localhost:21000] > show column stats t2; Query: show column stats t2 +--------+--------+------------------+--------+----------+----------+ | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | +--------+--------+------------------+--------+----------+----------+ | parent | INT | 3 | -1 | 4 | 4 | | s | STRING | 6 | -1 | -1 | -1 | +--------+--------+------------------+--------+----------+----------+ Returned 2 row(s) in 0.01s

**File format considerations:**

The `COMPUTE STATS` statement works with tables created with any of the
file formats supported by Impala. See How Impala Works with Hadoop File Formats
for details about working with the different file formats.
The following considerations apply to `COMPUTE STATS` depending
on the file format of the table.

The `COMPUTE STATS` statement works with text tables with no restrictions.
These tables can be created through either Impala or Hive.

`COMPUTE STATS`statement works with Parquet tables. These tables can be created through either Impala or Hive.

**Note**

`COMPUTE STATS`operation on a Parquet table. As a workaround, issue the command

`SET NUM_SCANNER_THREADS=2`in impala-shell before issuing the

`COMPUTE STATS`statement. Then issue

`UNSET NUM_SCANNER_THREADS`before continuing with queries.

The `COMPUTE STATS` statement works with Avro tables, as long as they
are created with SQL-style column names and types rather than an Avro-style schema
specification. These tables are currently always created through Hive rather than Impala.

The `COMPUTE STATS` statement works with RCFile tables with no restrictions.
These tables can be created through either Impala or Hive.

The `COMPUTE STATS` statement works with SequenceFile tables with no restrictions.
These tables can be created through either Impala or Hive.

The `COMPUTE STATS` statement works with partitioned tables, whether all the
partitions use the same file format, or some partitions are defined through `ALTER TABLE`
to use different file formats.

**Cancellation:** Certain multi-stage statements (`CREATE TABLE AS SELECT` and `COMPUTE STATS`) can be cancelled during some stages, when running `INSERT` or `SELECT` operations internally. To cancel this statement...

**Restrictions:**

Currently, the `COMPUTE STATS` statement under CDH 4 does not store any statistics for
`DECIMAL` columns.
When Impala runs under CDH 5, which has better support for `DECIMAL`
in the metastore database, `COMPUTE STATS` does collect
statistics for `DECIMAL` columns and Impala uses the
statistics to optimize query performance.

<< | ||

Terms and Conditions Privacy Policy |