Hive Table Statistics

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

Statistics for Hive can be numbers of rows of tables or partitions and the histograms of interesting columns. Statistics are used by the cost functions of the query optimizer to generate query plans for the purpose of query optimization.

If your cluster has Impala then you can use the Impala implementation to compute statistics. The Impala implementation to compute table statistics is available in CDH 5.0.0 or higher and in Impala version 1.2.2 or higher. The Impala implementation of COMPUTE STATS requires no setup steps and is preferred over the Hive implementation. See Overview of Table Statistics. If you are running an older version of Impala, you can collect statistics on a Hive table by running the following command from a Beeline client connected to HiveServer2:
analyze table <table name> compute statistics;
analyze table <table name> compute statistics for columns <all columns of a table>;