To accurately respond to queries, the Impala node that acts as the coordinator (the node to which you are connected through impala-shell, JDBC, or ODBC) must have current metadata about those databases and tables that are referenced in Impala queries. If you are not familiar with the way Impala uses metadata and how it shares the same metastore database as Hive, see Overview of Impala Metadata and the Metastore for background information.
Use the REFRESH statement to load the latest metastore metadata and block location data for a particular table in these scenarios:
- After loading new data files into the HDFS data directory for the table. (Once you have set up an ETL pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why metadata needs to be refreshed.)
- After issuing ALTER TABLE, INSERT, LOAD DATA, or another table-modifying SQL statement in Hive.
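As a minimal sketch of the second scenario, the following shows a Hive-side change followed by the matching REFRESH in Impala. The table name sales and the staging path are hypothetical, chosen only for illustration:

```sql
-- In the Hive shell: move new data files into an existing table.
-- (The table name and the HDFS path are illustrative assumptions.)
LOAD DATA INPATH '/user/etl/staging/sales' INTO TABLE sales;

-- In impala-shell, on the node where you run queries:
-- reload the metadata for that one table so the new files are visible.
REFRESH sales;
SELECT COUNT(*) FROM sales;
```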
You need to issue the REFRESH statement only on the node to which you connect to issue queries. The coordinator node divides the work among all the Impala nodes in a cluster, and sends read requests for the correct HDFS blocks without relying on the metadata on the other nodes.
REFRESH reloads the metadata for the table from the metastore database, and does an incremental reload of the low-level block location data to account for any new data files added to the HDFS data directory for the table. It is a low-overhead, single-table operation, specifically tuned for the common scenario where new data files are added to HDFS.
Only the metadata for the specified table is flushed. The table must already exist and be known to Impala, either because the CREATE TABLE statement was run in Impala rather than Hive, or because a previous INVALIDATE METADATA statement caused Impala to reload its entire metadata catalog.
In Impala 1.2 and higher, the catalog service broadcasts any metadata changes that result from Impala ALTER TABLE, INSERT, and LOAD DATA statements to all Impala nodes. Thus, the REFRESH statement is only required if you load data through Hive or by manipulating data files in HDFS directly. See The Impala Catalog Service for more information on the catalog service.
In Impala 1.2.1 and higher, another way to avoid inconsistency across nodes is to enable the SYNC_DDL query option before performing a DDL statement or an INSERT or LOAD DATA.
The functionality of the REFRESH statement changed in Impala 1.1: the table name is now a required parameter. To flush the metadata for all tables, use the INVALIDATE METADATA statement.
Because REFRESH table_name only works for tables that Impala is already aware of, when you create a new table in the Hive shell, you must enter INVALIDATE METADATA with no table parameter before you can see the new table in impala-shell. Once the table is known to Impala, you can issue REFRESH table_name as needed after you add more data files for that table.
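The sequence described above can be sketched as follows. The table name web_logs and its columns are hypothetical; only the order of the statements is taken from the text:

```sql
-- In the Hive shell: create a table that Impala does not yet know about.
-- (web_logs and its columns are illustrative assumptions.)
CREATE TABLE web_logs (ts STRING, url STRING);

-- In impala-shell: with no table parameter, INVALIDATE METADATA makes
-- Impala pick up tables created outside Impala.
INVALIDATE METADATA;
SELECT COUNT(*) FROM web_logs;

-- Later, each time more data files are added for the table,
-- the cheaper single-table statement is enough.
REFRESH web_logs;
```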
INVALIDATE METADATA and REFRESH are counterparts. INVALIDATE METADATA defers the reload until the metadata is needed for a subsequent query, but then reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did. The Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, which is why the table name argument is now required.
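To make the contrast concrete, here is a short sketch of both statements applied to a single table. The table name t1 is borrowed from the example later in this topic; the scenarios in the comments restate the text above rather than adding new behavior:

```sql
-- Case 1: only new data files were added to the table's HDFS directory.
-- REFRESH reloads immediately and loads block locations for the new files.
REFRESH t1;

-- Case 2: the data was reorganized more extensively, for example
-- by the HDFS balancer. Discard the cached metadata for the table...
INVALIDATE METADATA t1;

-- ...and the full reload happens when the table is next referenced.
SELECT COUNT(*) FROM t1;
```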
A metadata update for an impalad instance is required if all of the following are true:
- A metadata change occurs.
- The change is made through Hive.
- The change is made to a database to which clients such as the Impala shell or ODBC directly connect.
A metadata update for an Impala node is not required after you run ALTER TABLE, INSERT, or another table-modifying statement in Impala rather than Hive. Impala handles the metadata synchronization automatically through the catalog service.
Database and table metadata is typically modified by:
- Hive - through ALTER, CREATE, DROP or INSERT operations.
- Impalad - through CREATE TABLE, ALTER TABLE, and INSERT operations. In Impala 1.2 and higher, such changes are propagated to all Impala nodes by the Impala catalog service.
REFRESH causes the metadata for the specified table to be reloaded immediately. For a huge table, that process could take a noticeable amount of time, but doing the refresh up front avoids an unpredictable delay later, for example if the next reference to the table is during a benchmark test.
If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed metadata has been received by all the Impala nodes. See SYNC_DDL Query Option for details.
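A minimal sketch of enabling the option in an impala-shell session before a DDL statement follows. The table name sales_staging and its columns are hypothetical:

```sql
-- In impala-shell: make each DDL statement wait until all Impala nodes
-- have received the new or changed metadata.
SET SYNC_DDL=1;

-- sales_staging is an illustrative table name, not one from this topic.
CREATE TABLE sales_staging (id BIGINT, amount DOUBLE);

-- After the statement returns, a query sent through a different
-- Impala node in the same session can already see the table.
SELECT COUNT(*) FROM sales_staging;
```

Because each DDL statement then waits for the metadata to reach every node, the option is best suited to sessions that actually switch between nodes.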
The following example shows how you might use the REFRESH statement after manually adding new HDFS data files to the Impala data directory for a table:
[impalad-host:21000] > refresh t1;
[impalad-host:21000] > refresh t2;
[impalad-host:21000] > select * from t1;
...
[impalad-host:21000] > select * from t2;
...
For more examples of using REFRESH and INVALIDATE METADATA with a combination of Impala and Hive operations, see Switching Back and Forth Between Impala and Hive.
Related impala-shell options:
In Impala 1.0, the -r option of impala-shell issued REFRESH to reload metadata for all tables.
In Impala 1.1 and higher, this option issues INVALIDATE METADATA because REFRESH now requires a table name parameter. Due to the expense of reloading the metadata for all tables, the impala-shell -r option is not recommended for day-to-day use in a production environment.
In Impala 1.2 and higher, the -r option is needed even less frequently, because metadata changes caused by SQL statements in Impala are automatically broadcast to all Impala nodes.
The user ID that the impalad daemon runs under, typically the impala user, must have execute permissions for all the relevant directories holding table data. (A table could have data spread across multiple directories, or in unexpected paths, if it uses partitioning or specifies a LOCATION attribute for individual partitions or the entire table.) Issues with permissions might not cause an immediate error for this statement, but subsequent statements such as SELECT or SHOW TABLE STATS could fail.
The REFRESH statement checks HDFS permissions of the underlying data files and directories, caching this information so that a statement can be cancelled immediately if, for example, the impala user does not have permission to write to the data directory for the table. Impala reports any lack of write permissions as an INFO message in the log file, in case that represents an oversight. If you change HDFS permissions to make data readable or writable by the impala user, issue another REFRESH to make Impala aware of the change.
Amazon S3 considerations:
The REFRESH and INVALIDATE METADATA statements also cache metadata for tables where the data resides in the Amazon Simple Storage Service (S3). In particular, issue a REFRESH for a table after adding or removing files in the associated S3 data directory. See Using Impala to Query the Amazon S3 Filesystem (Unsupported Preview) for details about working with S3 tables.
Cancellation: Cannot be cancelled.