Overview of Impala Tables

Tables are the primary containers for data in Impala. They have the familiar row and column layout of other database systems, plus features such as partitioning that are often associated with higher-end data warehouse systems.

Logically, each table has a structure based on the definition of its columns, partitions, and other properties.

Physically, each table that uses HDFS storage is associated with a directory in HDFS. The table data consists of all the data files underneath that directory:

  • Internal tables are managed by Impala, and use directories inside the designated Impala work area.
  • External tables use arbitrary HDFS directories, where the data files are typically shared between different Hadoop components.
  • Large-scale data is usually handled by partitioned tables, where the data files are divided among different HDFS subdirectories.
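
The following sketch shows one way the three flavors described above might be created (the table names and HDFS path are hypothetical):

-- Internal table: Impala manages a directory in its designated work area.
CREATE TABLE logs_internal (id BIGINT, msg STRING);

-- External table: points at an existing HDFS directory.
CREATE EXTERNAL TABLE logs_external (id BIGINT, msg STRING)
LOCATION '/user/shared/logs';

-- Partitioned table: data files are divided among subdirectories,
-- one per combination of partition key values.
CREATE TABLE logs_partitioned (id BIGINT, msg STRING)
PARTITIONED BY (year INT, month INT);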

Impala tables can also represent data that is stored in HBase, in the Amazon S3 filesystem (CDH 5.4.0 or higher), or on Isilon storage devices (CDH 5.4.3 or higher). See Using Impala to Query HBase Tables, Using Impala to Query the Amazon S3 Filesystem, and Using Impala with Isilon Storage for details about those special kinds of tables.
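
For example, a table over data files already in S3 might be declared as follows (the bucket name and path are hypothetical, and the s3a:// scheme assumes CDH 5.4.0 or higher with S3 access configured):

-- External table whose data files live in an S3 bucket.
CREATE EXTERNAL TABLE s3_events (event_id BIGINT, payload STRING)
LOCATION 's3a://example-bucket/events/';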

Impala queries ignore files with extensions commonly used by Hadoop tools for temporary work files. Any files with the .tmp or .copying extension are not considered part of the Impala table. The suffix matching is case-insensitive, so, for example, Impala ignores both the .copying and .COPYING suffixes.

Related statements: CREATE TABLE Statement, DROP TABLE Statement, ALTER TABLE Statement, INSERT Statement, LOAD DATA Statement, SELECT Statement

Internal Tables

The default kind of table produced by the CREATE TABLE statement is known as an internal table. (Its counterpart is the external table, produced by the CREATE EXTERNAL TABLE syntax.)

  • Impala creates a directory in HDFS to hold the data files.

  • You can create data in internal tables by issuing INSERT or LOAD DATA statements.

  • If you add or replace data using HDFS operations, issue the REFRESH command in impala-shell so that Impala recognizes the changes in data files, block locations, and so on.

  • When you issue a DROP TABLE statement, Impala physically removes all the data files from the directory.

  • To see whether a table is internal or external, and its associated HDFS location, issue the statement DESCRIBE FORMATTED table_name. The Table Type field displays MANAGED_TABLE for internal tables and EXTERNAL_TABLE for external tables. The Location field displays the path of the table directory as an HDFS URI.

  • When you issue an ALTER TABLE statement to rename an internal table, all data files are moved into the new HDFS directory for the table. The files are moved even if they were formerly in a directory outside the Impala data directory, for example in an internal table with a LOCATION attribute pointing to an outside HDFS directory.

Examples:

You can switch a table from internal to external, or from external to internal, by using the ALTER TABLE statement:
-- Switch a table from internal to external.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');

-- Switch a table from external to internal.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');
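
The following sketch combines the internal table operations described above (the table names are hypothetical):

-- Create an internal table; Impala creates and manages its HDFS directory.
CREATE TABLE t1 (x INT);
INSERT INTO t1 VALUES (1), (2), (3);

-- Confirm the table type; the output shows MANAGED_TABLE for internal tables.
DESCRIBE FORMATTED t1;

-- After adding files with HDFS operations, make Impala aware of the changes.
REFRESH t1;

-- Renaming moves the data files into the new HDFS directory for the table.
ALTER TABLE t1 RENAME TO t2;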

Related information:

External Tables, CREATE TABLE Statement, DROP TABLE Statement, ALTER TABLE Statement, DESCRIBE Statement

External Tables

The syntax CREATE EXTERNAL TABLE sets up an Impala table that points at existing data files, potentially in HDFS locations outside the normal Impala data directories. This operation saves the expense of importing the data into a new table when you already have the data files in a known location in HDFS, in the desired file format.

  • You can use Impala to query the data in this table.

  • You can create data in external tables by issuing INSERT or LOAD DATA statements.

  • If you add or replace data using HDFS operations, issue the REFRESH command in impala-shell so that Impala recognizes the changes in data files, block locations, and so on.

  • When you issue a DROP TABLE statement in Impala, it removes the connection between Impala and the associated data files, but does not physically remove the underlying data. You can continue to use the data files with other Hadoop components and HDFS operations.

  • To see whether a table is internal or external, and its associated HDFS location, issue the statement DESCRIBE FORMATTED table_name. The Table Type field displays MANAGED_TABLE for internal tables and EXTERNAL_TABLE for external tables. The Location field displays the path of the table directory as an HDFS URI.

  • When you issue an ALTER TABLE statement to rename an external table, all data files are left in their original locations.

  • You can point multiple external tables at the same HDFS directory by using the same LOCATION attribute for each one. The tables could have different column definitions, as long as the number and types of columns are compatible with the schema evolution considerations for the underlying file type. For example, for text data files, one table might define a certain column as a STRING while another defines the same column as a BIGINT.

Examples:

You can switch a table from internal to external, or from external to internal, by using the ALTER TABLE statement:
-- Switch a table from internal to external.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');

-- Switch a table from external to internal.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');
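
The following sketch shows two external tables sharing one HDFS directory of text data files, with the same column defined as different but compatible types (the table names and path are hypothetical):

-- Both tables read the same underlying data files.
CREATE EXTERNAL TABLE readings_str (sensor_id STRING, reading STRING)
LOCATION '/user/shared/readings';

CREATE EXTERNAL TABLE readings_num (sensor_id STRING, reading BIGINT)
LOCATION '/user/shared/readings';

-- Dropping an external table leaves the data files in place.
DROP TABLE readings_str;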

Related information:

Internal Tables, CREATE TABLE Statement, DROP TABLE Statement, ALTER TABLE Statement, DESCRIBE Statement

File Formats

Each table has an associated file format, which determines how Impala interprets the associated data files. See How Impala Works with Hadoop File Formats for details.

You set the file format during the CREATE TABLE statement, or change it later using the ALTER TABLE statement. Partitioned tables can have a different file format for individual partitions, allowing you to change the file format used in your ETL process for new data without going back and reconverting all the existing data in the same table.
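
For example, a table might adopt Parquet for data written from now on while an older partition keeps its original text format (the table name and partition are hypothetical, and the partition is assumed to already exist):

-- New data files for the table use Parquet.
ALTER TABLE sales SET FILEFORMAT PARQUET;

-- An existing partition keeps its text data files.
ALTER TABLE sales PARTITION (year=2014) SET FILEFORMAT TEXTFILE;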

Any INSERT statements produce new data files with the current file format of the table. Changing the file format of a table does not automatically convert the existing data files. You must use TRUNCATE TABLE or INSERT OVERWRITE to remove any previous data files that use the old file format. Then you use the LOAD DATA statement, INSERT ... SELECT, or another mechanism to put data files of the correct format into the table.
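
A sketch of that conversion workflow follows (the table names are hypothetical; TRUNCATE TABLE requires CDH 5.5 / Impala 2.3 or higher):

-- Switch the file format used for data written from now on.
ALTER TABLE events SET FILEFORMAT PARQUET;

-- Remove the data files that are still in the old format.
TRUNCATE TABLE events;

-- Repopulate the table; this INSERT writes Parquet data files.
INSERT INTO events SELECT * FROM events_staging;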

The default file format, text, is the most flexible and the easiest to produce when you are just getting started with Impala. The Parquet file format offers the highest query performance and uses compression to reduce storage requirements; therefore, Cloudera recommends Parquet for Impala tables with substantial amounts of data. Also, the complex types (ARRAY, STRUCT, and MAP) available in CDH 5.5 / Impala 2.3 and higher are currently supported only with the Parquet file format. Depending on your existing ETL workflow, you might use other file formats such as Avro, possibly with a final conversion step to Parquet to take advantage of its performance for analytic queries.
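
For example, an existing text table might be converted to Parquet in a single step with CREATE TABLE AS SELECT (the table names are hypothetical):

-- Create a Parquet copy of a text table in one statement.
CREATE TABLE analytics_parquet STORED AS PARQUET
AS SELECT * FROM analytics_text;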