
Tables

Tables are the primary containers for data in Impala. They have the familiar row and column layout of other database systems, plus features such as partitioning that are often associated with higher-end data warehouse systems.

Logically, each table has a structure based on the definition of its columns, partitions, and other properties.

Physically, each table is associated with a directory in HDFS. The table data consists of all the data files underneath that directory:

  • Internal tables, managed by Impala, use directories inside the designated Impala work area.
  • External tables use arbitrary HDFS directories, where the data files are typically shared between different Hadoop components.
  • Large-scale data is usually handled by partitioned tables, where the data files are divided among different HDFS subdirectories.
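
For example, here is a minimal sketch of a partitioned table (the table and column names are hypothetical):

    -- Data for each (year, month) combination is stored in its own
    -- HDFS subdirectory, such as .../logs/year=2015/month=9/
    CREATE TABLE logs (msg STRING) PARTITIONED BY (year INT, month INT);
    INSERT INTO logs PARTITION (year=2015, month=9) VALUES ('sample message');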

Related statements: CREATE TABLE Statement, DROP TABLE Statement, ALTER TABLE Statement, INSERT Statement, LOAD DATA Statement, SELECT Statement

Internal Tables

The default kind of table produced by the CREATE TABLE statement is known as an internal table. (Its counterpart is the external table, produced by the CREATE EXTERNAL TABLE syntax.)

  • Impala creates a directory in HDFS to hold the data files.
  • You load data by issuing INSERT statements in impala-shell or by using the LOAD DATA statement in Hive.
  • When you issue a DROP TABLE statement, Impala physically removes all the data files from the directory.
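
The following sketch shows this lifecycle end to end (the table name, columns, and warehouse path are illustrative):

    -- Impala creates a directory such as /user/hive/warehouse/t1 in HDFS
    CREATE TABLE t1 (id INT, name STRING);
    -- The INSERT writes data files underneath that directory
    INSERT INTO t1 VALUES (1, 'alpha'), (2, 'beta');
    -- Removes the directory and all the data files inside it
    DROP TABLE t1;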

External Tables

The syntax CREATE EXTERNAL TABLE sets up an Impala table that points at existing data files, potentially in HDFS locations outside the normal Impala data directories. This operation saves the expense of importing the data into a new table when you already have the data files in a known location in HDFS, in the desired file format.

  • You can use Impala to query the data in this table.
  • If you add or replace data using HDFS operations, issue the REFRESH command in impala-shell so that Impala recognizes the changes in data files, block locations, and so on.
  • When you issue a DROP TABLE statement in Impala, that removes the connection that Impala has with the associated data files, but does not physically remove the underlying data. You can continue to use the data files with other Hadoop components and HDFS operations.
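
A minimal sketch of this workflow (the HDFS path and table definition are hypothetical):

    -- Point the new table at data files that already exist in HDFS
    CREATE EXTERNAL TABLE ext_logs (msg STRING)
      LOCATION '/shared/data/logs';
    -- After adding or replacing files through HDFS operations,
    -- make Impala aware of the new files and block locations
    REFRESH ext_logs;
    -- Removes only the table definition; the files under
    -- /shared/data/logs remain in HDFS
    DROP TABLE ext_logs;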