Impala SQL Language Elements

The Impala SQL dialect supports a range of standard elements, plus some extensions for Big Data use cases related to data loading and data warehousing.

  Note:

In early Impala beta releases, a semicolon was optional at the end of each statement in the impala-shell interpreter. Now that the impala-shell interpreter supports multi-line commands, making it easy to copy and paste code from script files, the semicolon at the end of each statement is required.

Here are the major statements, clauses, and other SQL language elements that you work with in Impala:

ALTER TABLE Statement

The ALTER TABLE statement changes the structure or properties of an existing table. In Impala, this is a logical operation that updates the table metadata in the metastore database that Impala shares with Hive; ALTER TABLE does not actually rewrite, move, or otherwise modify the actual data files. Thus, you might need to perform corresponding physical filesystem operations, such as moving data files to a different HDFS directory, rewriting the data files to include extra fields, or converting them to a different file format.

To rename a table:

ALTER TABLE old_name RENAME TO new_name;

For internal tables, this operation physically renames the directory within HDFS that contains the data files; the original directory name no longer exists. By qualifying the table names with database names, you can use this technique to move an internal table (and its associated data directory) from one database to another. For example:

create database d1;
create database d2;
create database d3;
use d1;
create table mobile (x int);
use d2;
-- Move table from another database to the current one.
alter table d1.mobile rename to mobile;
use d1;
-- Move table from one database to another.
alter table d2.mobile rename to d3.mobile;

To change the physical location where Impala looks for data files associated with a table:

ALTER TABLE table_name SET LOCATION 'hdfs_path_of_directory';

The path you specify is the full HDFS path where the data files reside, or will be created. Impala does not create any additional subdirectory named after the table. Impala does not move any data files to this new location or change any data files that might already exist in that directory.
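For example, a minimal sketch (the table name and path are hypothetical):

alter table sales_data set location '/user/impala/sales_2013';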

To reorganize columns for a table:

ALTER TABLE table_name ADD COLUMNS (column_defs);
ALTER TABLE table_name REPLACE COLUMNS (column_defs);
ALTER TABLE table_name CHANGE column_name new_name new_spec;
ALTER TABLE table_name DROP column_name;

The column_spec is the same as in the CREATE TABLE statement: the column name, then its data type, then an optional comment. You can add multiple columns at a time. The parentheses are required whether you add a single column or multiple columns. When you replace columns, all the original column definitions are discarded. You might use this technique if you receive a new set of data files with different data types or columns in a different order. (The data files are retained, so if the new columns are incompatible with the old ones, use INSERT OVERWRITE or LOAD DATA OVERWRITE to replace all the data before issuing any further queries.)

You might use the CHANGE clause to rename a single column, or to treat an existing column as a different type than before, such as to switch between treating a column as STRING and TIMESTAMP, or between INT and BIGINT. You can only drop a single column at a time; to drop multiple columns, issue multiple ALTER TABLE statements, or define the new set of columns with a single ALTER TABLE ... REPLACE COLUMNS statement.
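For example, here is a hypothetical sequence that exercises these clauses (the table and column names are illustrative only):

-- Add a new column at the end of the table.
alter table t1 add columns (c5 string comment 'new column');
-- Rename a column and treat its values as a different type.
alter table t1 change c1 c1_string string;
-- Remove a column that is no longer needed.
alter table t1 drop c4;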

To change the file format that Impala expects table data to be in:

ALTER TABLE table_name SET FILEFORMAT { PARQUETFILE | RCFILE | SEQUENCEFILE | TEXTFILE }

Because this operation only changes the table metadata, you must do any conversion of existing data using regular Hadoop techniques outside of Impala. Any new data created by the Impala INSERT statement will be in the new format. You cannot specify the delimiter for Text files; the data files must be comma-delimited. Although Impala can read Avro tables created through Hive, you cannot specify the Avro file format in an Impala ALTER TABLE statement.
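For example, a minimal sketch (the table name is hypothetical); any conversion of existing data files happens outside of Impala:

alter table t1 set fileformat parquetfile;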

To add or drop partitions for a table, the table must already be partitioned (that is, created with a PARTITIONED BY clause). The partition is a physical directory in HDFS, with a name that encodes a particular column value (the partition key). The Impala INSERT statement already creates the partition if necessary, so the ALTER TABLE ... ADD PARTITION is primarily useful for importing data by moving or copying existing data files into the HDFS directory corresponding to a partition. The DROP PARTITION clause is used to remove the HDFS directory and associated data files for a particular set of partition key values; for example, if you always analyze the last 3 months' worth of data, at the beginning of each month you might drop the oldest partition that is no longer needed. Removing partitions reduces the amount of metadata associated with the table and the complexity of calculating the optimal query plan, which can simplify and speed up queries on partitioned tables, particularly join queries. Here is an example showing the ADD PARTITION and DROP PARTITION clauses.

-- Create an empty table and define the partitioning scheme.
create table part_t (x int) partitioned by (month string);
-- Create an empty partition into which we could copy data files from some other source.
alter table part_t add partition (month='January');
-- After changing the underlying data, issue a REFRESH statement to make the data visible in Impala.
refresh part_t;
-- Later, do the same for the next month.
alter table part_t add partition (month='February');
-- Now we no longer need the older data.
alter table part_t drop partition (month='January');
-- If the table was partitioned by month and year, we would issue a statement like:
-- alter table part_t drop partition (year=2003,month='January');
-- which would require 12 ALTER TABLE statements to remove a year's worth of data.

The value specified for a partition key can be an arbitrary constant expression, without any references to columns. For example:

alter table time_data add partition (month=concat('Decem','ber'));
alter table sales_data add partition (zipcode = cast(9021 * 10 as string));

Whenever you specify partitions in an ALTER TABLE statement, you must include all the partitioning columns in the specification.
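For example, for a table partitioned by both year and month, a sketch of a valid specification (the table name is hypothetical):

-- Both partition key columns must appear, even to identify a single partition.
alter table census add partition (year=2013, month='December');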

Most of the preceding operations work the same for internal tables (managed by Impala) as for external tables (with data files located in arbitrary locations). The exception is renaming a table; for an external table, the underlying data directory is not renamed or moved.

  Note:

An alternative way to reorganize a table and its associated data files is to use CREATE TABLE to create a variation of the original table, then use INSERT to copy the transformed or reordered data to the new table. The advantage of ALTER TABLE is that it avoids making a duplicate copy of the data files, allowing you to reorganize huge volumes of data in a space-efficient way using familiar Hadoop techniques.

ALTER VIEW Statement

Changes the query associated with a view, or the associated database and/or name of the view.

Because a view is purely a logical construct (an alias for a query) with no physical data behind it, ALTER VIEW only involves changes to metadata in the metastore database, not any data files in HDFS.

ALTER VIEW [database_name.]view_name AS select_statement
ALTER VIEW [database_name.]view_name RENAME TO [database_name.]view_name

Examples:

create table t1 (x int, y int, s string);
create table t2 like t1;
create view v1 as select * from t1;
alter view v1 as select * from t2;
alter view v1 as select x, upper(s) from t2;

To see the definition of a view, issue a DESCRIBE FORMATTED statement, which shows the query from the original CREATE VIEW statement:

[localhost:21000] > create view v1 as select * from t1;
[localhost:21000] > describe formatted v1;
Query finished, fetching results ...
+------------------------------+------------------------------+----------------------+
| name                         | type                         | comment              |
+------------------------------+------------------------------+----------------------+
| # col_name                   | data_type                    | comment              |
|                              | NULL                         | NULL                 |
| x                            | int                          | None                 |
| y                            | int                          | None                 |
| s                            | string                       | None                 |
|                              | NULL                         | NULL                 |
| # Detailed Table Information | NULL                         | NULL                 |
| Database:                    | views                        | NULL                 |
| Owner:                       | cloudera                     | NULL                 |
| CreateTime:                  | Mon Jul 08 15:56:27 EDT 2013 | NULL                 |
| LastAccessTime:              | UNKNOWN                      | NULL                 |
| Protect Mode:                | None                         | NULL                 |
| Retention:                   | 0                            | NULL                 |
| Table Type:                  | VIRTUAL_VIEW                 | NULL                 |
| Table Parameters:            | NULL                         | NULL                 |
|                              | transient_lastDdlTime        | 1373313387           |
|                              | NULL                         | NULL                 |
| # Storage Information        | NULL                         | NULL                 |
| SerDe Library:               | null                         | NULL                 |
| InputFormat:                 | null                         | NULL                 |
| OutputFormat:                | null                         | NULL                 |
| Compressed:                  | No                           | NULL                 |
| Num Buckets:                 | 0                            | NULL                 |
| Bucket Columns:              | []                           | NULL                 |
| Sort Columns:                | []                           | NULL                 |
|                              | NULL                         | NULL                 |
| # View Information           | NULL                         | NULL                 |
| View Original Text:          | SELECT * FROM t1             | NULL                 |
| View Expanded Text:          | SELECT * FROM t1             | NULL                 |
+------------------------------+------------------------------+----------------------+
Returned 29 row(s) in 0.05s

AVG Function

An aggregation function that returns the average value from a set of numbers. Its single argument can be a numeric column, or the numeric result of a function or expression applied to the column value. Rows with a NULL value for the specified column are ignored. If the table is empty, or all the values supplied to AVG are NULL, AVG returns NULL.

When the query contains a GROUP BY clause, returns one value for each combination of grouping values.

Return type: DOUBLE

Examples:

-- Average all the non-NULL values in a column.
insert overwrite avg_t values (2),(4),(6),(null),(null);
-- The average of the above values is 4: (2+4+6) / 3. The 2 NULL values are ignored.
select avg(x) from avg_t;
-- Average only certain values from the column.
select avg(x) from t1 where month = 'January' and year = '2013';
-- Apply a calculation to the value of the column before averaging.
select avg(x/3) from t1;
-- Apply a function to the value of the column before averaging.
-- Here we are substituting a value of 0 for all NULLs in the column,
-- so that those rows do factor into the return value.
select avg(isnull(x,0)) from t1;
-- Apply some number-returning function to a string column and average the results.
-- If column s contains any NULLs, length(s) also returns NULL and those rows are ignored.
select avg(length(s)) from t1;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Return more than one result.
select month, year, avg(page_visits) from web_stats group by month, year;
-- Filter the input to eliminate duplicates before performing the calculation.
select avg(distinct x) from t1;
-- Filter the output after performing the calculation.
select avg(x) from t1 group by y having avg(x) between 1 and 20;

BETWEEN Operator

In a WHERE clause, compares an expression to both a lower and upper bound. The comparison is successful if the expression is greater than or equal to the lower bound, and less than or equal to the upper bound. If the bound values are switched, so that the lower bound is greater than the upper bound, the operator does not match any values.

Syntax: expression BETWEEN lower_bound AND upper_bound

Data types: Typically used with numeric data types. Works with any data type, although not very practical for BOOLEAN values. (BETWEEN false AND true will match all BOOLEAN values.) Use CAST() if necessary to ensure the lower and upper bound values are compatible types. Call string or date/time functions if necessary to extract or transform the relevant portion to compare, especially if the value can be transformed into a number.

Usage notes: Be careful when using short string operands. A longer string that starts with the upper bound value will not be included, because it is considered greater than the upper bound. For example, BETWEEN 'A' and 'M' would not match the string value 'Midway'. Use functions such as upper(), lower(), substr(), trim(), and so on if necessary to ensure the comparison works as expected.

Examples:

-- Retrieve data for January through June, inclusive.
select c1 from t1 where month between 1 and 6;

-- Retrieve data for names beginning with 'A' through 'M' inclusive.
-- Only test the first letter to ensure all the values starting with 'M' are matched.
-- Do a case-insensitive comparison to match names with various capitalization conventions.
select last_name from customers where upper(substr(last_name,1,1)) between 'A' and 'M';

-- Retrieve data for only the first week of each month.
select count(distinct visitor_id) from web_traffic where dayofmonth(when_viewed) between 1 and 7;

BIGINT Data Type

An 8-byte integer data type used in CREATE TABLE and ALTER TABLE statements.

Range: -9223372036854775808 .. 9223372036854775807. There is no UNSIGNED subtype.

Conversions: Impala automatically converts BIGINT to a floating-point type (FLOAT or DOUBLE). Use CAST() to convert to TINYINT, SMALLINT, INT, STRING, or TIMESTAMP. Casting an integer value N to TIMESTAMP produces a value that is N seconds past the start of the epoch date (January 1, 1970).
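For example, casting an integer to TIMESTAMP interprets the value as a number of seconds past the epoch:

-- 1 billion seconds after January 1, 1970 (2001-09-09 01:46:40 UTC).
select cast(1000000000 as timestamp);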

Related information: Mathematical Functions

BOOLEAN Data Type

A data type used in CREATE TABLE and ALTER TABLE statements, representing a single true/false choice.

Range: TRUE or FALSE. Do not use quotation marks around the TRUE and FALSE literal values. You can write the literal values in uppercase, lowercase, or mixed case. The values queried from a table are always returned in lowercase, true or false.

Conversions: Impala does not automatically convert any other type to BOOLEAN. You can use CAST() to convert any integer or floating-point type to BOOLEAN: a value of 0 represents false, and any non-zero value is converted to true. You cannot cast a STRING value to BOOLEAN, although you can cast a BOOLEAN value to STRING, returning '1' for true values and '0' for false values.
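For example:

-- Zero converts to false; any non-zero value converts to true.
select cast(0 as boolean), cast(5 as boolean), cast(-1.5 as boolean);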

Related information: Conditional Functions

Comments

Impala supports the familiar styles of SQL comments:

  • All text from a -- sequence to the end of the line is considered a comment and ignored. This type of comment can occur on a single line by itself, or after all or part of a statement.
  • All text from a /* sequence to the next */ sequence is considered a comment and ignored. This type of comment can stretch over multiple lines. This type of comment can occur on one or more lines by itself, in the middle of a statement, or before or after a statement.

For example:

-- This line is a comment about a table.
create table ...;

/*
This is a multi-line comment about a query.
*/
select ...;

select * from t /* This is an embedded comment about a query. */ where ...;

select * from t -- This is a trailing comment within a multi-line command.
  where ...;

COUNT Function

An aggregation function that returns the number of rows, or the number of non-NULL rows, that meet certain conditions. The notation COUNT(*) includes NULL values in the total. The notation COUNT(column_name) only considers rows where the column contains a non-NULL value. You can also combine COUNT with the DISTINCT operator to eliminate duplicates before counting, and to count the combinations of values across multiple columns.

When the query contains a GROUP BY clause, returns one value for each combination of grouping values.

Return type: BIGINT

Examples:

-- How many rows total are in the table, regardless of NULL values?
select count(*) from t1;
-- How many rows are in the table with non-NULL values for a column?
select count(c1) from t1;
-- Count the rows that meet certain conditions.
-- Again, * includes NULLs, so COUNT(*) might be greater than COUNT(col).
select count(*) from t1 where x > 10;
select count(c1) from t1 where x > 10;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Combine COUNT and DISTINCT to find the number of unique values.
-- Must use column names rather than * with COUNT(DISTINCT ...) syntax.
-- Rows with NULL values are not counted.
select count(distinct c1) from t1;
-- Rows with a NULL value in _either_ column are not counted.
select count(distinct c1, c2) from t1;
-- Return more than one result.
select month, year, count(distinct visitor_id) from web_stats group by month, year;

CREATE DATABASE Statement

In Impala, a database is both:

  • A logical construct for grouping together related tables within their own namespace. You might use a separate database for each application, set of related tables, or round of experimentation.
  • A physical construct represented by a directory tree in HDFS. Tables (internal tables), partitions, and data files are all located under this directory. You can back it up, measure space usage, or remove it (if it is empty) with a DROP DATABASE statement.

The syntax for creating a database is as follows:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT 'database_comment']
  [LOCATION hdfs_path];

A database is physically represented as a directory in HDFS, with a filename extension .db, under the main Impala data directory. If the associated HDFS directory does not exist, it is created for you. All databases and their associated directories are top-level objects, with no physical or logical nesting.

After creating a database, to make it the current database within an impala-shell session, use the USE statement. You can refer to tables in the current database without prepending any qualifier to their names.

When you first connect to Impala through impala-shell, the database you start in (before issuing any CREATE DATABASE or USE statements) is named default.

After creating a database, your impala-shell session or another impala-shell connected to the same node can immediately access that database. To access the database through the Impala daemon on a different node, issue the INVALIDATE METADATA statement first while connected to that other node.

Examples:

create database first;
use first;
create table t1 (x int);

create database second;
use second;
-- Each database has its own namespace for tables.
-- You can reuse the same table names in each database.
create table t1 (s string);

create database temp;
-- You do not have to USE a database after creating it.
-- Just qualify the table name with the name of the database.
create table temp.t2 (x int, y int);
use temp;
create table t3 (s string);
-- You cannot drop a database while it is selected by the USE statement.
drop database temp;
ERROR: AnalysisException: Cannot drop current default database: temp
-- The always-available database 'default' is a convenient one to USE.
use default;
-- Dropping the database is a fast way to drop all the tables within it.
drop database temp;

CREATE TABLE Statement

The syntax for creating a table is as follows:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT 'col_comment'], ...)]
  [COMMENT 'table_comment']
  [PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
  [
   [ROW FORMAT row_format] [STORED AS file_format]
  ]
  [LOCATION 'hdfs_path']

data_type
  : primitive_type

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | STRING
  | TIMESTAMP

row_format
  : DELIMITED [FIELDS TERMINATED BY 'char' [ESCAPED BY 'char']]
    [LINES TERMINATED BY 'char']

file_format
  : PARQUETFILE
  | SEQUENCEFILE
  | TEXTFILE
  | RCFILE

By default, Impala creates an "internal" table, where Impala manages the underlying data files for the table, and physically deletes the data files when you drop the table. If you specify the EXTERNAL clause, Impala treats the table as an "external" table, where the data files are typically produced outside Impala, and Impala leaves the data files in place when you drop the table.

The PARTITIONED BY clause divides the data files based on the values from one or more specified columns. Impala queries can use the partition metadata to minimize the amount of data that is read from disk or transmitted across the network, particularly during join queries. For details about partitioning, see Partitioning.

The STORED AS clause identifies the format of the underlying data files. Currently, Impala can query more types of file formats than it can create or insert into. Use Hive to perform any create or data load operations that are not currently available in Impala. For example, Impala can query an Avro table created by Hive but cannot create an Avro table; Impala can create a SequenceFile table but cannot insert data into it. There are also Impala-specific procedures for using compression with each kind of file format. For details about working with data files of various formats, see How Impala Works with Hadoop File Formats.

By default (when no STORED AS clause is specified), Impala tables are created as Text files with Ctrl-A characters as the delimiter. Specify the ROW FORMAT clause to produce or ingest data files that use a different delimiter character such as tab or |, or a different line end character such as carriage return or linefeed. When specifying delimiter and line end characters, use '\t' for tab, '\n' for newline or linefeed, and '\r' for carriage return.

The ESCAPED BY clause applies both to text files that you create through an INSERT statement to an Impala TEXTFILE table, and to existing data files that you put into an Impala table directory. (You can ingest existing data files either by creating the table with CREATE EXTERNAL TABLE ... LOCATION, the LOAD DATA statement, or through an HDFS operation such as hdfs dfs -put file hdfs_path.) Choose an escape character that is not used anywhere else in the file, and put it in front of each instance of the delimiter character that occurs within a field value. Surrounding field values with quotation marks does not help Impala to parse fields with embedded delimiter characters; the quotation marks are considered to be part of the column value. If you want to use \ as the escape character, specify the clause in impala-shell as ESCAPED BY '\\'.
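For example, a minimal sketch of a comma-delimited table that uses backslash escapes (the table and column names are hypothetical):

create table csv_data (c1 string, c2 string)
  row format delimited fields terminated by ',' escaped by '\\'
  stored as textfile;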

To create a table with the same columns, comments, and other attributes as another table, use the following variation. The CREATE TABLE ... LIKE form allows a restricted set of clauses, currently only the LOCATION, COMMENT, and STORED AS clauses. Because CREATE TABLE ... LIKE only manipulates table metadata, not the physical data of the table, issue INSERT INTO TABLE statements afterward to copy any data from the original table into the new one, optionally converting the data to a new file format. (For some file formats, Impala can do a CREATE TABLE ... LIKE to create the table, but Impala cannot insert data in that file format; in these cases, you must load the data in Hive. See How Impala Works with Hadoop File Formats for details.)

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE [db_name.]table_name
  [COMMENT 'table_comment']
  [STORED AS file_format]
  [LOCATION 'hdfs_path']
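
For example, a hypothetical sketch that clones the column definitions of a table and then copies its data into Parquet format:

create table t1_parquet like t1 stored as parquetfile;
insert into table t1_parquet select * from t1;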

To see the column definitions and column comments for an existing table, issue the statement DESCRIBE table_name. To see even more detail, such as the location of data files and the values for clauses such as ROW FORMAT and STORED AS, issue the statement DESCRIBE FORMATTED table_name.

After creating a table, your impala-shell session or another impala-shell connected to the same node can immediately query that table. To query the table through the Impala daemon on a different node, issue the INVALIDATE METADATA statement first while connected to that other node.

Impala queries can make use of metadata about the table and columns, such as the number of rows in a table or the number of different values in a column. Currently, to create this metadata, you issue the ANALYZE TABLE statement in Hive to gather this information, after creating the table and loading representative data into it.

  Note:

The Impala CREATE TABLE statement cannot create an HBase table, because it currently does not support the TBLPROPERTIES clause needed for HBase tables. Create such tables in Hive, then query them through Impala. For information on using Impala with HBase tables, see Using Impala to Query HBase Tables.

CREATE VIEW Statement

The CREATE VIEW statement lets you create a shorthand abbreviation for a more complicated query. The base query can involve joins, expressions, reordered columns, column aliases, and other SQL features that can make a query hard to understand or maintain.

Because a view is purely a logical construct (an alias for a query) with no physical data behind it, CREATE VIEW only involves changes to metadata in the metastore database, not any data files in HDFS.

CREATE VIEW view_name [(column_list)]
  AS select_statement

The CREATE VIEW statement can be useful in scenarios such as the following:

  • To turn even the most lengthy and complicated SQL query into a one-liner. You can issue simple queries against the view from applications, scripts, or interactive queries in impala-shell. For example:
    select * from view_name;
    select * from view_name order by c1 desc limit 10;
    
    The more complicated and hard-to-read the original query, the more benefit there is to simplifying the query using a view.
  • To hide the underlying table and column names, to minimize maintenance problems if those names change. In that case, you re-create the view using the new names, and all queries that use the view rather than the underlying tables keep running with no changes.
  • To experiment with optimization techniques and make the optimized queries available to all applications. For example, if you find a combination of WHERE conditions, join order, join hints, and so on that works the best for a class of queries, you can establish a view that incorporates the best-performing techniques. Applications can then make relatively simple queries against the view, without repeating the complicated and optimized logic over and over. If you later find a better way to optimize the original query, when you re-create the view, all the applications immediately take advantage of the optimized base query.
  • To simplify a whole class of related queries, especially complicated queries involving joins between multiple tables, complicated expressions in the column list, and other SQL syntax that makes the query difficult to understand and debug. For example, you might create a view that joins several tables, filters using several WHERE conditions, and selects several columns from the result set. Applications might issue queries against this view that only vary in their LIMIT, ORDER BY, and similar simple clauses.

For queries that require repeating complicated clauses over and over again, for example in the select list, ORDER BY, and GROUP BY clauses, you can use the WITH clause as an alternative to creating a view.
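For example, a hypothetical sketch of a WITH clause standing in for a one-off view:

with best_sellers as
  (select item, sum(quantity) as total_quantity from sales group by item)
select * from best_sellers order by total_quantity desc limit 10;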

Examples:

create view v1 as select * from t1;
create view v2 as select c1, c3, c7 from t1;
create view v3 as select c1, cast(c3 as string) c3, concat(c4,c5) c5, trim(c6) c6, "Constant" c8 from t1;
create view v4 as select t1.c1, t2.c2 from t1 join t2 on (t1.id=t2.id);
create view some_db.v5 as select * from some_other_db.t1;

DESCRIBE Statement

The DESCRIBE statement displays metadata about a table, such as the column names and their data types. Its syntax is:

DESCRIBE [FORMATTED] table

You can use the abbreviation DESC for the DESCRIBE statement.

The DESCRIBE FORMATTED variation displays additional information, in a format familiar to users of Apache Hive. The extra information includes low-level details such as whether the table is internal or external, when it was created, the file format, whether or not the data is compressed, the location of the data in HDFS, whether the object is a table or a view, and (for views) the text of the query from the view definition.

Usage notes:

After the impalad daemons are restarted, the first query against a table can take longer than subsequent queries, because the metadata for the table is loaded before the query is processed. This one-time delay for each table can cause misleading results in benchmark tests or cause unnecessary concern. To "warm up" the Impala metadata cache, you can issue a DESCRIBE statement in advance for each table you intend to access later.
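For example (hypothetical table names):

-- Issue in advance so that later benchmark queries are not skewed
-- by the one-time metadata load.
describe web_stats;
describe customers;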

When you are dealing with data files stored in HDFS, sometimes it is important to know details such as the path of the data files for an Impala table, and the host name for the namenode. You can get this information from the DESCRIBE FORMATTED output. You specify HDFS URIs or path specifications with statements such as LOAD DATA and the LOCATION clause of CREATE TABLE or ALTER TABLE. You might also use HDFS URIs or paths with Linux commands such as hadoop and hdfs to copy, rename, and so on, data files in HDFS.

Examples:

The following example shows the results of both a standard DESCRIBE and DESCRIBE FORMATTED for different kinds of schema objects:

  • DESCRIBE for a table or a view returns the name, type, and comment for each of the columns. For a view, if the column value is computed by an expression, the column name is automatically generated as _c0, _c1, and so on depending on the ordinal number of the column.
  • A table created with no special format or storage clauses is designated as a MANAGED_TABLE (an "internal table" in Impala terminology). Its data files are stored in an HDFS directory under the default Hive data directory. By default, it uses Text data format.
  • A view is designated as VIRTUAL_VIEW in DESCRIBE FORMATTED output. Some of its properties are NULL or blank because they are inherited from the base table. The text of the query that defines the view is part of the DESCRIBE FORMATTED output.
  • A table with additional clauses in the CREATE TABLE statement has differences in DESCRIBE FORMATTED output. The output for T2 includes the EXTERNAL_TABLE keyword because of the CREATE EXTERNAL TABLE syntax, and different InputFormat and OutputFormat fields to reflect the Parquet file format.
[localhost:21000] > create table t1 (x int, y int, s string);
Query: create table t1 (x int, y int, s string)
[localhost:21000] > describe t1;
Query: describe t1
Query finished, fetching results ...
+------+--------+---------+
| name | type   | comment |
+------+--------+---------+
| x    | int    |         |
| y    | int    |         |
| s    | string |         |
+------+--------+---------+
Returned 3 row(s) in 0.13s
[localhost:21000] > describe formatted t1;
Query: describe formatted t1
Query finished, fetching results ...
+------------------------------+--------------------------------------------------------------------+----------------------+
| name                         | type                                                               | comment              |
+------------------------------+--------------------------------------------------------------------+----------------------+
| # col_name                   | data_type                                                          | comment              |
|                              | NULL                                                               | NULL                 |
| x                            | int                                                                | None                 |
| y                            | int                                                                | None                 |
| s                            | string                                                             | None                 |
|                              | NULL                                                               | NULL                 |
| # Detailed Table Information | NULL                                                               | NULL                 |
| Database:                    | describe_formatted                                                 | NULL                 |
| Owner:                       | cloudera                                                           | NULL                 |
| CreateTime:                  | Mon Jul 22 17:03:16 EDT 2013                                       | NULL                 |
| LastAccessTime:              | UNKNOWN                                                            | NULL                 |
| Protect Mode:                | None                                                               | NULL                 |
| Retention:                   | 0                                                                  | NULL                 |
| Location:                    | hdfs://127.0.0.1:8020/user/hive/warehouse/describe_formatted.db/t1 | NULL                 |
| Table Type:                  | MANAGED_TABLE                                                      | NULL                 |
| Table Parameters:            | NULL                                                               | NULL                 |
|                              | transient_lastDdlTime                                              | 1374526996           |
|                              | NULL                                                               | NULL                 |
| # Storage Information        | NULL                                                               | NULL                 |
| SerDe Library:               | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                 | NULL                 |
| InputFormat:                 | org.apache.hadoop.mapred.TextInputFormat                           | NULL                 |
| OutputFormat:                | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat         | NULL                 |
| Compressed:                  | No                                                                 | NULL                 |
| Num Buckets:                 | 0                                                                  | NULL                 |
| Bucket Columns:              | []                                                                 | NULL                 |
| Sort Columns:                | []                                                                 | NULL                 |
+------------------------------+--------------------------------------------------------------------+----------------------+
Returned 26 row(s) in 0.03s
[localhost:21000] > create view v1 as select x, upper(s) from t1;
Query: create view v1 as select x, upper(s) from t1
[localhost:21000] > describe v1;
Query: describe v1
Query finished, fetching results ...
+------+--------+---------+
| name | type   | comment |
+------+--------+---------+
| x    | int    |         |
| _c1  | string |         |
+------+--------+---------+
Returned 2 row(s) in 0.10s
[localhost:21000] > describe formatted v1;
Query: describe formatted v1
Query finished, fetching results ...
+------------------------------+------------------------------+----------------------+
| name                         | type                         | comment              |
+------------------------------+------------------------------+----------------------+
| # col_name                   | data_type                    | comment              |
|                              | NULL                         | NULL                 |
| x                            | int                          | None                 |
| _c1                          | string                       | None                 |
|                              | NULL                         | NULL                 |
| # Detailed Table Information | NULL                         | NULL                 |
| Database:                    | describe_formatted           | NULL                 |
| Owner:                       | cloudera                     | NULL                 |
| CreateTime:                  | Mon Jul 22 16:56:38 EDT 2013 | NULL                 |
| LastAccessTime:              | UNKNOWN                      | NULL                 |
| Protect Mode:                | None                         | NULL                 |
| Retention:                   | 0                            | NULL                 |
| Table Type:                  | VIRTUAL_VIEW                 | NULL                 |
| Table Parameters:            | NULL                         | NULL                 |
|                              | transient_lastDdlTime        | 1374526598           |
|                              | NULL                         | NULL                 |
| # Storage Information        | NULL                         | NULL                 |
| SerDe Library:               | null                         | NULL                 |
| InputFormat:                 | null                         | NULL                 |
| OutputFormat:                | null                         | NULL                 |
| Compressed:                  | No                           | NULL                 |
| Num Buckets:                 | 0                            | NULL                 |
| Bucket Columns:              | []                           | NULL                 |
| Sort Columns:                | []                           | NULL                 |
|                              | NULL                         | NULL                 |
| # View Information           | NULL                         | NULL                 |
| View Original Text:          | SELECT x, upper(s) FROM t1   | NULL                 |
| View Expanded Text:          | SELECT x, upper(s) FROM t1   | NULL                 |
+------------------------------+------------------------------+----------------------+
Returned 28 row(s) in 0.03s
[localhost:21000] > create external table t2 (x int, y int, s string) stored as parquetfile location '/user/cloudera/sample_data';
Query: create external table t2 (x int, y int, s string) stored as parquetfile location '/user/cloudera/sample_data'
[localhost:21000] > describe formatted t2;
Query: describe formatted t2
Query finished, fetching results ...
+------------------------------+----------------------------------------------------+----------------------+
| name                         | type                                               | comment              |
+------------------------------+----------------------------------------------------+----------------------+
| # col_name                   | data_type                                          | comment              |
|                              | NULL                                               | NULL                 |
| x                            | int                                                | None                 |
| y                            | int                                                | None                 |
| s                            | string                                             | None                 |
|                              | NULL                                               | NULL                 |
| # Detailed Table Information | NULL                                               | NULL                 |
| Database:                    | describe_formatted                                 | NULL                 |
| Owner:                       | cloudera                                           | NULL                 |
| CreateTime:                  | Mon Jul 22 17:01:47 EDT 2013                       | NULL                 |
| LastAccessTime:              | UNKNOWN                                            | NULL                 |
| Protect Mode:                | None                                               | NULL                 |
| Retention:                   | 0                                                  | NULL                 |
| Location:                    | hdfs://127.0.0.1:8020/user/cloudera/sample_data    | NULL                 |
| Table Type:                  | EXTERNAL_TABLE                                     | NULL                 |
| Table Parameters:            | NULL                                               | NULL                 |
|                              | EXTERNAL                                           | TRUE                 |
|                              | transient_lastDdlTime                              | 1374526907           |
|                              | NULL                                               | NULL                 |
| # Storage Information        | NULL                                               | NULL                 |
| SerDe Library:               | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL                 |
| InputFormat:                 | com.cloudera.impala.hive.serde.ParquetInputFormat  | NULL                 |
| OutputFormat:                | com.cloudera.impala.hive.serde.ParquetOutputFormat | NULL                 |
| Compressed:                  | No                                                 | NULL                 |
| Num Buckets:                 | 0                                                  | NULL                 |
| Bucket Columns:              | []                                                 | NULL                 |
| Sort Columns:                | []                                                 | NULL                 |
+------------------------------+----------------------------------------------------+----------------------+
Returned 27 row(s) in 0.17s

DISTINCT Operator

The DISTINCT operator in a SELECT statement filters the result set to remove duplicates:

-- Returns the unique values from one column.
-- NULL is included in the set of values if any rows have a NULL in this column.
select distinct c_birth_country from customer;
-- Returns the unique combinations of values from multiple columns.
select distinct c_salutation, c_last_name from customer;

You can use DISTINCT in combination with an aggregation function, typically COUNT(), to find how many different values a column contains:

-- Counts the unique values from one column.
-- NULL is not included as a distinct value in the count.
select count(distinct c_birth_country) from customer;
-- Counts the unique combinations of values from multiple columns.
select count(distinct c_salutation, c_last_name) from customer;

One construct that Impala SQL does not support is using DISTINCT in more than one aggregation function in the same query. For example, you could not have a single query with both COUNT(DISTINCT c_first_name) and COUNT(DISTINCT c_last_name) in the SELECT list.
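For example, a query such as the following (using the column names from this paragraph) would be rejected:

-- Not supported: two different DISTINCT aggregates in one statement.
-- select count(distinct c_first_name), count(distinct c_last_name) from customer;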

  Note:

In contrast with some database systems that always return DISTINCT values in sorted order, Impala does not do any ordering of DISTINCT values. Always include an ORDER BY clause if you need the values in alphabetical or numeric sorted order.

DOUBLE Data Type

An 8-byte (double precision) floating-point data type used in CREATE TABLE and ALTER TABLE statements.

Range: 4.94065645841246544e-324d .. 1.79769313486231570e+308, positive or negative

Conversions: Impala does not automatically convert DOUBLE to any other type. You can use CAST() to convert DOUBLE values to FLOAT, TINYINT, SMALLINT, INT, BIGINT, STRING, TIMESTAMP, or BOOLEAN. You can use exponential notation in DOUBLE literals or when casting from STRING, for example 1.0e6 to represent one million.
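For example:

-- Exponential notation in a literal and in a cast from STRING.
select 1.0e6, cast('2.5e-3' as double);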

Related information: Mathematical Functions

DROP DATABASE Statement

Removes a database from the system, and deletes the corresponding *.db directory from HDFS. The database must be empty before it can be dropped, to avoid losing any data.

The syntax for dropping a database is:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name;

Before dropping a database, use a combination of DROP TABLE, DROP VIEW, ALTER TABLE, and ALTER VIEW statements, to drop all the tables and views in the database or move them to other databases.

Examples:

See CREATE DATABASE Statement for examples covering CREATE DATABASE, USE, and DROP DATABASE.

DROP TABLE Statement

The syntax for dropping a table is:

DROP TABLE [IF EXISTS] table_name

By default, Impala removes the associated HDFS directory and data files for the table. If the table was created with the EXTERNAL clause, Impala leaves all files and directories untouched.

Make sure that you are in the correct database before dropping a table, either by issuing a USE statement first or by using a fully qualified name db_name.table_name.

Examples:

create database temporary;
use temporary;
create table unimportant (x int);
create table trivial (s string);
-- Drop a table in the current database.
drop table unimportant;
-- Switch to a different database.
use default;
-- To drop a table in a different database...
drop table trivial;
ERROR: AnalysisException: Table does not exist: default.trivial
-- ...use a fully qualified name.
drop table temporary.trivial;

To drop a set of related tables quickly, either because you do not need them anymore or to free up space, consider putting them in the same database and using a single DROP DATABASE statement instead of a DROP TABLE statement for each table. See CREATE DATABASE Statement for examples showing sequences of CREATE DATABASE, CREATE TABLE, and DROP DATABASE statements.

DROP VIEW Statement

Removes the specified view, which was originally created by the CREATE VIEW statement. Because a view is purely a logical construct (an alias for a query) with no physical data behind it, DROP VIEW only involves changes to metadata in the metastore database, not any data files in HDFS.

DROP VIEW [database_name.]view_name
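
Examples (hypothetical view names):

drop view v1;
drop view some_db.v2;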

EXPLAIN Statement

Returns the execution plan for a statement, showing the low-level mechanisms that Impala will use to read the data, divide the work among nodes in the cluster, and transmit intermediate and final results across the network. Use explain followed by a complete SELECT query. For example:

[impalad-host:21000] > explain select count(*) from customer_address;
PLAN FRAGMENT 0
  PARTITION: UNPARTITIONED

  3:AGGREGATE
  |  output: SUM(<slot 0>)
  |  group by: 
  |  tuple ids: 1 
  |  
  2:EXCHANGE
     tuple ids: 1 

PLAN FRAGMENT 1
  PARTITION: RANDOM

  STREAM DATA SINK
    EXCHANGE ID: 2
    UNPARTITIONED

  1:AGGREGATE
  |  output: COUNT(*)
  |  group by: 
  |  tuple ids: 1 
  |  
  0:SCAN HDFS
     table=default.customer_address #partitions=1 size=5.25MB
     tuple ids: 0

You can interpret the output to judge whether the query is performing efficiently, and adjust the query and/or the schema if not. For example, you might change the tests in the WHERE clause, add hints to make join operations more efficient, introduce subqueries, change the order of tables in a join, add or change partitioning for a table, collect column statistics and/or table statistics in Hive, or any other performance tuning steps.

If you come from a traditional database background and are not familiar with data warehousing, keep in mind that Impala is optimized for full table scans across very large tables. The structure and distribution of this data is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP environments. Seeing a query scan entirely through a large table is quite common, not necessarily an indication of an inefficient query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for example by using a query that affects only certain partitions within a partitioned table, then you might speed up a query so that it executes in seconds rather than minutes or hours.

External Tables

The syntax CREATE EXTERNAL TABLE sets up an Impala table that points at existing data files. This operation saves the expense of importing the data into a new table when you already have the data files in a known location in HDFS, in the desired file format.

  • You can use Impala to query the data in this table.
  • If you add or replace data using HDFS operations, issue the REFRESH command in impala-shell so that Impala recognizes the changes in data files, block locations, and so on.
  • When you issue a DROP TABLE statement in Impala, that removes the connection that Impala has with the associated data files, but does not physically remove the underlying data. You can continue to use the data files with other Hadoop components and HDFS operations.
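
For example, a minimal sketch that points a table at existing tab-delimited data files (the table, column, and path names are hypothetical):

create external table log_data (ts timestamp, msg string)
  row format delimited fields terminated by '\t'
  location '/user/etl/log_data';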

FLOAT Data Type

A 4-byte (single precision) floating-point data type used in CREATE TABLE and ALTER TABLE statements.

Range: 1.40129846432481707e-45 .. 3.40282346638528860e+38, positive or negative

Conversions: Impala automatically converts FLOAT to more precise DOUBLE values, but not the other way around. You can use CAST() to convert FLOAT values to TINYINT, SMALLINT, INT, BIGINT, STRING, TIMESTAMP, or BOOLEAN. You can use exponential notation in FLOAT literals or when casting from STRING, for example 1.0e6 to represent one million.

Related information: Mathematical Functions

GROUP BY Clause

Specify the GROUP BY clause in queries that use aggregation functions, such as COUNT(), SUM(), AVG(), MIN(), and MAX(). Specify in the GROUP BY clause the names of all the columns that do not participate in the aggregation operation.

For example, the following query finds the 5 items that sold the highest total quantity (using the SUM() function), and also counts the number of sales transactions for those items (using the COUNT() function). Because the column representing the item IDs is not used in any aggregation functions, we specify that column in the GROUP BY clause.

select
  ss_item_sk as Item,
  count(ss_item_sk) as Times_Purchased,
  sum(ss_quantity) as Total_Quantity_Purchased
from store_sales
  group by ss_item_sk 
  order by sum(ss_quantity) desc
  limit 5;
+-------+-----------------+--------------------------+
| item  | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 9325  | 372             | 19072                    |
| 4279  | 357             | 18501                    |
| 7507  | 371             | 18475                    |
| 5953  | 369             | 18451                    |
| 16753 | 375             | 18446                    |
+-------+-----------------+--------------------------+

The HAVING clause lets you filter the results of aggregate functions, because you cannot refer to those expressions in the WHERE clause. For example, to find the 5 lowest-selling items that were included in at least 100 sales transactions, we could use this query:

select
  ss_item_sk as Item,
  count(ss_item_sk) as Times_Purchased,
  sum(ss_quantity) as Total_Quantity_Purchased
from store_sales
  group by ss_item_sk 
  having times_purchased >= 100
  order by sum(ss_quantity)
  limit 5;
+-------+-----------------+--------------------------+
| item  | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 13943 | 105             | 4087                     |
| 2992  | 101             | 4176                     |
| 4773  | 107             | 4204                     |
| 14350 | 103             | 4260                     |
| 11956 | 102             | 4275                     |
+-------+-----------------+--------------------------+

When performing calculations involving scientific or financial data, remember that columns with type FLOAT or DOUBLE are stored as true floating-point numbers, which cannot precisely represent every possible fractional value. Thus, if you include a FLOAT or DOUBLE column in a GROUP BY clause, the results might not precisely match literal values in your query or from an original Text data file. Use rounding operations, the BETWEEN operator, or another arithmetic technique to match floating-point values that are "near" literal values you expect. For example, this query on the ss_wholesale_cost column returns cost values that are close but not identical to the original figures that were entered as decimal fractions.

select ss_wholesale_cost, avg(ss_quantity * ss_sales_price) as avg_revenue_per_sale
  from sales
  group by ss_wholesale_cost
  order by avg_revenue_per_sale desc
  limit 5;
+-------------------+----------------------+
| ss_wholesale_cost | avg_revenue_per_sale |
+-------------------+----------------------+
| 96.94000244140625 | 4454.351539300434    |
| 95.93000030517578 | 4423.119941283189    |
| 98.37999725341797 | 4332.516490316291    |
| 97.97000122070312 | 4330.480601655014    |
| 98.52999877929688 | 4291.316953108634    |
+-------------------+----------------------+

Notice how wholesale cost values originally entered as decimal fractions such as 96.94 and 98.38 are slightly larger or smaller in the result set, due to precision limitations in the hardware floating-point types. The imprecise representation of FLOAT and DOUBLE values is why financial data processing systems often store currency using data types that are less space-efficient but avoid these types of rounding errors.

HAVING Clause

Performs a filter operation on a SELECT query, by examining the results of aggregation functions rather than testing each individual table row. Thus, it is always used in conjunction with a function such as COUNT(), SUM(), AVG(), MIN(), or MAX(), and typically with the GROUP BY clause as well.
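
For example, a hypothetical sketch:

-- Only report items that were purchased at least 100 times.
select item, count(*) as times_purchased
  from sales
  group by item
  having count(*) >= 100;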

Hints

The Impala SQL dialect supports query hints, for fine-tuning the inner workings of queries. Specify hints as a temporary workaround for expensive queries, where missing statistics or other factors cause inefficient performance. The hints are represented as keywords enclosed in square brackets ([ ]).

Currently, all the hints control the execution strategy for join queries. Specify one of the following constructs immediately after the JOIN keyword in a query:

  • [SHUFFLE] - Makes that join operation use the "partitioned" technique, which divides up corresponding rows from both tables using a hashing algorithm, sending subsets of the rows to other nodes for processing. (The keyword SHUFFLE is used to indicate a "partitioned join", because that type of join is not related to "partitioned tables".) Since the alternative "broadcast" join mechanism is the default when table and index statistics are unavailable, you might use this hint for queries where broadcast joins are unsuitable; typically, partitioned joins are more efficient for joins between large tables of similar size.
  • [BROADCAST] - Makes that join operation use the "broadcast" technique that sends the entire contents of the right-hand table to all nodes involved in processing the join. This is the default mode of operation when table and index statistics are unavailable, so you would typically only need it if stale metadata caused Impala to mistakenly choose a partitioned join operation. Typically, broadcast joins are more efficient in cases where one table is much smaller than the other. (Put the smaller table on the right side of the JOIN operator.)

To see which join strategy is used for a particular query, examine the EXPLAIN output for that query.

  Note:

Because hints can prevent queries from taking advantage of new metadata or improvements in query planning, use them only when required to work around performance issues, and be prepared to remove them when they are no longer required, such as after a new Impala release or bug fix.

For example, this query joins a large customer table with a small lookup table of less than 100 rows. The right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the [broadcast] hint to force a broadcast join strategy:

select customer.address, state_lookup.state_name
  from customer join [broadcast] state_lookup
  on (customer.state_id = state_lookup.state_id);

This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of hints and find that it is more efficient to transmit portions of each table to other nodes for processing. Thus, you would use the [shuffle] hint to force a partitioned join strategy:

select weather.wind_velocity, geospatial.altitude
  from weather join [shuffle] geospatial
  on (weather.lat = geospatial.lat and weather.long = geospatial.long);

For joins involving three or more tables, the hint applies to the tables on either side of that specific JOIN keyword. The joins are processed from left to right. For example, this query joins t1 and t2 using a partitioned join, then joins that result set to t3 using a broadcast join:

select t1.name, t2.id, t3.price
  from t1 join [shuffle] t2 join [broadcast] t3
  on (t1.id = t2.id and t2.id = t3.id);

For more background information and performance considerations for join queries, see Joins.

INSERT Statement

Impala supports inserting into tables and partitions that you create with the Impala CREATE TABLE statement, or pre-defined tables and partitions created through Hive.

Impala currently supports:

  • INSERT INTO to append data to a table.
  • INSERT OVERWRITE to replace the data in a table.
  • Copy data from another table using a SELECT query.
  • An optional WITH clause before the INSERT keyword, to define a subquery referenced in the SELECT portion.
  • Create one or more new rows using constant expressions through the VALUES clause. (The VALUES clause was added in Impala 1.0.1.)
  • Specify the names or order of columns to be inserted, different from the columns of the table being queried by the INSERT statement. (This feature was added in Impala 1.1.)
  Note:
  • Insert commands that partition or add files result in changes to Hive metadata. Because Impala uses Hive metadata, such changes may necessitate a metadata refresh. For more information, see the REFRESH statement.
  • Currently, Impala can only insert data into tables that use the text and Parquet formats. For other file formats, insert the data using Hive and use Impala to query it.

The following example sets up new tables with the same definition as the TAB1 table from the Tutorial section, using different file formats, and demonstrates inserting data into the tables created with the STORED AS TEXTFILE and STORED AS PARQUETFILE clauses:

CREATE DATABASE IF NOT EXISTS file_formats;
USE file_formats;

DROP TABLE IF EXISTS text_table;
CREATE TABLE text_table
( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP )
STORED AS TEXTFILE;

DROP TABLE IF EXISTS parquet_table;
CREATE TABLE parquet_table
( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP )
STORED AS PARQUETFILE;

With the INSERT INTO TABLE syntax, each new set of inserted rows is appended to any existing data in the table. This is how you would record small amounts of data that arrive continuously, or ingest new batches of data alongside the existing data. For example, after running 2 INSERT INTO TABLE statements with 5 rows each, the table contains 10 rows total:

[localhost:21000] > insert into table text_table select * from default.tab1;
Inserted 5 rows in 0.41s

[localhost:21000] > insert into table text_table select * from default.tab1;
Inserted 5 rows in 0.46s

[localhost:21000] > select count(*) from text_table;
+----------+
| count(*) |
+----------+
| 10       |
+----------+
Returned 1 row(s) in 0.26s

With the INSERT OVERWRITE TABLE syntax, each new set of inserted rows replaces any existing data in the table. This is how you load data to query in a data warehousing scenario where you analyze just the data for a particular day, quarter, and so on, discarding the previous data each time. You might keep the entire set of data in one raw table, and transfer and transform certain rows into a more compact and efficient form to perform intensive analysis on that subset.

For example, here we insert 5 rows into a table using the INSERT INTO clause, then replace the data by inserting 3 rows with the INSERT OVERWRITE clause. Afterward, the table only contains the 3 rows from the final INSERT statement.

[localhost:21000] > insert into table parquet_table select * from default.tab1;
Inserted 5 rows in 0.35s

[localhost:21000] > insert overwrite table parquet_table select * from default.tab1 limit 3;
Inserted 3 rows in 0.43s
[localhost:21000] > select count(*) from parquet_table;
+----------+
| count(*) |
+----------+
| 3        |
+----------+
Returned 1 row(s) in 0.43s

The VALUES clause lets you insert one or more rows by specifying constant values for all the columns. The number, types, and order of the expressions must match the table definition.

  Note: The INSERT ... VALUES technique is not suitable for loading large quantities of data into HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a separate data file. Use it for setting up small dimension tables or tiny amounts of data for experimenting with SQL syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations. Do not run scripts with thousands of INSERT ... VALUES statements that insert a single row each time. If you do run INSERT ... VALUES operations to load data into a staging table as one stage in an ETL pipeline, include multiple row values if possible within each VALUES clause, and use a separate database to make cleanup easier if the operation does produce many tiny files.

The following example shows how to insert one row or multiple rows, with expressions of different types, using literal values, expressions, and function return values:

create table val_test_1 (c1 int, c2 float, c3 string, c4 boolean, c5 timestamp);
insert into val_test_1 values (100, 99.9/10, 'abc', true, now());
create table val_test_2 (id int, token string);
insert overwrite val_test_2 values (1, 'a'), (2, 'b'), (-1,'xyzzy');

These examples show the type of "not implemented" error that you see when attempting to insert data into a table with a file format that Impala currently does not write to:

DROP TABLE IF EXISTS sequence_table;
CREATE TABLE sequence_table
( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP )
STORED AS SEQUENCEFILE;

DROP TABLE IF EXISTS rc_table;
CREATE TABLE rc_table
( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP )
STORED AS RCFILE;

[localhost:21000] > insert into table rc_table select * from default.tab1;
Query: insert into table rc_table select * from default.tab1
Remote error
Backend 0:RC_FILE not implemented.

[localhost:21000] > insert into table sequence_table select * from default.tab1;
Query: insert into table sequence_table select * from default.tab1
Remote error
Backend 0:SEQUENCE_FILE not implemented.

Inserting data into partitioned tables requires slightly different syntax that divides the partitioning columns from the others:

create table t1 (i int) partitioned by (x int, y string);
-- Select an INT column from another table.
-- All inserted rows will have the same x and y values, as specified in the INSERT statement.
insert into t1 partition(x=10, y='a') select c1 from some_other_table;
-- Select two INT columns from another table.
-- All inserted rows will have the same y value, as specified in the INSERT statement.
-- Values from c2 go into t1.x.
-- Any partitioning columns whose value is not specified are filled in
-- from the columns specified last in the SELECT list.
insert into t1 partition(x, y='b') select c1, c2 from some_other_table;
-- Select an INT and a STRING column from another table.
-- All inserted rows will have the same x value, as specified in the INSERT statement.
-- Values from c3 go into t1.y.
insert into t1 partition(x=20, y) select c1, c3 from some_other_table;

The following example shows how you can copy the data in all the columns from one table to another, copy the data from only some columns, or specify the columns in the select list in a different order than they actually appear in the table:

-- Start with 2 identical tables.
create table t1 (c1 int, c2 int);
create table t2 like t1;

-- If there is no () part after the destination table name,
-- all columns must be specified, either as * or by name.
insert into t2 select * from t1;
insert into t2 select c1, c2 from t1;

-- With the () notation following the destination table name,
-- you can omit columns (all values for that column are NULL
-- in the destination table), and/or reorder the values
-- selected from the source table. This is the "column permutation" feature.
insert into t2 (c1) select c1 from t1;
insert into t2 (c2, c1) select c1, c2 from t1;

-- The column names can be entirely different in the source and destination tables.
-- You can copy any columns, not just the corresponding ones, from the source table.
-- But the number and type of selected columns must match the columns mentioned in the () part.
alter table t2 replace columns (x int, y int);
insert into t2 (y) select c1 from t1;

-- For partitioned tables, all the partitioning columns must be mentioned in the () column list
-- or a PARTITION clause; these columns cannot be defaulted to NULL.
create table pt1 (x int, y int) partitioned by (z int);
-- The values from c1 are copied into the column x in the new table,
-- all in the same partition based on a constant value for z.
-- The values of y in the new table are all NULL.
insert into pt1 (x) partition (z=5) select c1 from t1;
-- Again we omit the values for column y so they are all NULL.
-- The inserted x values can go into different partitions, based on
-- the different values inserted into the partitioning column z.
insert into pt1 (x,z) select x, y from t2;

Concurrency considerations: Each INSERT operation creates new data files with unique names, so you can run multiple INSERT INTO statements simultaneously without conflicts. If data is inserted into a table by a statement issued to a different impalad node, issue a REFRESH table_name statement to make the node you are connected to aware of this new data. While data is being inserted into an Impala table, the data is staged temporarily in a subdirectory inside the data directory; during this period, you cannot issue queries against that table in Hive. If an INSERT operation fails, the temporary data file and the subdirectory could be left behind in the data directory. If so, remove the relevant subdirectory and any data files it contains manually, by issuing an hdfs dfs -rm -r command, specifying the full path of the work subdirectory, whose name ends in _dir.

INT Data Type

A 4-byte integer data type used in CREATE TABLE and ALTER TABLE statements.

Range: -2147483648 .. 2147483647. There is no UNSIGNED subtype.

Conversions: Impala automatically converts INT values to a larger integer type (BIGINT) or a floating-point type (FLOAT or DOUBLE). Use CAST() to convert to TINYINT, SMALLINT, STRING, or TIMESTAMP. Casting an integer value N to TIMESTAMP produces a value that is N seconds past the start of the epoch date (January 1, 1970).
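
For example, a few explicit conversions with CAST(), using literal values for illustration:

select cast(1000 as smallint);
select cast(1000 as string);
-- 1000 seconds past the epoch: 1970-01-01 00:16:40
select cast(1000 as timestamp);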

Related information: Mathematical Functions

Internal Tables

The default kind of table produced by the CREATE TABLE statement is known as an internal table. (Its counterpart is the external table, produced by the CREATE EXTERNAL TABLE syntax.)

  • Impala creates a directory in HDFS to hold the data files.
  • You load data by issuing INSERT statements in impala-shell or by using the LOAD DATA statement in Hive.
  • When you issue a DROP TABLE statement, Impala physically removes all the data files from the directory.
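
For example, this sequence (with an illustrative table name) creates an internal table, confirms the HDFS directory that Impala manages for it, and then removes the table along with its data files:

create table internal_demo (id int, s string);
-- The Location: field of the output shows the table's HDFS data directory.
describe formatted internal_demo;
-- Removes the table metadata and physically deletes the data files.
drop table internal_demo;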

INVALIDATE METADATA Statement

Marks the metadata for one or all tables as stale. Required after a table is created through the Hive shell, before the table is available for Impala queries. The next time the current Impala node performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. This is a relatively expensive operation compared to the incremental metadata update done by the REFRESH statement, so in the common scenario of adding new data files to an existing table, prefer REFRESH rather than INVALIDATE METADATA. If you are not familiar with the way Impala uses metadata and how it shares the same metastore database as Hive, see Overview of Impala Metadata and the Metastore for background information.

To accurately respond to queries, Impala must have current metadata about those databases and tables that clients query directly. Therefore, if some other entity modifies information used by Impala in the metastore that Impala and Hive share, the information cached by Impala must be updated. However, this does not mean that all metadata updates require an Impala update.

  Note:

The INVALIDATE METADATA statement is new in Impala 1.1, and takes over some of the use cases of the Impala 1.0 REFRESH statement. Because REFRESH now requires a table name parameter, to flush the metadata for all tables at once, use the INVALIDATE METADATA statement.

Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell or through a different Impala node, you must enter INVALIDATE METADATA with no table parameter before you can see the new table in impala-shell. Once the table is known to the Impala node, you can issue REFRESH table_name after you add data files for that table.

INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.

The syntax for the INVALIDATE METADATA command is:

INVALIDATE METADATA [table_name]

By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for that one table is flushed. Even for a single table, INVALIDATE METADATA is more expensive than REFRESH, so prefer REFRESH in the common case where you add new data files for an existing table.
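
For example, if the data files for an existing table t1 (a hypothetical name) were reorganized extensively, such as by the HDFS balancer, you could flush just that table's metadata rather than the entire catalog:

invalidate metadata t1;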

A metadata update for an impalad instance is required if:

  • A metadata change occurs.
  • and the change is made from another impalad instance in your cluster, or through Hive.
  • and the change is made to a database to which clients such as the Impala shell or ODBC directly connect.

A metadata update for an Impala node is not required when you issue queries from the same Impala node where you ran ALTER TABLE, INSERT, or other table-modifying statement.

Database and table metadata is typically modified by:

  • Hive - via ALTER, CREATE, DROP or INSERT operations.
  • Impalad - via CREATE TABLE, ALTER TABLE, and INSERT operations.

INVALIDATE METADATA causes the metadata for that table to be marked as stale, and reloaded the next time the table is referenced. For a huge table, that process could take a noticeable amount of time; thus you might prefer to use REFRESH where practical, to avoid an unpredictable delay later, for example if the next reference to the table is during a benchmark test.

The following example shows how you might use the INVALIDATE METADATA statement after creating new tables (such as Avro or HBase tables) through the Hive shell. Before the INVALIDATE METADATA statement was issued, Impala would give a "table not found" error if you tried to refer to those table names. The DESCRIBE statements cause the latest metadata to be immediately loaded for the tables, avoiding a delay the next time those tables are queried.

[impalad-host:21000] > invalidate metadata;
[impalad-host:21000] > describe t1;
...
[impalad-host:21000] > describe t2;
...

If you need to ensure that the metadata is up-to-date when you start an impala-shell session, run impala-shell with the -r or --refresh_after_connect command-line option. Because this operation adds a delay to the next query against each table, potentially expensive for large tables with many partitions, try to avoid using this option for day-to-day operations in a production environment.

Joins

A join query is one that combines data from two or more tables, and returns a result set containing items from some or all of those tables. You typically use join queries in situations like these:

  • When related data arrives from different sources, with each data set physically residing in a separate table. For example, you might have address data from business records that you cross-check against phone listings or census data.
  • When data is normalized, a technique for reducing data duplication by dividing it across multiple tables. This kind of organization is often found in data that comes from traditional relational database systems. For example, instead of repeating some long string such as a customer name in multiple tables, each table might contain a numeric customer ID. Queries that need to display the customer name could "join" the table that specifies which customer ID corresponds to which name.
  • When certain columns are rarely needed for queries, so they are moved into separate tables to reduce overhead for common queries. For example, a biography field might be rarely needed in queries on employee data. Putting that field in a separate table reduces the amount of I/O for common queries on employee addresses or phone numbers. Queries that do need the biography column can retrieve it by performing a join with that separate table.
  Note:

Join ordering affects Impala performance. For best results:

  • Join the biggest table first.
  • Join subsequent tables according to which table has the most selective filter. Joining the table with the most selective filter results in the fewest rows being returned.
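
For example, a hypothetical three-table join following these guidelines lists the biggest table first, then the table whose filter eliminates the most rows:

-- huge_fact is the biggest table; dim_small has the most selective filter.
select f.total, d1.name, d2.region
  from huge_fact f
  join dim_small d1 on (f.d1_id = d1.id)
  join dim_other d2 on (f.d2_id = d2.id)
  where d1.category = 'Electronics';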

The result set from a join query is filtered by including the corresponding join column names in an ON clause, or by using comparison operators referencing columns from both tables in the WHERE clause.

[localhost:21000] > select c_last_name, ca_city from customer join customer_address where c_customer_sk = ca_address_sk;
+-------------+-----------------+
| c_last_name | ca_city         |
+-------------+-----------------+
| Lewis       | Fairfield       |
| Moses       | Fairview        |
| Hamilton    | Pleasant Valley |
| White       | Oak Ridge       |
| Moran       | Glendale        |
...
| Richards    | Lakewood        |
| Day         | Lebanon         |
| Painter     | Oak Hill        |
| Bentley     | Greenfield      |
| Jones       | Stringtown      |
+-------------+-----------------+
Returned 50000 row(s) in 9.82s

One potential downside of joins is the possibility of excess resource usage in poorly constructed queries. For example, if T1 contains 1000 rows and T2 contains 1000 rows, a query SELECT columns FROM t1 JOIN t2 could return up to 1 million rows (1000 * 1000). To minimize the chance of runaway queries on large data sets, Impala requires every join query to contain at least one equality predicate between the columns of the various tables.

Even with equality clauses, the result set can still be large, as the previous example showed, so you might use a LIMIT clause to return a subset of the results:

[localhost:21000] > select c_last_name, ca_city from customer, customer_address where c_customer_sk = ca_address_sk limit 10;
+-------------+-----------------+
| c_last_name | ca_city         |
+-------------+-----------------+
| Lewis       | Fairfield       |
| Moses       | Fairview        |
| Hamilton    | Pleasant Valley |
| White       | Oak Ridge       |
| Moran       | Glendale        |
| Sharp       | Lakeview        |
| Wiles       | Farmington      |
| Shipman     | Union           |
| Gilbert     | New Hope        |
| Brunson     | Martinsville    |
+-------------+-----------------+
Returned 10 row(s) in 0.63s

Or you might use additional comparison operators or aggregation functions to condense a large result set into a smaller set of values:

[localhost:21000] > -- Find the names of customers who live in one particular town.
[localhost:21000] > select distinct c_last_name from customer, customer_address where
  c_customer_sk = ca_address_sk
  and ca_city = "Green Acres";
+---------------+
| c_last_name   |
+---------------+
| Hensley       |
| Pearson       |
| Mayer         |
| Montgomery    |
| Ricks         |
...
| Barrett       |
| Price         |
| Hill          |
| Hansen        |
| Meeks         |
+---------------+
Returned 332 row(s) in 0.97s

[localhost:21000] > -- See how many different customers in this town have names starting with "A".
[localhost:21000] > select count(distinct c_last_name) from customer, customer_address where
  c_customer_sk = ca_address_sk
  and ca_city = "Green Acres"
  and substr(c_last_name,1,1) = "A";
+-----------------------------+
| count(distinct c_last_name) |
+-----------------------------+
| 12                          |
+-----------------------------+
Returned 1 row(s) in 1.00s

Because a join query can involve reading large amounts of data from disk, sending large amounts of data across the network, and loading large amounts of data into memory to do the comparisons and filtering, you might do benchmarking, performance analysis, and query tuning to find the most efficient join queries for your data set, hardware capacity, network configuration, and cluster workload.

The two categories of joins in Impala are known as partitioned joins and broadcast joins. If inaccurate table or column statistics, or some quirk of the data distribution, causes Impala to choose the wrong mechanism for a particular join, consider using query hints as a temporary workaround. For details, see Hints.

LIKE Operator

A comparison operator for STRING data, with basic wildcard capability using _ to match a single character and % to match multiple characters. The argument expression must match the entire string value. Typically, it is more efficient to put any % wildcard match at the end of the string.

Examples:

select distinct c_last_name from customer where c_last_name like 'Mc%' or c_last_name like 'Mac%';
select count(c_last_name) from customer where c_last_name like 'M%';
select c_email_address from customer where c_email_address like '%.edu';

-- We can find 4-letter names beginning with 'M' by calling functions...
select distinct c_last_name from customer where length(c_last_name) = 4 and substr(c_last_name,1,1) = 'M';
-- ...or in a more readable way by matching M followed by exactly 3 characters.
select distinct c_last_name from customer where c_last_name like 'M___';

LIMIT Clause

The LIMIT clause in a SELECT query sets a maximum number of rows for the result set. It is useful in contexts such as:

  • To return exactly N items from a top-N query, such as the 10 highest-rated items in a shopping category or the 50 hostnames that refer the most traffic to a web site.
  • To demonstrate some sample values from a table or a particular query, for a query with no ORDER BY clause.
  • To keep queries from returning huge result sets by accident if a table is larger than expected, or a WHERE clause matches more rows than expected.

Top-N queries are so common, and tables queried by Impala are typically so large, that Impala requires any query including an ORDER BY clause to also use a LIMIT clause, to cap the size of the overall result set. You can specify the LIMIT clause as part of the query, or set a default limit for all queries.
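
For example (with illustrative table names):

-- Glance at a few sample rows, with no particular order.
select * from web_log limit 5;
-- Top-N query: the 10 most expensive products. The ORDER BY requires the LIMIT.
select name, price from products order by price desc limit 10;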

LOAD DATA Statement

The LOAD DATA statement streamlines the ETL process for an internal Impala table by moving a data file or all the data files in a directory from an HDFS location into the Impala data directory for that table.

Syntax:

LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]

Usage Notes:

  • The loaded data files are moved, not copied, into the Impala data directory.
  • You can specify the HDFS path of a single file to be moved, or the HDFS path of a directory to move all the files inside that directory. You cannot specify any sort of wildcard to take only some of the files from a directory. When loading a directory full of data files, keep all the data files at the top level, with no nested directories underneath.
  • Currently, the Impala LOAD DATA statement only imports files from HDFS, not from the local filesystem. It does not support the LOCAL keyword of the Hive LOAD DATA statement. You must specify a path, not an hdfs:// URI.
  • In the interest of speed, only limited error checking is done. If the loaded files have the wrong file format, different columns than the destination table, or other kind of mismatch, Impala does not raise any error for the LOAD DATA statement. Querying the table afterward could produce a runtime error or unexpected results. Currently, the only checking the LOAD DATA statement does is to avoid mixing together uncompressed and LZO-compressed text files in the same table.
  • When you specify an HDFS directory name as the LOAD DATA argument, any hidden files in that directory (files whose names start with a .) are not moved to the Impala data directory.
  • The loaded data files retain their original names in the new location, unless a name conflicts with an existing data file, in which case the name of the new file is modified slightly to be unique. (The name-mangling is a slight difference from the Hive LOAD DATA statement, which replaces identically named files.)
  • By providing an easy way to transport files from known locations in HDFS into the Impala data directory structure, the LOAD DATA statement lets you avoid memorizing the locations and layout of the HDFS directory tree containing the Impala databases and tables. (For a quick way to check the location of the data files for an Impala table, issue the statement DESCRIBE FORMATTED table_name.)
  • The PARTITION clause is especially convenient for ingesting new data for a partitioned table. As you receive new data for a time period, geographic region, or other division that corresponds to one or more partitioning columns, you can load that data straight into the appropriate Impala data directory, which might be nested several levels down if the table is partitioned by multiple columns. When the table is partitioned, you must specify constant values for all the partitioning columns.
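
For example, a hypothetical load of one day's data into a table partitioned by year, month, and day might look like this:

load data inpath '/user/cloudera/daily_feed/2013-06-26'
  into table sales partition (year=2013, month=6, day=26);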

Examples:

First, we use a trivial Python script to write different numbers of strings (one per line) into files stored in the cloudera HDFS user account. (Substitute the path for your own HDFS user account when doing hdfs dfs operations like these.)

$ random_strings.py 1000 | hdfs dfs -put - /user/cloudera/thousand_strings.txt
$ random_strings.py 100 | hdfs dfs -put - /user/cloudera/hundred_strings.txt
$ random_strings.py 10 | hdfs dfs -put - /user/cloudera/ten_strings.txt

Next, we create a table and load an initial set of data into it. Remember, unless you specify a STORED AS clause, Impala tables default to TEXTFILE format with Ctrl-A (hex 01) as the field delimiter. This example uses a single-column table, so the delimiter is not significant. For large-scale ETL jobs, you would typically use binary format data files such as Parquet or Avro, and load them into Impala tables that use the corresponding file format.

[localhost:21000] > create table t1 (s string);
[localhost:21000] > load data inpath '/user/cloudera/thousand_strings.txt' into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.61s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+------+
| _c0  |
+------+
| 1000 |
+------+
Returned 1 row(s) in 0.67s
[localhost:21000] > load data inpath '/user/cloudera/thousand_strings.txt' into table t1;
ERROR: AnalysisException: INPATH location '/user/cloudera/thousand_strings.txt' does not exist.

As indicated by the message at the end of the previous example, the data file was moved from its original location. The following example illustrates how the data file was moved into the Impala data directory for the destination table, keeping its original filename:

$ hdfs dfs -ls /user/hive/warehouse/load_data_testing.db/t1
Found 1 items
-rw-r--r--   1 cloudera cloudera      13926 2013-06-26 15:40 /user/hive/warehouse/load_data_testing.db/t1/thousand_strings.txt

The following example demonstrates the difference between the INTO TABLE and OVERWRITE INTO TABLE clauses. The table already contains 1000 rows. After issuing the LOAD DATA statement with the INTO TABLE clause, the table contains 100 more rows, for a total of 1100. After issuing the LOAD DATA statement with the OVERWRITE INTO TABLE clause, the former contents are gone, and now the table only contains the 10 rows from the just-loaded data file.

[localhost:21000] > load data inpath '/user/cloudera/hundred_strings.txt' into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 2 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.24s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+------+
| _c0  |
+------+
| 1100 |
+------+
Returned 1 row(s) in 0.55s
[localhost:21000] > load data inpath '/user/cloudera/ten_strings.txt' overwrite into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.26s
[localhost:21000] > select count(*) from t1;
Query: select count(*) from t1
Query finished, fetching results ...
+-----+
| _c0 |
+-----+
| 10  |
+-----+
Returned 1 row(s) in 0.62s

MAX Function

An aggregation function that returns the maximum value from a set of numbers. Opposite of the MIN function. Its single argument can be a numeric column, or the numeric result of a function or expression applied to the column value. Rows with a NULL value for the specified column are ignored. If the table is empty, or all the values supplied to MAX are NULL, MAX returns NULL.

When the query contains a GROUP BY clause, returns one value for each combination of grouping values.

Return type: Same as the input argument

Examples:

-- Find the largest value for this column in the table.
select max(c1) from t1;
-- Find the largest value for this column from a subset of the table.
select max(c1) from t1 where month = 'January' and year = '2013';
-- Find the largest value from a set of numeric function results.
select max(length(s)) from t1;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Return more than one result.
select month, year, max(purchase_price) from store_stats group by month, year;
-- Filter the input to eliminate duplicates before performing the calculation.
select max(distinct x) from t1;

MIN Function

An aggregation function that returns the minimum value from a set of numbers. Opposite of the MAX function. Its single argument can be a numeric column, or the numeric result of a function or expression applied to the column value. Rows with a NULL value for the specified column are ignored. If the table is empty, or all the values supplied to MIN are NULL, MIN returns NULL.

When the query contains a GROUP BY clause, returns one value for each combination of grouping values.

Return type: Same as the input argument

Examples:

-- Find the smallest value for this column in the table.
select min(c1) from t1;
-- Find the smallest value for this column from a subset of the table.
select min(c1) from t1 where month = 'January' and year = '2013';
-- Find the smallest value from a set of numeric function results.
select min(length(s)) from t1;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Return more than one result.
select month, year, min(purchase_price) from store_stats group by month, year;
-- Filter the input to eliminate duplicates before performing the calculation.
select min(distinct x) from t1;

NULL

The notion of NULL values is familiar from all kinds of database systems, but each SQL dialect can have its own behavior and restrictions on NULL values. For Big Data processing, the precise semantics of NULL values are significant: any misunderstanding could lead to inaccurate results or misformatted data that could be time-consuming to correct for large data sets.

  • NULL is a different value than an empty string. The empty string is represented by a string literal with nothing inside, "" or ''.
  • In a delimited text file, the NULL value is represented by the special token \N.
  • When Impala inserts data into a partitioned table, and the value of one of the partitioning columns is NULL or the empty string, the data is placed in a special partition that holds only these two kinds of values. When these values are returned in a query, the result is NULL whether the value was originally NULL or an empty string. This behavior is compatible with the way Hive treats NULL values in partitioned tables. Hive does not allow empty strings as partition keys, and it returns a string value such as __HIVE_DEFAULT_PARTITION__ instead of NULL when such values are returned from a query. For example:
    create table t1 (i int) partitioned by (x int, y string);
    -- Select an INT column from another table, with all rows going into a special HDFS subdirectory
    -- named __HIVE_DEFAULT_PARTITION__. Depending on whether one or both of the partitioning keys
    -- are null, this special directory name occurs at different levels of the physical data directory
    -- for the table.
    insert into t1 partition(x=NULL, y=NULL) select c1 from some_other_table;
    insert into t1 partition(x, y=NULL) select c1, c2 from some_other_table;
    insert into t1 partition(x=NULL, y) select c1, c3  from some_other_table;
    
  • There is no NOT NULL clause when defining a column to prevent NULL values in that column.
  • There is no DEFAULT clause to specify a non-NULL default value.
  • If an INSERT operation mentions some columns but not others, the unmentioned columns contain NULL for all inserted rows.
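
To illustrate the first point, a minimal sketch (with an illustrative table name) showing that NULL and the empty string are counted as distinct values:

create table null_demo (s string);
insert into null_demo values (NULL), ('');
select count(*) from null_demo where s is null;  -- returns 1
select count(*) from null_demo where s = '';     -- returns 1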

ORDER BY Clause

The familiar ORDER BY clause of a SELECT statement sorts the result set based on the values from one or more columns. For distributed queries, this is a relatively expensive operation, because the entire result set must be produced and transferred to one node before the sorting can happen. This can require more memory capacity than a query without ORDER BY. Even if the query takes approximately the same time to finish with or without the ORDER BY clause, subjectively it can appear slower because no results are available until all processing is finished, rather than results coming back gradually as rows matching the WHERE clause are found.

Because an ORDER BY clause on a query returning a huge result set can require so much capacity to perform the sort, Impala requires any query including an ORDER BY clause to also use a LIMIT clause, to cap the size of the overall result set. You can specify the LIMIT clause as part of the query, or set a default limit for all queries with the command SET DEFAULT_ORDER_BY_LIMIT=... in impala-shell or the option -default_query_options default_order_by_limit=... when starting impalad. See SELECT Statement for further examples.
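
For example, this top-N query against the customer table from the earlier join examples sorts on one column and caps the result set as required:

select c_last_name, c_first_name
  from customer
  order by c_last_name
  limit 100;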

REFRESH Statement

To accurately respond to queries, the Impala node that acts as the coordinator (the node to which you are connected through impala-shell, JDBC, or ODBC) must have current metadata about those databases and tables that are referenced in Impala queries. If you are not familiar with the way Impala uses metadata and how it shares the same metastore database as Hive, see Overview of Impala Metadata and the Metastore for background information.

Use the REFRESH statement to load the latest metastore metadata and block location data for a particular table in these scenarios:

  • After loading new data files into the HDFS data directory for the table. (Once you have set up an ETL pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why metadata needs to be refreshed.)
  • After issuing ALTER TABLE, INSERT, or other table-modifying SQL statement in Impala, while connected to a different Impala node.
  • After issuing ALTER TABLE, INSERT, or other table-modifying SQL statement in Hive.

You only need to issue the REFRESH statement on the node to which you connect to issue queries. The coordinator node divides the work among all the Impala nodes in a cluster, and sends read requests for the correct HDFS blocks without relying on the metadata on the other nodes.

REFRESH reloads the metadata for the table from the metastore database, and does an incremental reload of the low-level block location data to account for any new data files added to the HDFS data directory for the table. It is a low-overhead, single-table operation, specifically tuned for the common scenario where new data files are added to HDFS.

The syntax for the REFRESH command is:

REFRESH table_name

Only the metadata for the specified table is flushed. The table must already exist and be known to Impala, either because the CREATE TABLE statement was run on the same Impala node, or because a previous INVALIDATE METADATA statement caused Impala to reload its entire metadata catalog.

  Note:

The functionality of the REFRESH statement has changed in Impala 1.1. Now the table name is a required parameter. To flush the metadata for all tables, use the INVALIDATE METADATA command.

Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell or while connected to a different Impala node, you must enter INVALIDATE METADATA with no table parameter before you can see the new table in impala-shell on the current node. Once the table is known to the Impala node, you can issue REFRESH table_name as needed after you add more data files for that table.

INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.

A metadata update for an impalad instance is required if:

  • A metadata change occurs.
  • and the change is made by some other entity, meaning another impalad instance in your cluster or Hive.
  • and the change is made to a database to which clients such as the Impala shell or ODBC directly connect.

A metadata update for an Impala node is not required when you issue queries from the same Impala node where you ran ALTER TABLE, INSERT, or other table-modifying statement.

Database and table metadata is typically modified by:

  • Hive - through ALTER, CREATE, DROP or INSERT operations.
  • Impalad - through CREATE TABLE, ALTER TABLE, and INSERT operations.

REFRESH causes the metadata for that table to be immediately reloaded. For a huge table, that process could take a noticeable amount of time; but doing the refresh up front avoids an unpredictable delay later, for example if the next reference to the table is during a benchmark test.

The following example shows how you might use the REFRESH statement after manually adding new HDFS data files to the Impala data directory for a table:

[impalad-host:21000] > refresh t1;
[impalad-host:21000] > refresh t2;
[impalad-host:21000] > select * from t1;
...
[impalad-host:21000] > select * from t2;
...

In Impala 1.0, the -r option of impala-shell issued REFRESH to reload metadata for all tables. In Impala 1.1 and higher, this option issues INVALIDATE METADATA because REFRESH now requires a table name parameter. Due to the expense of reloading the metadata for all tables, that impala-shell option is not recommended for day-to-day use in a production environment.

REGEXP Operator

Tests whether a value matches a regular expression. Uses the POSIX regular expression syntax where ^ and $ match the beginning and end of the string, . represents any single character, * represents a sequence of zero or more items, + represents a sequence of one or more items, ? makes the preceding item optional (matching it zero or one time), and so on.

The regular expression must match the entire value, not just occur somewhere inside it. Use .* at the beginning and/or the end if you only need to match characters anywhere in the middle. Thus, the ^ and $ atoms are often redundant, although you might already have them in your expression strings that you reuse from elsewhere.

The RLIKE operator is a synonym for REGEXP.

The | symbol is the alternation operator, typically used within () to match different sequences. The () groups do not allow backreferences. To retrieve the part of a value matched within a () section, use the regexp_extract() built-in function.

Examples:

-- Find all customers whose first name starts with 'J', followed by 0 or more of any character.
select c_first_name, c_last_name from customer where c_first_name regexp 'J.*';

-- Find 'Macdonald', where the first 'a' is optional and the 'D' can be upper- or lowercase.
-- The ^...$ aren't required, but make it clear we're matching the start and end of the value.
select c_first_name, c_last_name from customer where c_last_name regexp '^Ma?c[Dd]onald$';

-- Match multiple character sequences, either 'Mac' or 'Mc'.
select c_first_name, c_last_name from customer where c_last_name regexp '(Mac|Mc)donald';

-- Find names starting with 'S', then one or more vowels, then 'r', then any other characters.
-- Matches 'Searcy', 'Sorenson', 'Sauer'.
select c_first_name, c_last_name from customer where c_last_name regexp 'S[aeiou]+r.*';

-- Find names that end with 2 or more vowels: letters from the set a,e,i,o,u.
select c_first_name, c_last_name from customer where c_last_name regexp '.*[aeiou]{2,}$';

-- You can use letter ranges in the [] blocks, for example to find names starting with A, B, or C.
select c_first_name, c_last_name from customer where c_last_name regexp '[A-C].*';

-- If you are not sure about case, leading/trailing spaces, and so on, you can process the
-- column using string functions first.
select c_first_name, c_last_name from customer where lower(c_last_name) regexp 'de.*';

RLIKE Operator

Synonym for the REGEXP operator.

SELECT Statement

Impala SELECT queries support:

  • SQL data types: boolean, tinyint, smallint, int, bigint, float, double, timestamp, string.
  • An optional WITH clause before the SELECT keyword, to define a subquery whose name or column names can be referenced from later in the main query.
  • DISTINCT clause per query. See DISTINCT Operator for details.
  • Subqueries in a FROM clause.
  • WHERE, GROUP BY, HAVING clauses.
  • ORDER BY. Impala requires that queries using this keyword also include a LIMIT clause.
      Note:

    ORDER BY queries require limits on results. These limits can be set when you start Impala, or they can be set in the Impala shell. Setting query options through ODBC or JDBC is not supported at this time, so if you are using either of those connectors, set the limit value when starting Impala. For example, to set this value in the shell, use a command similar to:

    [impalad-host:21000] > set default_order_by_limit=50000

    To set the limit when starting Impala, include the -default_query_option startup parameter for the impalad daemon. For example, to start Impala with a result limit for ORDER BY queries, use a command similar to:

    $ GLOG_v=1 nohup impalad -state_store_host=state_store_hostname -hostname=impalad_hostname -default_query_options default_order_by_limit=50000
  • JOIN clauses. Left, right, semi, full, and outer joins are supported. See Joins for details.
  • UNION ALL.
  • LIMIT.
  • External tables.
  • Relational operators such as greater than, less than, or equal to.
  • Arithmetic operators such as addition or subtraction.
  • Logical/Boolean operators AND, OR, and NOT. Impala does not support the corresponding symbols &&, ||, and !.
  • Common SQL built-in functions such as COUNT, SUM, CAST, LIKE, IN, BETWEEN, and COALESCE. Impala specifically supports built-ins described in Built-in Function Support.
  • The WITH clause to abstract repeated clauses, such as aggregation functions, that are referenced multiple times in the same query. (The WITH clause actually comes before the SELECT statement in a query.)
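
To illustrate the last point, a sketch (with illustrative table and column names) that defines a subquery once in a WITH clause and references it twice:

with big_sales as (select customer_id, amount from sales where amount > 1000)
select count(*) from big_sales
union all
select count(distinct customer_id) from big_sales;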

SHOW Statement

To display a list of available databases or tables, issue these statements:

  • SHOW TABLES [IN database_name]
  • SHOW DATABASES
  • SHOW SCHEMAS - an alias for SHOW DATABASES

For example, you typically issue SHOW DATABASES to see the names you can specify in a USE db_name statement, then after switching to a database you issue SHOW TABLES to see the names you can specify in SELECT and INSERT statements.
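
For example (database and table names are illustrative):

[impalad-host:21000] > show databases;
...
[impalad-host:21000] > use tpcds;
[impalad-host:21000] > show tables;
...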

SMALLINT Data Type

A 2-byte integer data type used in CREATE TABLE and ALTER TABLE statements.

Range: -32768 .. 32767. There is no UNSIGNED subtype.

Conversions: Impala automatically converts SMALLINT values to a larger integer type (INT or BIGINT) or a floating-point type (FLOAT or DOUBLE). Use CAST() to convert to TINYINT, STRING, or TIMESTAMP. Casting an integer value N to TIMESTAMP produces a value that is N seconds past the start of the epoch date (January 1, 1970).

Related information: Mathematical Functions

STRING Data Type

A data type used in CREATE TABLE and ALTER TABLE statements.

Length: 32,767 bytes. (Strictly speaking, the maximum length corresponds to the C/C++ constant SHRT_MAX, which is 32,767 on typical Linux systems.) Do not use any length constraint when declaring STRING columns, as you might be familiar with from VARCHAR, CHAR, or similar column types in relational database systems.

Character sets: For full support in all Impala subsystems, restrict string values to the ASCII character set. Multi-byte character data can be stored in Impala and retrieved through queries, but is not guaranteed to work properly with string manipulation functions, comparison operators, or the ORDER BY clause. For any national language aspects such as collation order or interpreting extended ASCII variants such as ISO-8859-1 or ISO-8859-2 encodings, Impala does not include such metadata with the table definition. If you need to sort, manipulate, or display data depending on those national language characteristics of string data, use logic on the application side.

Conversions: Impala does not automatically convert STRING to any other type. You can use CAST() to convert STRING values to TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, or TIMESTAMP. You cannot cast a STRING value to BOOLEAN, although you can cast a BOOLEAN value to STRING, returning '1' for true values and '0' for false values.
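
For example, a few explicit conversions with CAST(), using literal values for illustration:

select cast('100' as int);
-- Returns '1' for true values, '0' for false values.
select cast(true as string);
select cast('1966-07-30' as timestamp);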

Related information: Date and Time Functions

SUM Function

An aggregation function that returns the sum of a set of numbers. Its single argument can be a numeric column, or the numeric result of a function or expression applied to the column value. Rows with a NULL value for the specified column are ignored. If the table is empty, or all the values supplied to SUM are NULL, SUM returns NULL.

When the query contains a GROUP BY clause, returns one value for each combination of grouping values.

Return type: BIGINT for integer arguments, DOUBLE for floating-point arguments

Examples:

-- Total all the values for this column in the table.
select sum(c1) from t1;
-- Find the total for this column from a subset of the table.
select sum(c1) from t1 where month = 'January' and year = '2013';
-- Find the total from a set of numeric function results.
select sum(length(s)) from t1;
-- Often used with functions that return predefined values to compute a score.
select sum(case when grade = 'A' then 1.0 when grade = 'B' then 0.75 else 0 end) as class_honors from test_scores;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Return more than one result.
select month, year, sum(purchase_price) from store_stats group by month, year;
-- Filter the input to eliminate duplicates before performing the calculation.
select sum(distinct x) from t1;

TIMESTAMP Data Type

A data type used in CREATE TABLE and ALTER TABLE statements, representing a point in time.

Range: Internally, the resolution of the time portion of a TIMESTAMP value is in nanoseconds.

Time zones: Impala does not store timestamps using the local timezone. Timestamps are stored relative to GMT.

Conversions: Impala automatically converts STRING literals of the correct format into TIMESTAMP values, for example, '1966-07-30' or '1985-09-25 17:45:30.5'. You can cast an integer or floating-point value N to TIMESTAMP, producing a value that is N seconds past the start of the epoch date (January 1, 1970).
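
For example, a day contains 86,400 seconds, so casting that number to TIMESTAMP lands exactly one day past the start of the epoch:

-- Returns 1970-01-02 00:00:00.
select cast(86400 as timestamp);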

Related information: Date and Time Functions

TINYINT Data Type

A 1-byte integer data type used in CREATE TABLE and ALTER TABLE statements.

Range: -128 .. 127. There is no UNSIGNED subtype.

Conversions: Impala automatically converts TINYINT values to a larger integer type (SMALLINT, INT, or BIGINT) or a floating-point type (FLOAT or DOUBLE). Use CAST() to convert to STRING or TIMESTAMP. Casting an integer value N to TIMESTAMP produces a value that is N seconds past the start of the epoch date (January 1, 1970).

Related information: Mathematical Functions

USE Statement

By default, when you connect to an Impala instance through the impala-shell command, you begin in a database named default. Issue the statement USE db_name to switch to another database within an impala-shell session. The current database is where any CREATE TABLE, INSERT, SELECT, or other statements act when you specify a table without prefixing it with a database name.

Switching the default database is convenient in the following situations:

  • To avoid qualifying each reference to a table with the database name. For example, SELECT * FROM t1 JOIN t2 rather than SELECT * FROM db.t1 JOIN db.t2.
  • To do a sequence of operations all within the same database, such as creating a table, inserting data, and querying the table.

Examples:

See CREATE DATABASE Statement for examples covering CREATE DATABASE, USE, and DROP DATABASE.

VALUES Clause

The VALUES clause is a general-purpose way to specify all the columns of a row or multiple rows. You typically use the VALUES clause in an INSERT statement to specify all the column values for one or more rows as they are added to a table.

  Note: The INSERT ... VALUES technique is not suitable for loading large quantities of data into HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a separate data file. Use it for setting up small dimension tables or tiny amounts of data for experimenting with SQL syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations. Do not run scripts with thousands of INSERT ... VALUES statements that insert a single row each time. If you do run INSERT ... VALUES operations to load data into a staging table as one stage in an ETL pipeline, include multiple row values if possible within each VALUES clause, and use a separate database to make cleanup easier if the operation does produce many tiny files.

The following examples illustrate:

  • How to insert a single row using a VALUES clause.
  • How to insert multiple rows using a VALUES clause.
  • How the row or rows from a VALUES clause can be appended to a table through INSERT INTO, or replace the contents of the table through INSERT OVERWRITE.
  • How the entries in a VALUES clause can be literals, function results, or any other kind of expression.

[localhost:21000] > describe val_example;
Query: describe val_example
Query finished, fetching results ...
+-------+---------+---------+
| name  | type    | comment |
+-------+---------+---------+
| id    | int     |         |
| col_1 | boolean |         |
| col_2 | double  |         |
+-------+---------+---------+

[localhost:21000] > insert into val_example values (1,true,100.0);
Inserted 1 rows in 0.30s
[localhost:21000] > select * from val_example;
+----+-------+-------+
| id | col_1 | col_2 |
+----+-------+-------+
| 1  | true  | 100   |
+----+-------+-------+

[localhost:21000] > insert overwrite val_example values (10,false,pow(2,5)), (50,true,10/3);
Inserted 2 rows in 0.16s
[localhost:21000] > select * from val_example;
+----+-------+-------------------+
| id | col_1 | col_2             |
+----+-------+-------------------+
| 10 | false | 32                |
| 50 | true  | 3.333333333333333 |
+----+-------+-------------------+

When used in an INSERT statement, the Impala VALUES clause does not support specifying a subset of the columns in the table or specifying the columns in a different order. Use a VALUES clause with all the column values in the same order as the table definition, using NULL values for any columns you want to omit from the INSERT operation.

To use a VALUES clause like a table in other statements, wrap it in parentheses and use AS clauses to specify aliases for the entire object and any columns you need to refer to:

[localhost:21000] > select * from (values(4,5,6),(7,8,9)) as t;
+---+---+---+
| 4 | 5 | 6 |
+---+---+---+
| 4 | 5 | 6 |
| 7 | 8 | 9 |
+---+---+---+
[localhost:21000] > select * from (values(1 as c1, true as c2, 'abc' as c3),(100,false,'xyz')) as t;
+-----+-------+-----+
| c1  | c2    | c3  |
+-----+-------+-----+
| 1   | true  | abc |
| 100 | false | xyz |
+-----+-------+-----+

For example, you might use a tiny table constructed like this from constant literals or function return values as part of a longer statement involving joins or UNION ALL.

Views

Views are lightweight logical constructs that act as aliases for queries. You can specify a view name in a query (a SELECT statement or the SELECT portion of an INSERT statement) where you would usually specify a table name.

A view lets you:

  • Set up fine-grained security where a user can query some columns from a table but not other columns. See Controlling Access at the Column Level through Views for details.
  • Issue complicated queries with compact and simple syntax:
    -- Take a complicated reporting query, plug it into a CREATE VIEW statement...
    create view v1 as select c1, c2, avg(c3) from t1 group by c1, c2 order by c1 desc limit 10;
    -- ... and now you can produce the report with 1 line of code.
    select * from v1;
    
  • Reduce maintenance, by avoiding the duplication of complicated queries across multiple applications in multiple languages:
    create view v2 as select t1.c1, t1.c2, t2.c3 from t1 join t2 on (t1.id = t2.id);
    -- This simple query is safer to embed in reporting applications than the longer query above.
    -- The view definition can remain stable even if the structure of the underlying tables changes.
    select c1, c2, c3 from v2;
    
  • Build a new, more refined query on top of the original query by adding new clauses, select-list expressions, function calls, and so on:
    create view average_price_by_category as select category, avg(price) as avg_price from products group by category;
    create view expensive_categories as select category, avg_price from average_price_by_category order by avg_price desc limit 10000;
    create view top_10_expensive_categories as select category, avg_price from expensive_categories limit 10;
    
    This technique lets you build up several more or less granular variations of the same query, and switch between them when appropriate.
  • Set up aliases with intuitive names for tables, columns, result sets from joins, and so on:
    -- The original tables might have cryptic names inherited from a legacy system.
    create view action_items as select rrptsk as assignee, treq as due_date, dmisc as notes from vxy_t1_br;
    -- You can keep the original names for compatibility, and build new applications using the more intuitive ones.
    select assignee, due_date, notes from action_items;
    
  • Swap tables with others that use different file formats, partitioning schemes, and so on without any downtime for data copying or conversion:
    create table slow (x int, s string) stored as textfile;
    create view report as select s from slow where x between 20 and 30;
    -- The query is slow due to the inefficient table definition, but it works.
    select * from report;
    
    create table fast (s string) partitioned by (x int) stored as parquetfile;
    -- ...Copy data from SLOW to FAST. Queries against REPORT view continue to work...
    
    -- After changing the view definition, queries will be faster due to partitioning,
    -- binary format, and compression in the new table.
    alter view report as select s from fast where x between 20 and 30;
    select * from report;
    
  • Avoid coding lengthy subqueries and repeating the same subquery text in many other queries.
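
A minimal sketch of the column-level security idea from the first item in this list. Users who are allowed to query the view, but not the underlying table, see only the columns the view exposes. The view name here is hypothetical; CUSTOMER matches the table used in the examples below:

create view customers_public as select c_first_name, c_last_name from customer;
-- The other columns of CUSTOMER are never visible through this view.
select * from customers_public limit 10;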

The SQL statements that configure views are CREATE VIEW Statement, ALTER VIEW Statement, and DROP VIEW Statement. You can specify view names when querying data (SELECT Statement) and copying data from one table to another (INSERT Statement). The WITH clause creates an inline view that exists only for the duration of a single query.

[localhost:21000] > create view trivial as select * from customer;
[localhost:21000] > create view some_columns as select c_first_name, c_last_name, c_login from customer;
[localhost:21000] > select * from some_columns limit 5;
Query finished, fetching results ...
+--------------+-------------+---------+
| c_first_name | c_last_name | c_login |
+--------------+-------------+---------+
| Javier       | Lewis       |         |
| Amy          | Moses       |         |
| Latisha      | Hamilton    |         |
| Michael      | White       |         |
| Robert       | Moran       |         |
+--------------+-------------+---------+
[localhost:21000] > create view ordered_results as select * from some_columns order by c_last_name desc, c_first_name desc limit 1000;
[localhost:21000] > select * from ordered_results limit 5;
Query: select * from ordered_results limit 5
Query finished, fetching results ...
+--------------+-------------+---------+
| c_first_name | c_last_name | c_login |
+--------------+-------------+---------+
| Thomas       | Zuniga      |         |
| Sarah        | Zuniga      |         |
| Norma        | Zuniga      |         |
| Lloyd        | Zuniga      |         |
| Lisa         | Zuniga      |         |
+--------------+-------------+---------+
Returned 5 row(s) in 0.48s

The previous example uses descending order for ORDERED_RESULTS because in the sample TPC-DS data, some rows have empty strings for both C_FIRST_NAME and C_LAST_NAME, making the lowest-ordered names unhelpful in a sample query.

The following example chains several views together, each one refining the query of the one before it, using the granular-variations technique shown earlier:

create view visitors_by_day as select day, count(distinct visitors) as howmany from web_traffic group by day;
create view busiest_days as select day, howmany from visitors_by_day order by howmany desc;
create view top_10_days as select day, howmany from busiest_days limit 10;
select * from top_10_days;

To see the definition of a view, issue a DESCRIBE FORMATTED statement, which shows the query from the original CREATE VIEW statement:
[localhost:21000] > create view v1 as select * from t1;
[localhost:21000] > describe formatted v1;
Query finished, fetching results ...
+------------------------------+------------------------------+----------------------+
| name                         | type                         | comment              |
+------------------------------+------------------------------+----------------------+
| # col_name                   | data_type                    | comment              |
|                              | NULL                         | NULL                 |
| x                            | int                          | None                 |
| y                            | int                          | None                 |
| s                            | string                       | None                 |
|                              | NULL                         | NULL                 |
| # Detailed Table Information | NULL                         | NULL                 |
| Database:                    | views                        | NULL                 |
| Owner:                       | cloudera                     | NULL                 |
| CreateTime:                  | Mon Jul 08 15:56:27 EDT 2013 | NULL                 |
| LastAccessTime:              | UNKNOWN                      | NULL                 |
| Protect Mode:                | None                         | NULL                 |
| Retention:                   | 0                            | NULL                 |
| Table Type:                  | VIRTUAL_VIEW                 | NULL                 |
| Table Parameters:            | NULL                         | NULL                 |
|                              | transient_lastDdlTime        | 1373313387           |
|                              | NULL                         | NULL                 |
| # Storage Information        | NULL                         | NULL                 |
| SerDe Library:               | null                         | NULL                 |
| InputFormat:                 | null                         | NULL                 |
| OutputFormat:                | null                         | NULL                 |
| Compressed:                  | No                           | NULL                 |
| Num Buckets:                 | 0                            | NULL                 |
| Bucket Columns:              | []                           | NULL                 |
| Sort Columns:                | []                           | NULL                 |
|                              | NULL                         | NULL                 |
| # View Information           | NULL                         | NULL                 |
| View Original Text:          | SELECT * FROM t1             | NULL                 |
| View Expanded Text:          | SELECT * FROM t1             | NULL                 |
+------------------------------+------------------------------+----------------------+
Returned 29 row(s) in 0.05s

Restrictions:

  • You cannot insert into an Impala view. (In some database systems, this operation is allowed and inserts rows into the base table.) You can use a view name on the right-hand side of an INSERT statement, in the SELECT part.
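
    For example, using the SOME_COLUMNS view defined earlier, the view can supply rows for an INSERT even though it cannot be the target of one. The CUSTOMER_NAMES table here is hypothetical:

    -- Not allowed: a view cannot be the target of an INSERT.
    -- insert into some_columns values ('a', 'b', 'c');
    -- Allowed: the view appears in the SELECT part that feeds the INSERT.
    insert into customer_names select c_first_name, c_last_name, c_login from some_columns;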

WITH Clause

A clause that can be added before a SELECT statement, to define aliases for complicated expressions that are referenced multiple times within the body of the SELECT. Similar to CREATE VIEW, except that the table and column names defined in the WITH clause do not persist after the query finishes, and do not conflict with names used in actual tables or views. Also known as "subquery factoring".

You can rewrite a query that uses a WITH clause as an equivalent query with subqueries, and get the same result. The purposes of the WITH clause are:

  • Convenience and ease of maintenance from less repetition within the body of the query. Typically used with queries involving UNION, joins, or aggregation functions, where similar complicated expressions are referenced multiple times.
  • SQL code that is easier to read and understand by abstracting the most complex part of the query into a separate block.
  • Improved compatibility with SQL from other database systems that support the same clause (primarily Oracle Database).
      Note:

    The Impala WITH clause does not support recursive queries, a feature that some other database systems provide.

Standards compliance: Introduced in SQL:1999.

Examples:

-- Define 2 subqueries that can be referenced from the body of a longer query.
with t1 as (select 1), t2 as (select 2) insert into tab select * from t1 union all select * from t2;

-- Define one subquery at the outer level, and another at the inner level as part of the
-- initial stage of the UNION ALL query.
with t1 as (select 1) (with t2 as (select 2) select * from t2) union all select * from t1;
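
-- One more sketch, of the aggregation use case mentioned above: the WITH clause
-- computes a per-category average once, and the main query refers to it from
-- both the select list and the WHERE clause. The PRODUCTS table and its columns
-- are hypothetical, not part of the original examples.
with category_avg as
  (select category, avg(price) as avg_price from products group by category)
select p.name, p.price, c.avg_price
  from products p join category_avg c on (p.category = c.category)
  where p.price > 2 * c.avg_price;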