This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Schema Objects and Object Names

With Impala, you work schema objects that are familiar to database users: primarily databases, tables, views, and functions. The SQL syntax to work with these objects is explained in SQL Statements. This section explains the conceptual knowledge you need to work with these objects and the various ways to specify their names.

Within a table, partitions can also be considered a kind of object. Partitioning is an important subject for Impala, with its own documentation section covering use cases and performance considerations. See Partitioning for details.

Impala does not have a counterpart of the "tablespace" notion from some database systems. By default, all the data files for a database, table, or partition are located within nested folders within the HDFS file system. You can also specify a particular HDFS location for a given Impala table or partition. The raw data for these objects is represented as a collection of data files, providing the flexibility to load data by simply moving files into the expected HDFS location.

Information about the schema objects is held in the metastore database. This database is shared between Impala and Hive, allowing each to create, drop, and query each other's databases, tables, and so on. When Impala makes a change to schema objects through a CREATE, ALTER, DROP, INSERT, or LOAD DATA statement, it broadcasts those changes to all nodes in the cluster through the catalog service. When you make such changes through Hive or directly through manipulating HDFS files, you use the REFRESH or INVALIDATE METADATA statements on the Impala side to recognize the newly loaded data, new tables, and so on.

Continue reading:

Aliases

When you write the names of tables, columns, or column expressions in a query, you can assign an alias at the same time. Then you can specify the alias rather than the original name when making other references to the table or column in the same statement. You typically specify aliases that are shorter, easier to remember, or both than the original names. The aliases are printed in the query header, making them useful for self-documenting output.

To set up an alias, add the AS alias clause immediately after any table, column, or expression name in the SELECT list or FROM list of a query. The AS keyword is optional; you can also specify the alias immediately after the original name.

To use an alias name that matches one of the Impala reserved keywords (listed in Appendix C - Impala Reserved Words), surround the identifier with either single or double quotation marks, or `` characters (backticks).

select c1 as name, c2 as address, c3 as phone from table_with_terse_columns;
select sum(ss_xyz_dollars_net) as total_sales from table_with_cryptic_columns;
select one.name, two.address, three.phone from
  census one, building_directory two, phonebook three
  where one.id = two.id and two.id = three.id;

Aliases follow the same rules as identifiers when it comes to case insensitivity. Aliases can be longer than identifiers (up to the maximum length of a Java string) and can include additional characters such as spaces and dashes when they are quoted using backtick characters.

Alternatives:

Another way to define different names for the same tables or columns is to create views. See Views for details.

Identifiers

Identifiers are the names of databases, tables, or columns that you specify in a SQL statement. The rules for identifiers govern what names you can give to things you create, the notation for referring to names containing unusual characters, and other aspects such as case sensitivity.

  • The minimum length of an identifier is 1 character.
  • The maximum length of an identifier is currently 128 characters, enforced by the metastore database.
  • An identifier must start with an alphabetic character. The remainder can contain any combination of alphanumeric characters and underscores. Quoting the identifier with backticks has no effect on the allowed characters in the name.
  • An identifier can contain only ASCII characters.
  • To use an identifier name that matches one of the Impala reserved keywords (listed in Appendix C - Impala Reserved Words), surround the identifier with `` characters (backticks).
  • Impala identifiers are always case-insensitive. That is, tables named t1 and T1 always refer to the same table, regardless of quote characters. Internally, Impala always folds all specified table and column names to lowercase. This is why the column headers in query output are always displayed in lowercase.

See Aliases for how to define shorter or easier-to-remember aliases if the original names are long or cryptic identifiers. Aliases follow the same rules as identifiers when it comes to case insensitivity. Aliases can be longer than identifiers (up to the maximum length of a Java string) and can include additional characters such as spaces and dashes when they are quoted using backtick characters.

Another way to define different names for the same tables or columns is to create views. See Views for details.

Databases

In Impala, a database is a logical container for a group of tables. Each database defines a separate namespace. Within a database, you can refer to the tables inside it using their unqualified names. Different databases can contain tables with identical names.

Creating a database is a lightweight operation. There are no database-specific properties to configure. Therefore, there is no ALTER DATABASE

Typically, you create a separate database for each project or application, to avoid naming conflicts between tables and to make clear which tables are related to each other.

Each database is physically represented by a directory in HDFS.

There is a special database, named default, where you begin when you connect to Impala. Tables created in default are physically located one level higher in HDFS than all the user-created databases.

Related statements: CREATE DATABASE Statement, DROP DATABASE Statement, USE Statement

Tables

Tables are the primary containers for data in Impala. They have the familiar row and column layout similar to other database systems, plus some features such as partitioning often associated with higher-end data warehouse systems.

Logically, each table has a structure based on the definition of its columns, partitions, and other properties.

Physically, each table is associated with a directory in HDFS. The table data consists of all the data files underneath that directory:

  • Internal tables, managed by Impala, use directories inside the designated Impala work area.
  • External tables use arbitrary HDFS directories, where the data files are typically shared between different Hadoop components.
  • Large-scale data is usually handled by partitioned tables, where the data files are divided among different HDFS subdirectories.

Related statements: CREATE TABLE Statement, DROP TABLE Statement, ALTER TABLE Statement INSERT Statement, LOAD DATA Statement, SELECT Statement

Internal Tables

The default kind of table produced by the CREATE TABLE statement is known as an internal table. (Its counterpart is the external table, produced by the CREATE EXTERNAL TABLE syntax.)

  • Impala creates a directory in HDFS to hold the data files.
  • You load data by issuing INSERT statements in impala-shell or by using the LOAD DATA statement in Hive.
  • When you issue a DROP TABLE statement, Impala physically removes all the data files from the directory.

External Tables

The syntax CREATE EXTERNAL TABLE sets up an Impala table that points at existing data files, potentially in HDFS locations outside the normal Impala data directories.. This operation saves the expense of importing the data into a new table when you already have the data files in a known location in HDFS, in the desired file format.

  • You can use Impala to query the data in this table.
  • If you add or replace data using HDFS operations, issue the REFRESH command in impala-shell so that Impala recognizes the changes in data files, block locations, and so on.
  • When you issue a DROP TABLE statement in Impala, that removes the connection that Impala has with the associated data files, but does not physically remove the underlying data. You can continue to use the data files with other Hadoop components and HDFS operations.

Views

Views are lightweight logical constructs that act as aliases for queries. You can specify a view name in a query (a SELECT statement or the SELECT portion of an INSERT statement) where you would usually specify a table name.

A view lets you:

  • Set up fine-grained security where a user can query some columns from a table but not other columns. See Controlling Access at the Column Level through Views for details.
  • Issue complicated queries with compact and simple syntax:
    -- Take a complicated reporting query, plug it into a CREATE VIEW statement...
    create view v1 as select c1, c2, avg(c3) from t1 group by c3 order by c1 desc limit 10;
    -- ... and now you can produce the report with 1 line of code.
    select * from v1;
  • Reduce maintenance, by avoiding the duplication of complicated queries across multiple applications in multiple languages:
    create view v2 as select t1.c1, t1.c2, t2.c3 from t1 join t2 on (t1.id = t2.id);
    -- This simple query is safer to embed in reporting applications than the longer query above.
    -- The view definition can remain stable even if the structure of the underlying tables changes.
    select c1, c2, c3 from v2;
  • Build a new, more refined query on top of the original query by adding new clauses, select-list expressions, function calls, and so on:
    create view average_price_by_category as select category, avg(price) as avg_price from products group by category;
    create view expensive_categories as select category, avg_price from average_price_by_category order by avg_price desc limit 10000;
    create view top_10_expensive_categories as select category, avg_price from expensive_categories limit 10;
    This technique lets you build up several more or less granular variations of the same query, and switch between them when appropriate.
  • Set up aliases with intuitive names for tables, columns, result sets from joins, and so on:
    -- The original tables might have cryptic names inherited from a legacy system.
    create view action_items as select rrptsk as assignee, treq as due_date, dmisc as notes from vxy_t1_br;
    -- You can leave original names for compatibility, build new applications using more intuitive ones.
    select assignee, due_date, notes from action_items;
  • Swap tables with others that use different file formats, partitioning schemes, and so on without any downtime for data copying or conversion:
    create table slow (x int, s string) stored as textfile;
    create view report as select s from slow where x between 20 and 30;
    -- Query is kind of slow due to inefficient table definition, but it works.
    select * from report;
    
    create table fast (s string) partitioned by (x int) stored as parquet;
    -- ...Copy data from SLOW to FAST. Queries against REPORT view continue to work...
    
    -- After changing the view definition, queries will be faster due to partitioning,
    -- binary format, and compression in the new table.
    alter view report as select s from fast where x between 20 and 30;
    select * from report;
  • Avoid coding lengthy subqueries and repeating the same subquery text in many other queries.

The SQL statements that configure views are CREATE VIEW Statement, ALTER VIEW Statement, and DROP VIEW Statement. You can specify view names when querying data (SELECT Statement) and copying data from one table to another (INSERT Statement). The WITH clause creates an inline view, that only exists for the duration of a single query.

[localhost:21000] > create view trivial as select * from customer;
[localhost:21000] > create view some_columns as select c_first_name, c_last_name, c_login from customer;
[localhost:21000] > select * from some_columns limit 5;
Query finished, fetching results ...
+--------------+-------------+---------+
| c_first_name | c_last_name | c_login |
+--------------+-------------+---------+
| Javier       | Lewis       |         |
| Amy          | Moses       |         |
| Latisha      | Hamilton    |         |
| Michael      | White       |         |
| Robert       | Moran       |         |
+--------------+-------------+---------+
[localhost:21000] > create view ordered_results as select * from some_columns order by c_last_name desc, c_first_name desc limit 1000;
[localhost:21000] > select * from ordered_results limit 5;
Query: select * from ordered_results limit 5
Query finished, fetching results ...
+--------------+-------------+---------+
| c_first_name | c_last_name | c_login |
+--------------+-------------+---------+
| Thomas       | Zuniga      |         |
| Sarah        | Zuniga      |         |
| Norma        | Zuniga      |         |
| Lloyd        | Zuniga      |         |
| Lisa         | Zuniga      |         |
+--------------+-------------+---------+
Returned 5 row(s) in 0.48s

The previous example uses descending order for ORDERED_RESULTS because in the sample TPCD-H data, there are some rows with empty strings for both C_FIRST_NAME and C_LAST_NAME, making the lowest-ordered names unuseful in a sample query.

create view visitors_by_day as select day, count(distinct visitors) as howmany from web_traffic group by day;
create view busiest_days as select day, howmany from visitors_by_day order by howmany desc;
create view top_10_days as select day, howmany from busiest_days limit 10;
select * from top_10_days;
To see the definition of a view, issue a DESCRIBE FORMATTED statement, which shows the query from the original CREATE VIEW statement:
[localhost:21000] > create view v1 as select * from t1;
[localhost:21000] > describe formatted v1;
Query finished, fetching results ...
+------------------------------+------------------------------+----------------------+
| name                         | type                         | comment              |
+------------------------------+------------------------------+----------------------+
| # col_name                   | data_type                    | comment              |
|                              | NULL                         | NULL                 |
| x                            | int                          | None                 |
| y                            | int                          | None                 |
| s                            | string                       | None                 |
|                              | NULL                         | NULL                 |
| # Detailed Table Information | NULL                         | NULL                 |
| Database:                    | views                        | NULL                 |
| Owner:                       | cloudera                     | NULL                 |
| CreateTime:                  | Mon Jul 08 15:56:27 EDT 2013 | NULL                 |
| LastAccessTime:              | UNKNOWN                      | NULL                 |
| Protect Mode:                | None                         | NULL                 |
| Retention:                   | 0                            | NULL                 |
| Table Type:                  | VIRTUAL_VIEW                 | NULL                 |
| Table Parameters:            | NULL                         | NULL                 |
|                              | transient_lastDdlTime        | 1373313387           |
|                              | NULL                         | NULL                 |
| # Storage Information        | NULL                         | NULL                 |
| SerDe Library:               | null                         | NULL                 |
| InputFormat:                 | null                         | NULL                 |
| OutputFormat:                | null                         | NULL                 |
| Compressed:                  | No                           | NULL                 |
| Num Buckets:                 | 0                            | NULL                 |
| Bucket Columns:              | []                           | NULL                 |
| Sort Columns:                | []                           | NULL                 |
|                              | NULL                         | NULL                 |
| # View Information           | NULL                         | NULL                 |
| View Original Text:          | SELECT * FROM t1             | NULL                 |
| View Expanded Text:          | SELECT * FROM t1             | NULL                 |
+------------------------------+------------------------------+----------------------+
Returned 29 row(s) in 0.05s

Restrictions:

  • You cannot insert into an Impala view. (In some database systems, this operation is allowed and inserts rows into the base table.) You can use a view name on the right-hand side of an INSERT statement, in the SELECT part.
  • If a view applies to a partitioned table, any partition pruning is determined by the clauses in the original query. Impala does not prune additional columns if the query on the view includes extra WHERE clauses referencing the partition key columns.

Related statements: CREATE VIEW Statement, ALTER VIEW Statement, DROP VIEW Statement

Functions

Functions let you apply arithmetic, string, or other computations and transformations to Impala data. You typically use them in SELECT lists and WHERE clauses to filter and format query results so that the result set is exactly what you want, with no further processing needed on the application side.

Scalar functions return a single result for each input row. See Built-in Functions.

Aggregate functions combine the results from multiple rows. See Aggregate Functions.

User-defined functions let you code your own logic. They can be either scalar or aggregate functions. See User-Defined Functions (UDFs).

Related statements: CREATE FUNCTION Statement, DROP FUNCTION Statement

Page generated September 3, 2015.