This is the documentation for Cloudera 5.2.x.
Documentation for other versions is available at Cloudera Documentation.

New Features in Impala

New Features in Impala Version 2.0.0 / CDH 5.2.0

The following are the major new features in Impala 2.0. This major release, available both with CDH 5.2 and for CDH 4, contains improvements to performance, scalability, security, and SQL syntax.

  • Queries no longer fail with an out-of-memory error due to joins or aggregation functions involving high volumes of data. When the required memory for the intermediate result set exceeds the amount available on a particular node, the query automatically uses a temporary work area on disk. This "spill to disk" mechanism is similar to the ORDER BY improvement from Impala 1.4. For details, see SQL Operations that Spill to Disk.

  • Subquery enhancements:
    • Subqueries are now allowed in the WHERE clause, for example with the IN operator.
    • The EXISTS and NOT EXISTS operators are available. They are always used in conjunction with subqueries.
    • The IN and NOT IN queries can now operate on the result set from a subquery, not just a hardcoded list of values.
    • Uncorrelated subqueries let you compare against one or more values for equality, IN, and EXISTS comparisons. For example, you might use WHERE clauses such as WHERE column = (SELECT MAX(some_other_column) FROM table) or WHERE column IN (SELECT some_other_column FROM table WHERE conditions).
    • Correlated subqueries let you cross-reference values from the outer query block and the subquery.
    • Scalar subqueries let you substitute the result of single-value aggregate functions such as MAX(), MIN(), COUNT(), or AVG(), where you would normally use a numeric value in a WHERE clause.

    For details about subqueries, see Subqueries. For information about new and improved operators, see EXISTS Operator and IN Operator.
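
The subquery forms listed above can be tried with any engine that accepts standard SQL. The following sketch uses Python's built-in sqlite3 module as a stand-in for Impala, only so that the queries run without a cluster. The customers and orders tables and their columns are hypothetical, and SQLite's planner and type system differ from Impala's.

```python
import sqlite3

# In-memory SQLite database standing in for Impala (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders VALUES (1, 50), (1, 75), (2, 20);
""")

# Uncorrelated subquery feeding the IN operator.
in_rows = conn.execute(
    "SELECT name FROM customers "
    "WHERE id IN (SELECT customer_id FROM orders WHERE amount > 30) "
    "ORDER BY name"
).fetchall()

# Correlated subquery with EXISTS: the inner query refers to the outer row (c.id).
exists_rows = conn.execute(
    "SELECT name FROM customers c "
    "WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id) "
    "ORDER BY name"
).fetchall()

# Scalar subquery: a single-value aggregate used where a number is expected.
scalar_rows = conn.execute(
    "SELECT customer_id, amount FROM orders "
    "WHERE amount = (SELECT MAX(amount) FROM orders)"
).fetchall()

print(in_rows, exists_rows, scalar_rows)
```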

  • Analytic functions such as RANK(), LAG(), LEAD(), and FIRST_VALUE() let you analyze sequences of rows with flexible ordering and grouping. Existing aggregate functions such as MAX(), SUM(), and COUNT() can also be used in an analytic context. See Analytic Functions for details. See Impala Aggregate Functions for enhancements to existing aggregate functions.

  • New data types provide greater compatibility with source code from traditional database systems:

  • Security enhancements:

    The new security-related SQL statements work along with the Sentry authorization framework. See Enabling Sentry Authorization for Impala for details.

  • Impala can now read text files compressed by gzip, bzip2, or Snappy. These files do not require any special table settings to work in an Impala text table. Impala recognizes the compression type automatically based on file extensions of .gz, .bz2, and .snappy respectively. These types of compressed text files are intended for convenience with existing ETL pipelines. Their non-splittable nature means they are not optimal for high-performance parallel queries. See Using gzip, bzip2, or Snappy-Compressed Text Files for details.

  • Query hints can now use comment notation, /* +hint_name */ or -- +hint_name, at the same places in the query where the hints enclosed by [ ] are recognized. This enhancement makes it easier to reuse Impala queries on other database systems. See Hints for details.

  • A new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries.

    The behavior of the --idle_query_timeout configuration option is extended. If no QUERY_TIMEOUT_S query option is in effect, --idle_query_timeout works the same as before, setting the timeout interval. When the QUERY_TIMEOUT_S query option is specified, its maximum value is capped by the value of the --idle_query_timeout option.

    That is, the system administrator sets the default and maximum timeout through the --idle_query_timeout startup option, and then individual users or applications can set a lower timeout value if desired through the QUERY_TIMEOUT_S query option. See Setting Timeout Periods for Daemons, Queries, and Sessions and QUERY_TIMEOUT_S for details.
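
The interaction between the two settings can be summarized as a small function. This is a sketch of the rule as described above, not Impala's code: the function name is hypothetical, and it assumes that a value of 0 (or no value) means "not set" for either setting, so that an unset --idle_query_timeout imposes no cap.

```python
from typing import Optional


def effective_timeout_s(idle_query_timeout: int,
                        query_timeout_s: Optional[int] = None) -> int:
    """Return the timeout, in seconds, that would apply to one query.

    Assumption: 0 or None means "not set" for either value.
    """
    if not query_timeout_s:
        # No per-query option: the startup option works as before.
        return idle_query_timeout
    if not idle_query_timeout:
        # Assumed: with no startup limit there is nothing to cap against.
        return query_timeout_s
    # The per-query value can lower the timeout but never raise it.
    return min(query_timeout_s, idle_query_timeout)


print(effective_timeout_s(600))        # only the startup option is set
print(effective_timeout_s(600, 120))   # user asks for a lower timeout
print(effective_timeout_s(600, 900))   # user asks for more than the cap
```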

  • New functions VAR_SAMP() and VAR_POP() are aliases for the existing VARIANCE_SAMP() and VARIANCE_POP() functions.

  • A new date and time function, DATE_PART(), provides similar functionality to EXTRACT(). You can also call the EXTRACT() function using the SQL-99 syntax, EXTRACT(unit FROM timestamp). These enhancements simplify the porting process for date-related code from other systems. See Impala Date and Time Functions for details.

  • New approximation features provide a fast way to get results when absolute precision is not required:

    • The APPX_COUNT_DISTINCT query option lets Impala rewrite COUNT(DISTINCT) calls to use NDV() instead, which speeds up the operation and allows multiple COUNT(DISTINCT) operations in a single query. See APPX_COUNT_DISTINCT for details.
    • The APPX_MEDIAN() aggregate function produces an estimate for the median value of a column by using sampling. See APPX_MEDIAN Function for details.
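
To illustrate why an approximate distinct count can be cheaper than an exact COUNT(DISTINCT), the sketch below estimates cardinality while remembering only a fixed number of hash values. It uses a K-minimum-values estimator, chosen here for brevity; it is not the algorithm behind NDV(), and the function name and the parameter k are hypothetical.

```python
import hashlib
import heapq


def approx_distinct(values, k=1024):
    """Estimate the number of distinct values while keeping only k hashes."""
    smallest = []     # max-heap (stored negated) of the k smallest hashes seen
    members = set()   # hashes currently in the heap, used to skip duplicates
    for v in values:
        digest = hashlib.sha1(repr(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
        if h in members:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -h)
            members.add(h)
        elif h < -smallest[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(smallest, -h)
            members.discard(evicted)
            members.add(h)
    if len(smallest) < k:
        return len(smallest)  # fewer than k distinct hashes: the count is exact
    kth_smallest = -smallest[0]
    return int((k - 1) / kth_smallest)


data = [i % 5000 for i in range(20000)]
print(approx_distinct(data), "estimated vs", len(set(data)), "exact")
```

Memory use is bounded by k regardless of how many distinct values the input holds, which is the trade-off that makes several such estimates in one query practical.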

  • Impala now supports a DECODE() function. This function works as a shorthand for a CASE expression, and improves compatibility with SQL code containing vendor extensions. See Impala Conditional Functions for details.

  • The STDDEV(), STDDEV_POP(), STDDEV_SAMP(), VARIANCE(), VARIANCE_POP(), VARIANCE_SAMP(), and NDV() aggregate functions now all return DOUBLE results rather than STRING. Formerly, you were required to CAST() the result to a numeric type before using it in arithmetic operations.

  • The default settings for Parquet block size, and the associated PARQUET_FILE_SIZE query option, are changed. Now, Impala writes Parquet files with a size of 256 MB and an HDFS block size of 256 MB. Previously, Impala attempted to write Parquet files with a size of 1 GB and an HDFS block size of 1 GB. In practice, Impala used a conservative estimate of the disk space needed for each Parquet block, leading to files that were typically 512 MB anyway. Thus, this change will make the file size more accurate if you specify a value for the PARQUET_FILE_SIZE query option. It also reduces the amount of memory reserved during INSERT into Parquet tables, potentially avoiding out-of-memory errors and improving scalability when inserting data into Parquet tables.

  • Anti-joins are now supported, expressed using the LEFT ANTI JOIN and RIGHT ANTI JOIN clauses. These clauses return results from one table that have no match in the other table. You might use this type of join in the same sorts of use cases as the NOT EXISTS and NOT IN operators. See Joins for details.

  • The SET command in impala-shell has been promoted to a real SQL statement. You can now set query options such as PARQUET_FILE_SIZE, MEM_LIMIT, and SYNC_DDL within JDBC, ODBC, or any other kind of application that submits SQL without going through the impala-shell interpreter. See SET Statement for details.

  • The impala-shell interpreter now reads settings from an optional configuration file, named $HOME/.impalarc by default. See impala-shell Configuration Options for details.
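
As a rough illustration of the anti-join item in this list, the sketch below returns the rows of one table that have no match in another. It runs on Python's built-in sqlite3 module, which has no ANTI JOIN clause, so the same result is expressed with NOT EXISTS and with an outer join filtered on NULL. The tables t1 and t2 are hypothetical, and SQLite is only a stand-in for Impala.

```python
import sqlite3

# In-memory SQLite database standing in for Impala (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3), (4);
    INSERT INTO t2 VALUES (2), (4), (5);
""")

# Rows of t1 with no matching row in t2, written with NOT EXISTS.
not_exists = conn.execute(
    "SELECT id FROM t1 WHERE NOT EXISTS "
    "(SELECT 1 FROM t2 WHERE t2.id = t1.id) ORDER BY id"
).fetchall()

# The same rows, written as an outer join that keeps only unmatched rows.
outer_join = conn.execute(
    "SELECT t1.id FROM t1 LEFT JOIN t2 ON t1.id = t2.id "
    "WHERE t2.id IS NULL ORDER BY t1.id"
).fetchall()

print(not_exists, outer_join)
```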

New Features in Impala Version 1.4.2 / CDH 5.1.3

Impala 1.4.2 is purely a bug-fix release. It does not include any new features.

  Note: Impala 1.4.2 is only available as part of CDH 5.1.3, not under CDH 4.

New Features in Impala Version 1.4.1 / CDH 5.1.2

Impala 1.4.1 is purely a bug-fix release. It does not include any new features.

New Features in Impala Version 1.4.0 / CDH 5.1.0

  • The DECIMAL data type lets you store fixed-precision values, for working with currency or other fractional values where it is important to represent values exactly and avoid rounding errors. This feature includes enhancements to built-in functions, numeric literals, and arithmetic expressions. See DECIMAL Data Type (CDH 5.1 or later only) for details.

  • On CDH 5, Impala can take advantage of the HDFS caching feature to "pin" entire tables or individual partitions in memory, to speed up queries on frequently accessed data and reduce the CPU overhead of memory-to-memory copying. When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory. Other Hadoop components that read the same data files also experience a performance benefit.

    For background information about HDFS caching in CDH, see the CDH 5 Installation Guide. For performance information about using this feature with Impala, see Using HDFS Caching with Impala (CDH 5.1 or later only). For the SET CACHED and SET UNCACHED clauses that let you control cached table data through DDL statements, see CREATE TABLE Statement and ALTER TABLE Statement.

  • Impala can now use Sentry-based authorization based either on the original policy file, or on rules defined by GRANT and REVOKE statements issued through Hive. See Enabling Sentry Authorization for Impala for details.

  • For interoperability with Parquet files created through other Hadoop components, such as Pig or MapReduce jobs, you can create an Impala table that automatically sets up the column definitions based on the layout of an existing Parquet data file. See CREATE TABLE Statement for the syntax, and Creating Parquet Tables in Impala for usage information.

  • ORDER BY queries no longer require a LIMIT clause. If the size of the result set to be sorted exceeds the memory available to Impala, Impala uses a temporary work space on disk to perform the sort operation. See ORDER BY Clause for details.

  • LDAP connections can be secured through either SSL or TLS. See Enabling LDAP Authentication for Impala for details.

  • The following new built-in scalar and aggregate functions are available:

    • A new built-in function, EXTRACT(), returns one date or time field from a TIMESTAMP value. See Impala Date and Time Functions for details.

    • A new built-in function, TRUNC(), truncates date/time values to a particular granularity, such as year, month, day, hour, and so on. See Impala Date and Time Functions for details.

    • A new built-in function, ADD_MONTHS(), is an alias for the existing MONTHS_ADD() function. See Impala Date and Time Functions for details.

    • A new built-in function, ROUND(), rounds DECIMAL values to a specified number of fractional digits. See Impala Mathematical Functions for details.

    • Several built-in aggregate functions for computing properties for statistical distributions: STDDEV(), STDDEV_SAMP(), STDDEV_POP(), VARIANCE(), VARIANCE_SAMP(), and VARIANCE_POP(). See STDDEV, STDDEV_SAMP, STDDEV_POP Functions and VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP Functions for details.

    • Several new built-in functions, such as MAX_INT(), MIN_SMALLINT(), and so on, let you conveniently check whether data values are in an expected range. You might be able to switch a column to a smaller type, saving memory during processing. See Impala Mathematical Functions for details.

    • New built-in functions, IS_INF() and IS_NAN(), check for the special values infinity and "not a number". These values could be specified as inf or nan in text data files, or be produced by certain arithmetic expressions. See Impala Mathematical Functions for details.

  • The SHOW PARTITIONS statement displays information about the structure of a partitioned table. See SHOW Statement for details.

  • New configuration options for the impalad daemon let you specify initial memory usage for all queries. The initial resource requests handled by Llama and YARN can be expanded later if needed, avoiding unnecessary over-allocation and reducing the chance of out-of-memory conditions. See Integrated Resource Management with YARN for details.

  • Impala can take advantage of the Llama high availability feature in CDH 5.1, for improved reliability of resource management through YARN. See Using Impala with a Llama High Availability Configuration for details.

  • The Impala CREATE TABLE statement now has a STORED AS AVRO clause, allowing you to create Avro tables through Impala. See Using the Avro File Format with Impala Tables for details and examples.

  • New impalad configuration options let you fine-tune the calculations Impala makes to estimate resource requirements for each query. These options can help avoid overconsumption caused by too-low estimates, or underutilization caused by too-high estimates. See Integrated Resource Management with YARN for details.

  • A new SUMMARY command in the impala-shell interpreter provides a high-level summary of the work performed at each stage of the explain plan. The summary is also included in output from the PROFILE command. See impala-shell Command Reference and Using the SUMMARY Report for Performance Tuning for details.

  • Performance improvements for the COMPUTE STATS statement:

    • The NDV function runs faster because of native code generation.
    • Because the NULL count is not currently used by the Impala query planner, in Impala 1.4.0 and higher, COMPUTE STATS does not count the NULL values for each column. (The #Nulls field of the stats table is left as -1, signifying that the value is unknown.)

    See COMPUTE STATS Statement for general details about the COMPUTE STATS statement, and How Impala Uses Statistics for Query Optimization for how to use the statistics to improve query performance.

  • Performance improvements for partition pruning. This feature reduces the time spent in query planning, for partitioned tables with thousands of partitions. Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, now Impala can comfortably handle tables with tens of thousands of partitions. See Partition Pruning for Queries for information about partition pruning.

  • The documentation provides additional guidance for planning tasks. See Planning for Impala Deployment. In particular, see Cluster Sizing Guidelines for Impala before you purchase or repurpose hardware for a cluster to run Impala.

  • The impala-shell interpreter now supports UTF-8 characters for input and output. You can control whether impala-shell ignores invalid Unicode code points through the --strict_unicode option. (This option is removed in Impala 2.0.)
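
The range-checking functions in this list (MAX_INT(), MIN_SMALLINT(), and so on) support one particular check: whether a column's values would fit in a smaller integer type. The sketch below shows that check in Python with the signed ranges written out by hand. In Impala you would compare the column's minimum and maximum against the built-in functions instead. The helper name is hypothetical.

```python
# Signed ranges of the integer types, smallest first.
INT_RANGES = [
    ("TINYINT", -2**7, 2**7 - 1),
    ("SMALLINT", -2**15, 2**15 - 1),
    ("INT", -2**31, 2**31 - 1),
    ("BIGINT", -2**63, 2**63 - 1),
]


def smallest_int_type(values):
    """Return the name of the smallest integer type that holds every value."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_RANGES:
        if type_min <= lo and hi <= type_max:
            return name
    raise ValueError("values do not fit in BIGINT")


print(smallest_int_type([0, 100]))     # fits in the smallest type
print(smallest_int_type([-129, 5]))    # one below the TINYINT minimum
print(smallest_int_type([40000]))      # above the SMALLINT maximum
```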

New Features in Impala Version 1.3.2 / CDH 5.0.4

No new features. This point release is exclusively a bug fix release for the IMPALA-1019 issue related to HDFS caching.

  Note: Impala 1.3.2 is only available as part of CDH 5.0.4, not under CDH 4.

New Features in Impala Version 1.3.1 / CDH 5.0.3

This point release is primarily a vehicle to deliver bug fixes. Any new features are minor changes resulting from fixes for performance, reliability, or usability issues.

Because 1.3.1 is the first 1.3.x release for CDH 4, if you are on CDH 4, also consult New Features in Impala Version 1.3.0 / CDH 5.0.0 for more features that are new to you.

  Note: The Impala 1.3.1 release is available for both CDH 4 and CDH 5. This is the first release in the 1.3.x series for CDH 4.

  • A new impalad startup option, --insert_inherit_permissions, causes Impala INSERT statements to create each new partition with the same HDFS permissions as its parent directory. By default, INSERT statements create directories for new partitions using default HDFS permissions. See INSERT Statement for examples of INSERT statements for partitioned tables.

  • The SHOW FUNCTIONS statement now displays the return type of each function, in addition to the types of its arguments. See SHOW Statement for examples.

  • You can now specify the clause FIELDS TERMINATED BY '\0' with a CREATE TABLE statement to use text data files that use ASCII 0 (nul) characters as a delimiter. See Using Text Data Files with Impala Tables for details.

  • In Impala 1.3.1 and higher, the REGEXP and RLIKE operators now match a regular expression string that occurs anywhere inside the target string, the same as if the regular expression were enclosed on each side by .*. See REGEXP Operator for examples. Previously, these operators only succeeded when the regular expression matched the entire target string. This change improves compatibility with the regular expression support for popular database systems. There is no change to the behavior of the regexp_extract() and regexp_replace() built-in functions.
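
The difference between the old and new REGEXP behavior is the difference between matching the whole string and searching within it. The sketch below models both with Python's re module. The two function names are hypothetical, and Python's regular expression dialect is not identical to the one Impala uses, so treat it only as an illustration of the matching rule.

```python
import re


def regexp_before_1_3_1(target: str, pattern: str) -> bool:
    """Old rule: the pattern must match the entire target string."""
    return re.fullmatch(pattern, target) is not None


def regexp_from_1_3_1(target: str, pattern: str) -> bool:
    """New rule: the pattern may match anywhere inside the target string."""
    return re.search(pattern, target) is not None


target = "impala rocks"
print(regexp_before_1_3_1(target, "rock"))      # old rule, bare pattern
print(regexp_from_1_3_1(target, "rock"))        # new rule, bare pattern
print(regexp_before_1_3_1(target, ".*rock.*"))  # old rule with explicit .* padding
```

Queries written for the old behavior that already pad the pattern with .* keep working under the new rule.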

New Features in Impala Version 1.3.0 / CDH 5.0.0

  Note: The Impala 1.3.1 release is available for both CDH 4 and CDH 5. This is the first release in the 1.3.x series for CDH 4.

  • The admission control feature lets you control and prioritize the volume and resource consumption of concurrent queries. This mechanism reduces spikes in resource usage, helping Impala to run alongside other kinds of workloads on a busy cluster. It also provides more user-friendly conflict resolution when multiple memory-intensive queries are submitted concurrently, avoiding resource contention that formerly resulted in out-of-memory errors. See Admission Control and Query Queuing for details.

  • Enhanced EXPLAIN plans provide more detail in an easier-to-read format. Now there are four levels of verbosity: the EXPLAIN_LEVEL option can be set from 0 (most concise) to 3 (most verbose). See EXPLAIN Statement for syntax and Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles for usage information.

  • The TIMESTAMP data type accepts more kinds of input string formats through the UNIX_TIMESTAMP function, and produces more varieties of string formats through the FROM_UNIXTIME function. The documentation now also lists more functions for date arithmetic, used for adding and subtracting INTERVAL expressions from TIMESTAMP values. See Impala Date and Time Functions for details.

  • New conditional functions, NULLIF(), NULLIFZERO(), and ZEROIFNULL(), simplify porting SQL containing vendor extensions to Impala. See Impala Conditional Functions for details.

  • New utility function, CURRENT_DATABASE(). See Miscellaneous Functions for details.

  • Integration with the YARN resource management framework. Only available in combination with CDH 5. This feature makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Integrated Resource Management with YARN for full details.

    On the Impala side, this feature involves some new startup options for the impalad daemon:

    • -enable_rm
    • -llama_host
    • -llama_port
    • -llama_callback_port
    • -cgroup_hierarchy_path

    For details of these startup options, see Modifying Impala Startup Options.

    This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:

    • MEM_LIMIT: the function of this existing option changes when Impala resource management is enabled.
    • REQUEST_POOL: a new option. (Renamed to RESOURCE_POOL in Impala 1.3.0.)
    • V_CPU_CORES: a new option.
    • RESERVATION_REQUEST_TIMEOUT: a new option.

    For details of these query options, see impala-shell Query Options for Resource Management.
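
Returning to the conditional functions listed earlier in this section, their behavior is simple enough to model in a few lines. The Python functions below are hypothetical stand-ins that use None to represent SQL NULL. They show the intended results only and ignore SQL type rules.

```python
def nullif(a, b):
    """NULLIF(a, b): NULL when the two arguments are equal, otherwise a."""
    return None if a == b else a


def nullifzero(x):
    """NULLIFZERO(x): NULL when x is zero, otherwise x."""
    return None if x == 0 else x


def zeroifnull(x):
    """ZEROIFNULL(x): zero when x is NULL, otherwise x."""
    return 0 if x is None else x


print(nullif(5, 5), nullif(5, 6))
print(nullifzero(0), nullifzero(7))
print(zeroifnull(None), zeroifnull(7))
```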

New Features in Impala Version 1.2.4

  Note: Impala 1.2.4 works with CDH 4. It is primarily a bug fix release for Impala 1.2.3, plus some performance enhancements for the catalog server to minimize startup and DDL wait times for Impala deployments with large numbers of databases, tables, and partitions.

  • On Impala startup, the metadata loading and synchronization mechanism has been improved and optimized, to give more responsiveness when starting Impala on a system with a large number of databases, tables, or partitions. The initial metadata loading happens in the background, allowing queries to be run before the entire process is finished. When a query refers to a table whose metadata is not yet loaded, the query waits until the metadata for that table is loaded, and the load operation for that table is prioritized to happen first.

  • Formerly, if you created a new table in Hive, you had to issue the INVALIDATE METADATA statement (with no table name) which was an expensive operation that reloaded metadata for all tables. Impala did not recognize the name of the Hive-created table, so you could not do INVALIDATE METADATA new_table to get the metadata for just that one table. Now, when you issue INVALIDATE METADATA table_name, Impala checks to see if that name represents a table created in Hive, and if so recognizes the new table and loads the metadata for it. Additionally, if the new table is in a database that was newly created in Hive, Impala also recognizes the new database.

  • If you issue INVALIDATE METADATA table_name and the table has been dropped through Hive, Impala will recognize that the table no longer exists.

  • New startup options let you control the parallelism of the metadata loading during startup for the catalogd daemon:

    • --load_catalog_in_background makes Impala load and cache metadata using background threads after startup. It is true by default. Previously, a system with a large number of databases, tables, or partitions could be unresponsive or even time out during startup.

    • --num_metadata_loading_threads determines how much parallelism Impala devotes to loading metadata in the background. The default is 16. You might increase this value for systems with huge numbers of databases, tables, or partitions. You might lower this value for busy systems that are CPU-constrained due to jobs from components other than Impala.
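
The prioritization described at the start of this section, where a query that needs a not-yet-loaded table moves that table to the front of the loading queue, can be pictured with a small single-threaded model. This is a conceptual sketch only: the class and method names are hypothetical, and the real catalogd daemon loads metadata with multiple background threads.

```python
from collections import deque


class BackgroundLoader:
    """Toy model of background metadata loading with on-demand prioritization."""

    def __init__(self, tables):
        self.pending = deque(tables)  # tables still waiting to be loaded
        self.loaded = []              # tables loaded so far, in load order

    def load_next(self):
        """Load the table at the front of the queue, if any."""
        if self.pending:
            self.loaded.append(self.pending.popleft())

    def query(self, table):
        """A query on `table` waits for its metadata, which is loaded first."""
        if table in self.loaded:
            return
        if table not in self.pending:
            raise KeyError(table)
        self.pending.remove(table)
        self.pending.appendleft(table)  # jump the queue
        self.load_next()


loader = BackgroundLoader(["a", "b", "c", "d"])
loader.load_next()   # background work loads the first table
loader.query("d")    # a query arrives for the last table in the queue
print(loader.loaded, list(loader.pending))
```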

New Features in Impala Version 1.2.3

  Note: Impala 1.2.3 works with CDH 4 and with CDH 5 beta 2. The resource management feature requires CDH 5 beta.

Impala 1.2.3 contains exactly the same feature set as Impala 1.2.2. Its only difference is one additional fix for compatibility with Parquet files generated outside of Impala by components such as Hive, Pig, or MapReduce. See Cloudera Impala Known Issues and Workarounds for details of that fix. If you are upgrading from Impala 1.2.1 or earlier, see New Features in Impala Version 1.2.2 for the latest added features.

New Features in Impala Version 1.2.2

  Note: Impala 1.2.2 works with CDH 4. Its feature set is a superset of features in the Impala 1.2.0 beta, with the exception of resource management, which relies on CDH 5.

Impala 1.2.2 includes new features for performance, security, and flexibility. The major enhancements over 1.2.1 are performance related, primarily for join queries.

New user-visible features include:

  • Join order optimizations. This highly valuable feature automatically distributes and parallelizes the work for a join query to minimize disk I/O and network traffic. The automatic optimization reduces the need to use query hints or to rewrite join queries with the tables in a specific order based on size or cardinality. The new COMPUTE STATS statement gathers statistical information about each table that is crucial for enabling the join optimizations. See Performance Considerations for Join Queries for details.

  • The COMPUTE STATS statement collects both table statistics and column statistics with a single statement. It is intended to be more comprehensive, efficient, and reliable than the corresponding Hive ANALYZE TABLE statement, which collects statistics in multiple phases through MapReduce jobs. These statistics are important for query planning for join queries, queries on partitioned tables, and other types of data-intensive operations. For optimal planning of join queries, you need to collect statistics for each table involved in the join. See COMPUTE STATS Statement for details.

  • Reordering of tables in a join query can be overridden by the STRAIGHT_JOIN operator, allowing you to fine-tune the planning of the join query if necessary, by using the original technique of ordering the joined tables in descending order of size. See Overriding Join Reordering with STRAIGHT_JOIN for details.

  • The CROSS JOIN clause in the SELECT statement allows Cartesian products in queries, that is, joins without an equality comparison between columns in both tables. Because such queries must be carefully checked to avoid accidental overconsumption of memory, you must use the CROSS JOIN operator to explicitly select this kind of join. See Cross Joins and Cartesian Products with the CROSS JOIN Operator for examples.

  • The ALTER TABLE statement has new clauses that let you fine-tune table statistics. You can use this technique as a less-expensive way to update specific statistics, in case the statistics become stale, or to experiment with the effects of different data distributions on query planning.

  • LDAP username/password authentication in JDBC/ODBC. See Enabling LDAP Authentication for Impala for details.

  • GROUP_CONCAT() aggregate function to concatenate column values across all rows of a result set.

  • The INSERT statement now accepts hints, [SHUFFLE] and [NOSHUFFLE], to influence the way work is redistributed during INSERT...SELECT operations. The hints are primarily useful for inserting into partitioned Parquet tables, where using the [SHUFFLE] hint can avoid problems due to memory consumption and simultaneous open files in HDFS, by collecting all the new data for each partition on a specific node.

  • Several built-in functions and operators are now overloaded for more numeric data types, to reduce the requirement to use CAST() for type coercion in INSERT statements. For example, the expression 2+2 in an INSERT statement formerly produced a BIGINT result, requiring a CAST() to be stored in an INT column. Now, addition, subtraction, and multiplication only produce a result that is one step "bigger" than their arguments, and numeric and conditional functions can return SMALLINT, FLOAT, and other smaller types rather than always BIGINT or DOUBLE.

  • New fnv_hash() built-in function for constructing hashed values. See Impala Mathematical Functions for details.

  • The clause STORED AS PARQUET is accepted as an equivalent for STORED AS PARQUETFILE. This more concise form is recommended for new code.

Because Impala 1.2.2 builds on a number of features introduced in 1.2.1, if you are upgrading from an older 1.1.x release straight to 1.2.2, also review New Features in Impala Version 1.2.1 to see features such as the SHOW TABLE STATS and SHOW COLUMN STATS statements, and user-defined functions (UDFs).
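
The "one step bigger" rule for integer arithmetic mentioned in the list above can be sketched as a lookup over the integer type ladder. The function name is hypothetical, and the sketch covers integer types only. It also assumes that a small literal such as 2 is typed as TINYINT, which is what makes 2+2 a SMALLINT under this rule.

```python
# Integer types ordered from smallest to largest.
INT_TYPES = ["TINYINT", "SMALLINT", "INT", "BIGINT"]


def arithmetic_result_type(left: str, right: str) -> str:
    """Result type of +, -, or * under the 'one step bigger' rule."""
    widest = max(INT_TYPES.index(left), INT_TYPES.index(right))
    # Step up one type, but never past the largest type.
    return INT_TYPES[min(widest + 1, len(INT_TYPES) - 1)]


print(arithmetic_result_type("TINYINT", "TINYINT"))  # e.g. 2 + 2
print(arithmetic_result_type("INT", "SMALLINT"))
print(arithmetic_result_type("BIGINT", "BIGINT"))
```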

New Features in Impala Version 1.2.1

  Note: Impala 1.2.1 works with CDH 4. Its feature set is a superset of features in the Impala 1.2.0 beta, with the exception of resource management, which relies on CDH 5.

Impala 1.2.1 includes new features for security, performance, and flexibility.

New user-visible features include:

  • SHOW TABLE STATS table_name and SHOW COLUMN STATS table_name statements, to verify that statistics are available and to see the values used during query planning.

  • CREATE TABLE AS SELECT syntax, to create a new table and transfer data into it in a single operation.

  • OFFSET clause, for use with the ORDER BY and LIMIT clauses to produce "paged" result sets such as items 1-10, then 11-20, and so on.

  • NULLS FIRST and NULLS LAST clauses to ensure consistent placement of NULL values in ORDER BY queries.

  • New built-in functions: least(), greatest(), initcap().

  • New aggregate function: ndv(), a fast alternative to COUNT(DISTINCT col) returning an approximate result.

  • The LIMIT clause can now accept a numeric expression as an argument, rather than only a literal constant.

  • The SHOW CREATE TABLE statement displays the end result of all the CREATE TABLE and ALTER TABLE statements for a particular table. You can use the output to produce a simplified setup script for a schema.

  • The --idle_query_timeout and --idle_session_timeout options for impalad control the time intervals after which idle queries are cancelled, and idle sessions expire. See Setting Timeout Periods for Daemons, Queries, and Sessions for details.

  • User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.

    You create UDFs through the CREATE FUNCTION statement and drop them through the DROP FUNCTION statement. See User-Defined Functions (UDFs) for instructions about coding, building, and deploying UDFs, and CREATE FUNCTION Statement and DROP FUNCTION Statement for related SQL syntax.

  • A new service automatically propagates changes to table data and metadata made by one Impala node, sending the new or updated metadata to all the other Impala nodes. The automatic synchronization mechanism eliminates the need to use the INVALIDATE METADATA and REFRESH statements after issuing Impala statements such as CREATE TABLE, ALTER TABLE, DROP TABLE, INSERT, and LOAD DATA.

    For even more precise synchronization, you can enable the SYNC_DDL query option before issuing a DDL, INSERT, or LOAD DATA statement. This option causes the statement to wait, returning only after the catalog service has broadcast the applicable changes to all Impala nodes in the cluster.

    Note: Because the catalog service only monitors operations performed through Impala, INVALIDATE METADATA and REFRESH are still needed on the Impala side after creating new tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because the catalog service broadcasts the result of the REFRESH and INVALIDATE METADATA statements to all Impala nodes, when you do need to use those statements, you can do so a single time rather than on every Impala node.

    This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.

  • CREATE TABLE ... AS SELECT syntax, to create a table and copy data into it in a single operation. See CREATE TABLE Statement for details.

  • The CREATE TABLE and ALTER TABLE statements have new clauses TBLPROPERTIES and WITH SERDEPROPERTIES. The TBLPROPERTIES clause lets you associate arbitrary items of metadata with a particular table as key-value pairs. The WITH SERDEPROPERTIES clause lets you specify the serializer/deserializer (SerDes) classes that read and write data for a table; although Impala does not make use of these properties, sometimes particular values are needed for Hive compatibility. See CREATE TABLE Statement and ALTER TABLE Statement for details.

  • Impersonation support lets you authorize certain OS users associated with applications (for example, hue), to submit requests using the credentials of other users. Only available in combination with CDH 5. See Configuring Per-User Access for Hue for details.

  • Enhancements to EXPLAIN output. In particular, when you enable the new EXPLAIN_LEVEL query option, the EXPLAIN and PROFILE statements produce more verbose output showing estimated resource requirements and whether table and column statistics are available for the applicable tables and columns. See EXPLAIN Statement for details.

  • SHOW CREATE TABLE summarizes the effects of the original CREATE TABLE statement and any subsequent ALTER TABLE statements, giving you a CREATE TABLE statement that will re-create the current structure and layout for a table.

  • The LIMIT clause for queries now accepts an arithmetic expression, in addition to numeric literals.
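
The "paged" result sets made possible by the OFFSET clause in this list can be sketched as follows. The example uses Python's built-in sqlite3 module as a stand-in for Impala. The items table and the fetch_page helper are hypothetical, and the offset is computed on the client side before it is passed to the query.

```python
import sqlite3

# In-memory SQLite database standing in for Impala (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 26)])


def fetch_page(page, page_size=10):
    """Return one page of ids; ORDER BY keeps the paging deterministic."""
    offset = (page - 1) * page_size
    rows = conn.execute(
        "SELECT id FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
    return [row[0] for row in rows]


print(fetch_page(1))  # items 1-10
print(fetch_page(2))  # items 11-20
print(fetch_page(3))  # the remaining items
```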

New Features in Impala Version 1.2.0 (Beta)

  Note: The Impala 1.2.0 beta release only works in combination with the beta version of CDH 5. The Impala 1.2.0 software is bundled together with the CDH 5 beta 1 download.

The Impala 1.2.0 beta includes new features for security, performance, and flexibility.

New user-visible features include:

  • User-defined functions (UDFs). This feature lets you transform data in very flexible ways, which is important when using Impala as part of an ETL or ELT pipeline. Prior to Impala 1.2, using UDFs required switching into Hive. Impala 1.2 can run scalar UDFs and user-defined aggregate functions (UDAs). Impala can run high-performance functions written in C++, or you can reuse existing Hive functions written in Java.

    You create UDFs through the CREATE FUNCTION statement and drop them through the DROP FUNCTION statement. See User-Defined Functions (UDFs) for instructions about coding, building, and deploying UDFs, and CREATE FUNCTION Statement and DROP FUNCTION Statement for related SQL syntax.

  • A new service automatically propagates changes to table data and metadata made by one Impala node, sending the new or updated metadata to all the other Impala nodes. The automatic synchronization mechanism eliminates the need to use the INVALIDATE METADATA and REFRESH statements after issuing Impala statements such as CREATE TABLE, ALTER TABLE, DROP TABLE, INSERT, and LOAD DATA.

      Note:

    Because this service only monitors operations performed through Impala, INVALIDATE METADATA and REFRESH are still needed on the Impala side after creating new tables or loading data through the Hive shell or by manipulating data files directly in HDFS. Because the catalog service broadcasts the result of the REFRESH and INVALIDATE METADATA statements to all Impala nodes, when you do need to use those statements, you can do so a single time rather than on every Impala node.

    This service is implemented by the catalogd daemon. See The Impala Catalog Service for details.

  • Integration with the YARN resource management framework. This feature is only available in combination with CDH 5. It makes use of the underlying YARN service, plus an additional service (Llama) that coordinates requests to YARN for Impala resources, so that the Impala query only proceeds when all requested resources are available. See Integrated Resource Management with YARN for full details.

    On the Impala side, this feature involves some new startup options for the impalad daemon:

    • -enable_rm
    • -llama_host
    • -llama_port
    • -llama_callback_port
    • -cgroup_hierarchy_path

    For details of these startup options, see Modifying Impala Startup Options.

    This feature also involves several new or changed query options that you can set through the impala-shell interpreter and apply within a specific session:

    • MEM_LIMIT: the function of this existing option changes when Impala resource management is enabled.
    • YARN_POOL: a new option. (Renamed to RESOURCE_POOL in Impala 1.3.0.)
    • V_CPU_CORES: a new option.
    • RESERVATION_REQUEST_TIMEOUT: a new option.

    For details of these query options, see impala-shell Query Options for Resource Management.

  • CREATE TABLE ... AS SELECT syntax, to create a table and copy data into it in a single operation. See CREATE TABLE Statement for details.

  • The CREATE TABLE and ALTER TABLE statements have a new TBLPROPERTIES clause that lets you associate arbitrary items of metadata with a particular table as key-value pairs. See CREATE TABLE Statement and ALTER TABLE Statement for details.

  • Impersonation support lets you authorize certain OS users associated with applications (for example, hue) to submit requests using the credentials of other users. This feature is only available in combination with CDH 5. See Configuring Per-User Access for Hue for details.

  • Enhancements to EXPLAIN output. In particular, when you enable the new EXPLAIN_LEVEL query option, the EXPLAIN and PROFILE statements produce more verbose output showing estimated resource requirements and whether table and column statistics are available for the applicable tables and columns. See EXPLAIN Statement for details.
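
As a rough sketch of the UDF workflow described in this list, the statements below register, call, and drop a scalar C++ function. The function name my_lower, the HDFS path, the library file name, and the symbol are hypothetical placeholders; substitute the values for a library you have actually built and deployed.

```sql
-- Register a scalar UDF implemented in C++.
-- The LOCATION path and SYMBOL value are hypothetical placeholders.
CREATE FUNCTION my_lower(STRING) RETURNS STRING
  LOCATION '/user/impala/udfs/libudfsample.so'
  SYMBOL='MyLower';

-- Call the function like any built-in function.
SELECT my_lower('Hello World');

-- Remove the function; the argument types identify which signature to drop.
DROP FUNCTION my_lower(STRING);
```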

New Features in Impala Version 1.1.1

Impala 1.1.1 includes new features for security and stability.

New user-visible features include:

  • Additional security feature: auditing. New startup options for impalad let you capture information about Impala queries that succeed or are blocked due to insufficient privileges. To take full advantage of this feature with Cloudera Manager, upgrade to Cloudera Manager 4.7 or later. For details, see Overview of Impala Security.
  • Parquet data files generated by Impala 1.1.1 are now compatible with the Parquet support in Hive. See Cloudera Impala Incompatible Changes for the procedure to update older Impala-created Parquet files to be compatible with the Hive Parquet support.
  • Additional improvements to stability and resource utilization for Impala queries.
  • Additional enhancements for compatibility with existing file formats.

New Features in Impala Version 1.1

Impala 1.1 includes new features for security, performance, and usability.

New user-visible features include:

  • Extensive new security features, built on top of the Sentry open source project. Impala now supports fine-grained authorization based on roles. A policy file determines which privileges on which schema objects (servers, databases, tables, and HDFS paths) are available to users based on their membership in groups. By assigning privileges for views, you can control access to table data at the column level. For details, see Overview of Impala Security.
  • Impala 1.1 works with Cloudera Manager 4.6 or later. To use Cloudera Manager to manage authorization for the Impala web UI (the web pages served from port 25000 by default), use Cloudera Manager 4.6.2 or later.
  • Impala can now create, alter, drop, and query views. Views provide a flexible way to set up simple aliases for complex queries; hide query details from applications and users; and simplify maintenance as you rename or reorganize databases, tables, and columns. See the overview section Views and the statements CREATE VIEW Statement, ALTER VIEW Statement, and DROP VIEW Statement.
  • Performance is improved through a number of automatic optimizations. Resource consumption is also reduced for Impala queries. These improvements apply broadly across all kinds of workloads and file formats. The major areas of performance enhancement include:
    • Improved disk and thread scheduling, which applies to all queries.
    • Improved hash join and aggregation performance, which applies to queries with large build tables or a large number of groups.
    • Dictionary encoding with Parquet, which applies to Parquet tables with short string columns.
    • Improved performance on systems with SSDs, which applies to all queries and file formats.
  • Some new built-in functions are implemented: translate() to substitute characters within strings, and user() to check the login ID of the connected user.
  • The new WITH clause for SELECT statements lets you simplify complicated queries in a way similar to creating a view. The effects of the WITH clause only last for the duration of one query, unlike views, which are persistent schema objects that can be used by multiple sessions or applications. See WITH Clause.
  • An enhancement to the DESCRIBE statement, DESCRIBE FORMATTED table_name, displays more detailed information about the table. This information includes the file format, location, delimiter, ownership, external or internal, creation and access times, and partitions. The information is returned as a result set that can be interpreted and used by a management or monitoring application. See DESCRIBE Statement.
  • You can now insert a subset of columns for a table, with other columns being left as all NULL values. Or you can specify the columns in any order in the destination table, rather than having to match the order of the corresponding columns in the source query or VALUES clause. This feature is known as "column permutation". See INSERT Statement.
  • The new LOAD DATA statement lets you load data into a table directly from an HDFS data file. This technique lets you minimize the number of steps in your ETL process, and provides more flexibility. For example, you can bring data into an Impala table in one step. Formerly, you might have created an external table where the data files are not entirely under your control, or copied the data files to Impala data directories manually, or loaded the original data into one table and then used the INSERT statement to copy it to a new table with a different file format, partitioning scheme, and so on. See LOAD DATA Statement.
  • Improvements to Impala-HBase integration:
  • You can issue REFRESH as a SQL statement through any of the programming interfaces that Impala supports. REFRESH formerly had to be issued as a command through the impala-shell interpreter, and was not available through a JDBC or ODBC API call. As part of this change, the functionality of the REFRESH statement is divided between two statements. In Impala 1.1, REFRESH requires a table name argument and immediately reloads the metadata; the new INVALIDATE METADATA statement works the same as the Impala 1.0 REFRESH did: the table name argument is optional, and the metadata for one or all tables is marked as stale, but not actually reloaded until the table is queried. When you create a new table in the Hive shell or through a different Impala node, you must enter INVALIDATE METADATA with no table parameter before you can see the new table in impala-shell. See REFRESH Statement and INVALIDATE METADATA Statement.
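
Several of the SQL features in this list can be sketched with one small table. This is an illustrative example under stated assumptions: the table events, the view signup_events, and the HDFS path are hypothetical, and the file named in LOAD DATA is assumed to already exist in a format matching the table.

```sql
-- Hypothetical table used throughout this example.
CREATE TABLE events (id BIGINT, category STRING, note STRING);

-- Column permutation: name a subset of columns, in any order.
-- The unnamed column (note) is left as NULL.
INSERT INTO events (category, id) VALUES ('signup', 1), ('login', 2);

-- WITH clause: a named subquery that exists only for this one query.
WITH signups AS (SELECT id FROM events WHERE category = 'signup')
SELECT COUNT(*) FROM signups;

-- View: a persistent alias for a similar query, usable from other sessions.
CREATE VIEW signup_events AS
  SELECT id, note FROM events WHERE category = 'signup';

-- LOAD DATA: bring an HDFS data file into the table in one step.
-- The path is a placeholder.
LOAD DATA INPATH '/user/etl/staging/events.txt' INTO TABLE events;

-- DESCRIBE FORMATTED: detailed table information, returned as a result set.
DESCRIBE FORMATTED events;

-- REFRESH requires a table name and reloads that table's metadata immediately.
REFRESH events;

-- INVALIDATE METADATA with no table name marks all metadata as stale, for
-- example after creating a table through the Hive shell.
INVALIDATE METADATA;
```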

New Features in Impala Version 1.0.1

The primary enhancements in Impala 1.0.1 are internal, for compatibility with the new Cloudera Manager 4.6 release. Try out the new Impala Query Monitoring feature in Cloudera Manager 4.6, which requires Impala 1.0.1.

New user-visible features include:

  • The VALUES clause lets you INSERT one or more rows using literals, function return values, or other expressions. For performance and scalability, you should still use INSERT ... SELECT for bringing large quantities of data into an Impala table. The VALUES clause is a convenient way to set up small tables, particularly for initial testing of SQL features that do not require large amounts of data. See VALUES Clause for details.
  • The -B and -o options of the impala-shell command can turn query results into delimited text files and store them in an output file. The plain text results are useful with other Hadoop components or Unix tools. In benchmark tests, it is also faster to produce plain rather than pretty-printed results, and to write to a file rather than to the screen, giving a more accurate picture of the actual query time.
  • Several bug fixes. See Issues Fixed in the 1.0.1 Release for details.
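
A minimal sketch of the VALUES clause described above, mixing literals, a function return value, and an arithmetic expression. The table values_demo and its columns are hypothetical.

```sql
-- Hypothetical small table for trying out SQL features.
CREATE TABLE values_demo (id INT, name STRING, name_len INT);

-- Insert two rows: the first uses only literals, the second uses an
-- arithmetic expression and function return values.
INSERT INTO values_demo VALUES
  (1, 'one', 3),
  (1 + 1, upper('two'), length('two'));

SELECT id, name, name_len FROM values_demo ORDER BY id LIMIT 10;
```

For large volumes of data, INSERT ... SELECT remains the better choice, as the text notes; the VALUES form is most useful for small test tables like this one.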

New Features in Impala Version 1.0

This version has multiple performance improvements and adds the following functionality:

New Features in Version 0.7 of the Cloudera Impala Beta Release

This version has multiple performance improvements and adds the following functionality:

  • Several bug fixes. See Issues Fixed in Version 0.7 of the Beta Release.
  • Support for the Parquet file format. For more information on file formats, see How Impala Works with Hadoop File Formats.
  • Added support for Avro.
  • Support for memory limits. For more information, see the example on modifying memory limits in Modifying Impala Startup Options.
  • Bigger and faster joins through the addition of partitioned joins to the already supported broadcast joins.
  • Fully distributed aggregations.
  • Fully distributed top-n computation.
  • Support for creating and altering tables.
  • Support for GROUP BY with floats and doubles.

This version supports both CDH 4.1 and CDH 4.2, but because of added performance improvements, we highly recommend that you use CDH 4.2 or later to see the full benefit. If you are using Cloudera Manager, version 4.5 is required.

New Features in Version 0.6 of the Cloudera Impala Beta Release

  • Several bug fixes. See Issues Fixed in Version 0.6 of the Beta Release.
  • Added support for Impala on SUSE and Debian/Ubuntu. Impala is now supported on:
    • RHEL 5.7/6.2 and CentOS 5.7/6.2
    • SUSE 11 with Service Pack 1 or later
    • Ubuntu 10.04/12.04 and Debian 6.03
  • Cloudera Manager 4.5 and CDH 4.2 support Impala 0.6.
  • Support for the RCFile file format. For more information on file formats, see Understanding File Formats.

New Features in Version 0.5 of the Cloudera Impala Beta Release

New Features in Version 0.4 of the Cloudera Impala Beta Release

  • Several bug fixes. See Issues Fixed in Version 0.4 of the Beta Release.
  • Added support for Impala on RHEL 5.7/CentOS 5.7. Impala is now supported on RHEL 5.7/6.2 and CentOS 5.7/6.2.
  • Cloudera Manager 4.1.3 supports Impala 0.4.
  • The Impala debug webserver now has the ability to serve static files from ${IMPALA_HOME}/www. This can be disabled by setting --enable_webserver_doc_root=false on the command line. As a result, Impala now uses the Twitter Bootstrap library to style its debug webpages, and the /queries page now tracks the last 25 queries run by each Impala daemon.
  • Additional metrics available on the Impala Debug Webpage.

New Features in Version 0.3 of the Cloudera Impala Beta Release

  • Several bug fixes. See Issues Fixed in Version 0.3 of the Beta Release.
  • The state-store-service binary has been renamed statestored.
  • The location of the Impala configuration files has changed from the /usr/lib/impala/conf directory to the /etc/impala/conf directory.

New Features in Version 0.2 of the Cloudera Impala Beta Release

  • Several bug fixes. See Issues Fixed in Version 0.2 of the Beta Release.
  • Added default query options. Default query options override all default QueryOption values when starting impalad. The format is:
    -default_query_options='key=value;key=value'