Apache Impala - Interactive SQL

The Apache Impala project provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats. The fast response for queries enables interactive exploration and fine-tuning of analytic queries, rather than long batch jobs traditionally associated with SQL-on-Hadoop technologies. (You will often see the term "interactive" applied to these kinds of fast queries with human-scale response times.)

Impala integrates with the Apache Hive metastore database, to share databases and tables between both components. The high level of integration with Hive, and compatibility with the HiveQL syntax, lets you use either Impala or Hive to create tables, issue queries, load data, and so on.

The following are some of the key advantages of Impala:
  • Impala integrates with the existing CDH ecosystem, meaning data can be stored, shared, and accessed using the various solutions included with CDH. This also avoids data silos and minimizes expensive data movement.
  • Impala provides access to data stored in CDH without requiring the Java skills required for MapReduce jobs. Impala can access data directly from the HDFS file system. Impala also provides a SQL front-end to access data in the HBase database system, or in the Amazon Simple Storage System (S3).
  • Impala returns results typically within seconds or a few minutes, rather than the many minutes or hours that are often required for Hive queries to complete.
  • Impala is pioneering the use of the Parquet file format, a columnar storage layout that is optimized for large-scale queries typical in data warehouse scenarios.

Related information throughout the CDH 5 library:

In CDH 5, the Impala documentation for Release Notes, Installation, Upgrading, and Security has been integrated alongside the corresponding information for other Hadoop components: