Installing and Using Cloudera Impala

Cloudera Impala™ provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.

Cloudera Impala is an addition to the tools available for querying big data. Impala does not replace the batch processing frameworks built on MapReduce, such as Hive. Hive and other frameworks built on MapReduce are best suited for long-running batch jobs, such as Extract, Transform, and Load (ETL) jobs.

Impala Benefits

Impala provides:

  • Familiar SQL interface that data scientists and analysts already know
  • Ability to interactively query big data in Apache Hadoop
  • Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for analytics

How Cloudera Impala Works with CDH

The following graphic illustrates how Impala is positioned in the broader Cloudera environment.
[Figure: images/image1.jpeg]

The Impala solution is composed of the following components:

  • Clients - Entities including Hue, ODBC clients, JDBC clients, and the Impala Shell can all interact with Impala. These interfaces are typically used to issue queries or complete administrative tasks such as connecting to Impala.
  • Hive Metastore - Stores information about the data available to Impala. For example, the metastore lets Impala know what databases are available and what the structure of those databases is.
  • Cloudera Impala - This process, impalad, runs on DataNodes and coordinates and executes queries. Each instance of Impala can receive, plan, and coordinate queries from Impala clients. Queries are distributed among Impala nodes, and these nodes then act as workers, executing parallel query fragments.
  • HBase and HDFS - Storage for data to be queried.

Queries executed using Impala are handled as follows:

  1. User applications send SQL queries to Impala through ODBC or JDBC, which provide standardized querying interfaces. The user application may connect to any impalad in the cluster. This impalad becomes the coordinator for the query.
  2. Impala parses the query and analyzes it to determine what tasks need to be performed by impalad instances across the cluster. Execution is planned for optimal efficiency.
  3. Services such as HDFS and HBase are accessed by local impalad instances to provide data.
  4. Each impalad returns data to the coordinating impalad, which sends these results to the client.
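
The coordinator and worker pattern in the steps above can be sketched as a toy scatter-gather model in Python. This is an illustration only, not Impala's implementation: the node names, the sample rows, and the functions run_fragment and coordinate are all hypothetical, and threads stand in for impalad instances on separate machines. The sketch computes roughly what a query like SELECT region, SUM(amount) ... GROUP BY region would return.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds a local slice of (region, amount) rows,
# standing in for blocks stored next to an impalad instance.
NODE_DATA = {
    "node1": [("east", 10), ("west", 5)],
    "node2": [("east", 7), ("north", 3)],
    "node3": [("west", 8)],
}


def run_fragment(node):
    """Worker step: scan only local rows and compute a partial sum per region."""
    partial = {}
    for region, amount in NODE_DATA[node]:
        partial[region] = partial.get(region, 0) + amount
    return partial


def coordinate(nodes):
    """Coordinator step: send a fragment to every node, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(run_fragment, nodes))
    merged = {}
    for partial in partials:
        for region, total in partial.items():
            merged[region] = merged.get(region, 0) + total
    return merged


if __name__ == "__main__":
    # Any node could act as coordinator; here the caller plays that role.
    print(coordinate(sorted(NODE_DATA)))
```

Each worker reads only its own slice, so the only data that moves to the coordinator is the small partial result, which mirrors steps 3 and 4.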

Primary Impala Features

Impala provides support for: