Cloudera Impala Frequently Asked Questions
How do I try Cloudera Impala out?
To explore the core features and functionality of Impala, the easiest way is to download the Cloudera QuickStart VM, start the Impala service through Cloudera Manager, and then use impala-shell in a terminal window or the Impala Query UI in the Hue web interface.
To do performance testing and try out the management features for Impala on a cluster, you need to move beyond the QuickStart VM with its virtualized single-node environment. Ideally, download the Cloudera Manager software to set up the cluster, then install the Impala software through Cloudera Manager.
Does Cloudera offer a VM for demonstrating Impala?
Cloudera offers a demonstration VM called the QuickStart VM, available in VMWare, VirtualBox, and KVM formats. For more information, see the Cloudera QuickStart VM. When the QuickStart VM first boots, many services are turned off by default; in the Cloudera Manager UI that appears automatically, turn on Impala and any other components that you want to try out.
Where can I find Impala documentation?
Where can I get more information about Impala?
More product information is available here:
- O'Reilly e-book: Cloudera Impala: Bringing the SQL and Hadoop Worlds Together
- Blog: Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real
- Webinar: Introduction to Impala
- Product website page: Cloudera Enterprise RTQ
Impala System Requirements
What are the software and hardware requirements for running Impala?
For information on Impala requirements, see Cloudera Impala Requirements. Note that Cloudera recommends using Cloudera Manager 4.6 or higher with Impala 1.1.
How much memory is required?
Although Impala is not an in-memory database, when dealing with large tables and large result sets, you should expect to dedicate a substantial portion of physical memory to the impalad daemon. Recommended physical memory for an Impala node is 128 GB or higher. The amount of memory required for an Impala operation depends on several factors:
- The file format of the table. Different file formats represent the same data in more or fewer data files. The compression and encoding for each file format might require a different amount of temporary memory to decompress the data for analysis.
- Whether the operation is a SELECT or an INSERT. For example, Parquet tables require relatively little memory to query, because Impala reads and decompresses data in 8MB chunks. Inserting into a Parquet table is a more memory-intensive operation because the data for each data file (with a maximum size of 1GB) is stored in memory until encoded, compressed, and written to disk.
- Whether the table is partitioned or not, and whether a query against a partitioned table can take advantage of partition pruning.
- The size of the result set.
- The mechanism by which work is divided for a join query.
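For instance, partition pruning (mentioned above) limits how much data a query must read and hold in memory. A minimal sketch, using hypothetical table and column names:

```sql
-- Hypothetical partitioned table; names are illustrative.
CREATE TABLE sales (id BIGINT, amount DOUBLE)
PARTITIONED BY (year INT, month INT);

-- Only the partitions for 2013 are scanned; all other partitions are
-- pruned, reducing both I/O and the memory needed for intermediate results.
SELECT SUM(amount) FROM sales WHERE year = 2013;
```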
Impala currently does not "spill to disk" if intermediate results being processed on a node exceed the memory reserved for Impala on that node. If this is an issue for your use case (for example, joins between very large tables), more memory will be beneficial.
See Hardware Requirements for more details and recommendations about Impala hardware prerequisites.
What processor type and speed does Cloudera recommend?
Impala makes use of SSE4.2 instructions, which are available in the Nehalem and later generations of Intel chips and the Bulldozer and later generations of AMD chips. Impala runs fine on older machines, but does not achieve the best performance on them.
Supported and Unsupported Functionality In Impala
Impala supports the following functionality:
- A large subset of SQL and HiveQL commands, including SELECT and INSERT, with joins. For more information, see Impala SQL Language Reference.
- Using Cloudera Manager to manage Impala. Using Cloudera Manager 4.6 or later, you can deploy and manage your Impala services. Cloudera Manager is the best way to get started with Impala on your cluster. For more information, see the topic on Installing Impala with Cloudera Manager in the Cloudera Manager Installation Guide.
- Using Hue for queries.
- Appending and inserting data into tables through the INSERT statement. See How Impala Works with Hadoop File Formats for the details about which operations are supported for which file formats.
- ODBC: Impala is certified to run against MicroStrategy and Tableau, with restrictions. For more information, see Configuring Impala to Work with ODBC.
- Querying data stored in HDFS and HBase in a single query. See Using Impala to Query HBase Tables for details.
- Concurrent client requests. Each Impala daemon can handle multiple concurrent client requests. The effects on performance depend on your particular hardware and workload.
- Kerberos authentication. For more information, see Impala Security.
- Partitions. With Impala SQL, you can create partitioned tables with the CREATE TABLE statement, and add and drop partitions with the ALTER TABLE statement. Impala also takes advantage of the partitioning present in Hive tables. See Partitioning for details.
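The partitioning support described above can be sketched as follows; the table and partition key names are illustrative:

```sql
-- Create a partitioned table with the CREATE TABLE statement.
CREATE TABLE logs (msg STRING)
PARTITIONED BY (year INT, month INT);

-- Add and drop partitions with the ALTER TABLE statement.
ALTER TABLE logs ADD PARTITION (year = 2013, month = 7);
ALTER TABLE logs DROP PARTITION (year = 2012, month = 1);
```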
Impala does not support the following functionality:
- Querying streaming data.
- Deleting individual rows. You delete data in bulk by overwriting an entire table or partition, or by dropping a table.
- Indexing. (Not currently supported.) LZO-compressed text files can be indexed outside of Impala, as described in Using LZO-Compressed Text Files.
- Full text search on text fields. The Cloudera Search product is appropriate for this use case.
- Custom Hive Serializer/Deserializer classes (SerDes). Impala supports a set of common native file formats that have built-in SerDes in CDH. See How Impala Works with Hadoop File Formats for details.
- Failover for running queries. Currently, Impala cancels a running query if any host on which that query is executing fails. When one or more hosts are down, Impala reroutes future queries to only use the available hosts, and Impala detects when the hosts come back up and begins using them again. Because a query can be submitted through any Impala node, there is no single point of failure. In the future, we will consider adding additional work allocation features to Impala, so that a running query would complete even in the presence of host failures.
- Encryption of data transmitted between Impala daemons.
- Window functions.
- Hive indexes.
- Non-Hadoop data stores, such as relational databases.
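The bulk-delete pattern mentioned above (overwriting a table or partition, or dropping a table, rather than deleting individual rows) can be sketched like this, using illustrative table names:

```sql
-- Overwrite an entire table, keeping only the rows you want.
INSERT OVERWRITE TABLE sales
  SELECT * FROM sales WHERE amount > 0;

-- Remove data in bulk by dropping a partition or the whole table.
ALTER TABLE sales DROP PARTITION (year = 2010);
DROP TABLE obsolete_data;
```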
For the detailed list of unsupported HiveQL features, see SQL Differences Between Impala and Hive.
Roadmap of New Functionality In Impala
The following information describes the plans for new functionality in Impala in the future, but the information is subject to change.
Does Impala support generic JDBC?
Impala supports the HiveServer2 JDBC driver.
Is Avro supported?
Yes, Avro is supported. Impala can query Avro tables. Currently, you must create such tables and load the data within Hive. See Using the Avro File Format with Impala Tables for details.
What's next for Cloudera Impala?
See our blog post: http://blog.cloudera.com/blog/2012/12/whats-next-for-cloudera-impala/
Impala Use Cases
What are good Impala use cases? Under what conditions should Impala or Hive/MapReduce be used?
Impala is well-suited to executing SQL queries for interactive exploratory analytics on large datasets. Hive and MapReduce are appropriate for very long running, batch-oriented tasks such as ETL.
Is MapReduce required for Impala? Will Impala continue to work as expected if MapReduce is stopped?
Impala does not use MapReduce at all.
How do I import or use my existing data from Hive?
Impala does not require any import or conversion operation for data that already exists in Hive. To run an Impala query against a data set, simply create the table in Impala or in Hive. When you create a table in Hive while Impala is already running, refresh the Impala metadata cache with the INVALIDATE METADATA statement to make Impala aware of the new table. If you load new data into the table through Hive or through manual HDFS operations, issue a REFRESH table_name statement in Impala to make Impala aware of the new data files.
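The two statements mentioned above look like this in practice; the table name is illustrative:

```sql
-- After creating a new table through Hive, make Impala aware of it:
INVALIDATE METADATA;

-- After loading new data files into an existing table through Hive
-- or through manual HDFS operations:
REFRESH web_logs;
```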
Can Impala be used for complex event processing?
For example, in an industrial environment, many agents may generate large amounts of data. Can Impala be used to analyze this data, checking for notable changes in the environment?
Complex Event Processing (CEP) is usually performed by dedicated stream-processing systems. Impala is not a stream-processing system; it most closely resembles a relational database.
Is Impala intended to handle real time queries in low-latency applications or is it for ad hoc queries for the purpose of data exploration?
Ad hoc queries are the primary use case for Impala, and we anticipate it being used in many other situations where low latency is required. Whether Impala is appropriate for a particular use case depends on the workload, data size, and query volume. See Impala Benefits for the primary benefits you can expect when using Impala.
Questions about Impala And Hive
How does Impala compare to Hive and Pig?
Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries. Because Impala does not rely on MapReduce, it avoids the startup overhead of MapReduce jobs, allowing Impala to return results in real time.
Can I do transforms or add new functionality?
Impala adds support for UDFs in Impala 1.2. You can write your own functions in C++, or reuse existing Java-based Hive UDFs. The UDF support includes scalar functions and user-defined aggregate functions (UDAs). User-defined table functions (UDTFs) are not currently supported.
Impala does not currently support an extensible serialization-deserialization framework (SerDes), and so adding extra functionality to Impala is not as straightforward as for Hive or Pig.
Can any Impala query also be executed in Hive?
Yes. There are some minor differences in how some queries are handled, but Impala queries can also be completed in Hive. Impala SQL is a subset of HiveQL, with some functional limitations such as transforms. For details of the Impala SQL dialect, see Impala SQL Language Elements. For the Impala built-in functions, see Built-in Function Support. For the detailed list of unsupported HiveQL features, see SQL Differences Between Impala and Hive.
Can I use Impala to query data already loaded into Hive and HBase? Or are there special steps that must be taken to use Impala to query data in Hive or HBase?
There are no additional steps to allow Impala to query tables managed by Hive, whether they are stored in HDFS or HBase. Make sure that Impala is configured to access the Hive metastore correctly and you should be ready to go. Keep in mind that impalad, by default, runs as the impala user, so you might need to adjust some file permissions depending on how strict your permissions are currently.
See Using Impala to Query HBase Tables for details about querying data in HBase.
Is Hive an Impala requirement?
The Hive metastore service is a requirement. Impala shares the same metastore database as Hive, allowing Impala and Hive to access the same tables transparently.
Hive itself is optional, and does not need to be installed on the same nodes as Impala. Currently, Impala supports a wider variety of read (query) operations than write (insert) operations; you use Hive to insert data into tables that use certain file formats. See How Impala Works with Hadoop File Formats for details.
How do I?
How do I configure Hadoop High Availability (HA) for Impala?
For instructions, see the topic on Upgrading the Hive Metastore to use HDFS HA in the CDH4 High Availability Guide.
Is Impala production ready?
Impala has finished its beta release cycle, and the 1.0 GA release is production ready. The 1.1 release includes additional security features for authorization, an important requirement for production use in many organizations. Some Cloudera customers are already using Impala for large workloads.
The latest Impala 1.2.0 release is currently in beta, because it uses some features only available in the beta release of CDH 5.
Are there single points of failure in Impala? What happens if there is an error?
The short answer: no, there is no single point of failure in Impala. All Impala daemons are fully able to handle incoming queries. If a machine fails, however, all queries with fragments running on that machine fail. Because queries are expected to return quickly, you can just rerun the query if there is a failure. See Impala Concepts and Architecture for details about the Impala architecture.
The longer answer: Impala must be able to connect to the Hive metastore. Impala aggressively caches metadata, so the metastore host should experience minimal load. Impala relies on the HDFS NameNode, and, in CDH4, you can configure HA for HDFS. Impala also has a centralized soft-state service, known as the statestore, that runs on one host only. Impala continues to execute queries if the statestore host is down, but it will not receive state updates. For example, if a host is added to the cluster while the statestore host is down, the existing instances of impalad running on the other hosts will not find out about the new host. Once the statestore process is restarted, all the information it serves is automatically reconstructed from the running Impala daemons.
What is the maximum number of rows in a table?
There is no defined maximum. Some customers have used Impala to query a table with over a trillion rows.
Can Impala and MapReduce jobs run on the same cluster without resource contention?
Yes. See Controlling Resource Usage for how to control Impala resource usage using the Linux cgroup mechanism, and Using Resource Management with Impala [CDH 5 Only] for how to use Impala with the YARN resource management framework. Impala is designed to run on the DataNode hosts. Any contention depends mostly on the cluster setup and workload.
On which hosts does Impala run? Does it run on every DataNode in a cluster?
Running Impala on each DataNode is strongly recommended for good performance, although it is not a hard requirement. Impala schedules query fragments on all hosts holding data relevant to the query, if possible. A setup where some data has no Impala daemon running on any of its replica hosts incurs a severe performance penalty. See Impala Concepts and Architecture for details about the Impala architecture.
How are joins performed in Impala?
The order in which tables are joined is the same order in which tables appear in the SELECT statement's FROM clause. That is, there is no join order optimization taking place at the moment. It is usually optimal for the smallest table to appear as the right-most table in a JOIN clause. Impala chooses between two techniques for join queries, known as "broadcast joins" and "partitioned joins". See Joins for syntax details and Join Queries for performance considerations.
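Because there is no join-order optimization, the way you write the FROM clause matters. A minimal sketch with hypothetical table names, placing the smallest table right-most:

```sql
-- big_table is the largest input; small_lookup is the smallest.
-- Listing small_lookup right-most is usually the optimal order.
SELECT big.id, small.description
FROM big_table big
JOIN small_lookup small ON big.code = small.code;
```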
What is the size limit for joins?
Impala utilizes multiple strategies to allow joins between tables and result sets of various sizes. When joining a large table with a small one, the data from the small table is transmitted to each node for intermediate processing. When joining two large tables, the data from one of the tables is divided into pieces, and each node processes only selected pieces. See Joins for details about join processing, and Hints for how to fine-tune the join strategy.
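The two strategies map to the [BROADCAST] and [SHUFFLE] hints described in Hints; a sketch with illustrative table names:

```sql
-- Broadcast join: the entire right-hand table is sent to every node.
-- Appropriate when the right-hand table is small.
SELECT *
FROM big_table b JOIN [BROADCAST] small_table s ON b.id = s.id;

-- Partitioned join: each table is divided into pieces, and each node
-- processes only selected pieces. Appropriate when both tables are large.
SELECT *
FROM big_table b JOIN [SHUFFLE] other_big_table o ON b.id = o.id;
```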
What is Impala's aggregation strategy?
Impala currently only supports in-memory hash aggregation.
How is Impala metadata managed?
Impala uses two pieces of metadata: the catalog information from the Hive metastore and the file metadata from the NameNode. Currently, this metadata is lazily populated and cached when an impalad needs it to plan a query.
The REFRESH statement updates the metadata for a particular table after loading new data through Hive. The INVALIDATE METADATA statement refreshes all metadata, so that Impala recognizes new tables or other DDL and DML changes performed through Hive.
In Impala 1.2 and higher, a dedicated catalogd daemon broadcasts metadata changes due to Impala DDL or DML statements to all nodes, reducing or eliminating the need to use the REFRESH and INVALIDATE METADATA statements.
What load do concurrent queries produce on the NameNode?
The load Impala generates is very similar to that of MapReduce. Impala contacts the NameNode during the planning phase to get the file metadata (this happens only on the host where the query was submitted). Every impalad reads files as part of normal query processing.
How does Impala achieve its performance improvements?
These are the main factors in the performance of Impala versus that of other Hadoop components and related technologies.
Impala avoids MapReduce. While MapReduce is a great general parallel processing model with many benefits, it is not designed to execute SQL. Impala avoids the inefficiencies of MapReduce in these ways:
- Impala does not materialize intermediate results to disk. SQL queries often map to multiple MapReduce jobs with all intermediate data sets written to disk.
- Impala avoids MapReduce start-up time. For interactive queries, the MapReduce start-up time becomes very noticeable. Impala runs as a service and essentially has no start-up time.
- Impala can more naturally disperse query plans instead of having to fit them into a pipeline of map and reduce jobs. This enables Impala to parallelize multiple stages of a query and avoid overheads such as sort and shuffle when unnecessary.
Impala uses a more efficient execution engine by taking advantage of modern hardware and technologies:
- Impala generates runtime code. Impala uses LLVM to generate assembly code for the query that is being run. Individual queries do not have to pay the overhead of running on a system that needs to be able to execute arbitrary queries.
- Impala uses available hardware instructions when possible. Impala uses the latest set of SSE (SSE4.2) instructions which can offer tremendous speedups in some cases.
- Impala uses better I/O scheduling. Impala is aware of the disk location of blocks and is able to schedule the order to process blocks to keep all disks busy.
- Impala is designed for performance. A lot of time has been spent in designing Impala with sound performance-oriented fundamentals, such as tight inner loops, inlined function calls, minimal branching, better use of cache, and minimal memory usage.
What happens when the data set exceeds available memory?
Currently, if the memory required to process intermediate results on a node exceeds the amount available to Impala on that node, the query is cancelled. You can adjust the memory available to Impala on each node, and you can fine-tune the join strategy to reduce the memory required for the biggest queries. We do plan to support external (disk-based) joins and sorting in the future.
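One way to adjust the memory available per node is the MEM_LIMIT query option; this sketch assumes it is set through impala-shell, and the value shown is illustrative:

```sql
-- In impala-shell, cap the memory subsequent queries can use on each
-- node (value in bytes; illustrative):
SET MEM_LIMIT=2000000000;
```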
Keep in mind, though, that memory usage is not directly based on the input data set size. For aggregations, memory usage is proportional to the number of rows after grouping. For joins, memory usage is the combined size of the tables, excluding the biggest table.