Tuning Impala for Performance
The following sections explain the factors affecting the performance of Impala features, and procedures for tuning, monitoring, and benchmarking Impala queries and other SQL operations.
This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance: it means that performance remains high as the system workload increases. For example, reducing the disk I/O performed by a query can speed up an individual query, and at the same time improve scalability by making it practical to run more queries simultaneously. Sometimes, an optimization technique improves scalability more than performance. For example, reducing memory usage for a query might not change the query performance much, but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time without running out of memory.
Before starting any performance tuning or benchmarking, make sure your system meets the recommended minimum hardware requirements from Hardware Requirements and uses the software settings from Post-Installation Configuration for Impala.
- Partitioning. This technique physically divides the data based on the different values in frequently queried columns, allowing queries to skip reading a large percentage of the data in a table.
- Performance Considerations for Join Queries. Joins are the main class of queries that you can tune at the SQL level, as opposed to changing physical factors such as the file format or the hardware configuration. The related topics Column Statistics and Table Statistics matter primarily for join performance.
- Table Statistics and Column Statistics. Gathering table and column statistics with the COMPUTE STATS statement helps Impala automatically optimize join queries, without requiring changes to SQL query statements. (This process is greatly simplified in Impala 1.2.2 and higher, because the COMPUTE STATS statement gathers both kinds of statistics in one operation, and does not require the setup and configuration that was previously necessary for the ANALYZE TABLE statement in Hive.)
- Testing Impala Performance. Do some post-setup testing to ensure Impala is using optimal settings for performance, before conducting any benchmark tests.
- Benchmarking Impala Queries. The configuration and sample data that you use for initial experiments with Impala are often not appropriate for doing performance tests.
- Controlling Resource Usage. The more memory Impala can utilize, the better query performance you can expect. In a cluster running other kinds of workloads as well, you must make tradeoffs to make sure all Hadoop components have enough memory to perform well, so you might cap the memory that Impala can use.
- Impala Performance Guidelines and Best Practices
- Performance Considerations for Join Queries
- How Impala Uses Statistics for Query Optimization
- Benchmarking Impala Queries
- Controlling Resource Usage
- Using HDFS Caching with Impala (CDH 5.1 or later only)
- Testing Impala Performance
- Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles
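
As a rough illustration of how several of the techniques listed above fit together, the following sketch creates a partitioned table, gathers statistics, caps query memory, and inspects a join plan. The table names, column names, and the memory value are hypothetical and chosen only for illustration; the detailed topics above describe the options that apply to your Impala version.

```sql
-- Hypothetical tables: a large fact table partitioned by year,
-- and a small lookup table to join against.
CREATE TABLE sales (id BIGINT, customer_id BIGINT, amount DOUBLE)
  PARTITIONED BY (year INT);

CREATE TABLE customers (customer_id BIGINT, region STRING);

-- Gather table and column statistics in one operation
-- (Impala 1.2.2 and higher).
COMPUTE STATS sales;
COMPUTE STATS customers;

-- Check what the planner now knows about each table.
SHOW TABLE STATS sales;
SHOW COLUMN STATS customers;

-- Illustrative cap on per-node memory for queries in this session,
-- in bytes (roughly 2 GB); pick a value suited to your cluster.
SET MEM_LIMIT=2000000000;

-- View the plan without running the query. The filter on the
-- partition key column (year) lets Impala skip other partitions.
EXPLAIN
SELECT c.region, SUM(s.amount) AS total
FROM sales s JOIN customers c ON s.customer_id = c.customer_id
WHERE s.year = 2013
GROUP BY c.region;
```

Running EXPLAIN before and after COMPUTE STATS is one way to see whether the statistics changed the join strategy that Impala chooses.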