Resource Management for Impala
You can limit the CPU and memory resources used by Impala, to manage and prioritize workloads on clusters that run jobs from many Hadoop components. (Currently, there is no limit or throttling on the I/O for Impala queries.) In CDH 5, Impala can use the underlying Apache Hadoop YARN resource management framework, which allocates the required resources for each Impala query. Impala estimates the resources required by the query on each host of the cluster, and requests the resources from YARN.
Controlling Resource Estimation Behavior
By default, Impala consults the table statistics and column statistics for each table in a query, and uses those figures to construct estimates of needed resources for each query. See COMPUTE STATS Statement for the statement to collect table and column statistics for a table.
In CDH 5.7 / Impala 2.5 and higher, the preferred way to avoid overcommitting memory in a high-concurrency or multitenant scenario is to use Impala admission control together with dynamic resource pools. You can specify a Default Query Memory Limit setting, with a different value for each pool, and Impala uses that value to calculate how many queries can safely run within a specified cluster-wide aggregate memory size. See Admission Control and Query Queuing for details.
Checking Resource Estimates and Actual Usage
To make resource usage easier to verify, the output of the EXPLAIN SQL statement now includes information about estimated memory usage, whether table and column statistics are available for each table, and the number of virtual cores that a query will use. You can get this information through the EXPLAIN statement without actually running the query. The extra information requires setting the query option EXPLAIN_LEVEL=verbose; see EXPLAIN Statement for details. The same extended information is shown at the start of the output from the PROFILE statement in impala-shell. The detailed profile information is only available after running the query. You can take appropriate actions (gathering statistics, adjusting query options) if you find that queries fail or run with suboptimal performance when resource management is enabled.
How Resource Limits Are Enforced
- CPU limits are enforced by the Linux cgroups mechanism. YARN grants resources in the form of containers that correspond to cgroups on the respective machines.
- Memory is enforced by Impala's query memory limits. Once a reservation request has been granted, Impala sets the query memory limit according to the granted amount of memory before executing the query.
Enabling Resource Management for Impala
To enable resource management for Impala, first you set up the YARN service for your CDH cluster. Then you add startup options and customize resource management settings for the Impala services.
Required CDH Setup for Resource Management with Impala
YARN is the general-purpose service that manages resources for many Hadoop components within a CDH cluster.
For information about setting up the YARN service, see the instructions for Cloudera Manager.
impalad Startup Options for Resource Management
- -enable_rm: Whether to enable resource management or not, either true or false. The default is false. None of the other resource management options have any effect unless -enable_rm is turned on.
- -cgroup_hierarchy_path: Path where YARN will create cgroups for granted resources. Impala assumes that the cgroup for an allocated container is created in the path 'cgroup_hierarchy_path + container_id'.
- -rm_always_use_defaults: If this Boolean option is enabled, Impala ignores computed estimates and always obtains the default memory and CPU allocation settings at the start of the query. These default estimates are approximately 2 CPUs and 4 GB of memory, possibly varying slightly depending on cluster size, workload, and so on. Cloudera recommends enabling -rm_always_use_defaults whenever resource management is used, and relying on these default values (that is, leaving out the two following options).
- -rm_default_memory=size: Optionally sets the default estimate for memory usage for each query. You can use suffixes such as M and G for megabytes and gigabytes, the same as with the MEM_LIMIT query option. Only has an effect when -rm_always_use_defaults is also enabled.
- -rm_default_cpu_cores: Optionally sets the default estimate for number of virtual CPU cores for each query. Only has an effect when -rm_always_use_defaults is also enabled.
impala-shell Query Options for Resource Management
Before issuing SQL statements through the impala-shell interpreter, you can use the SET command to configure the following parameters related to resource management:
Limitations of Resource Management for Impala
Currently, Impala in CDH 5 has the following limitations for resource management of Impala queries:
- Table statistics are required, and column statistics are highly valuable, for Impala to produce accurate estimates of how much memory to request from YARN. See Overview of Table Statistics and Overview of Column Statistics for instructions on gathering both kinds of statistics, and EXPLAIN Statement for the extended EXPLAIN output where you can check that statistics are available for a specific table and set of columns.
- If the Impala estimate of required memory is lower than is actually required for a query, Impala dynamically expands the amount of requested memory. Queries might still be cancelled if the reservation expansion fails, for example if there are insufficient remaining resources for that pool, or the expansion request takes long enough that it exceeds the query timeout interval, or because of YARN preemption. You can see the actual memory usage after a failed query by issuing a PROFILE command in impala-shell. Specify a larger memory figure with the MEM_LIMIT query option and re-try the query.
The MEM_LIMIT query option, and the other resource-related query options, are settable through the ODBC or JDBC interfaces in Impala 2.0 and higher. This is a former limitation that is now lifted.