This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Using YARN Resource Management with Impala (CDH 5 Only)

You can limit the CPU and memory resources used by Impala, to manage and prioritize workloads on clusters that run jobs from many Hadoop components. (Currently, there is no limit or throttling on the I/O for Impala queries.) In CDH 5, Impala can use the underlying Apache Hadoop YARN resource management framework, which allocates the required resources for each Impala query. Impala estimates the resources required by the query on each node of the cluster, and requests the resources from YARN.

Requests from Impala to YARN go through an intermediary service called Llama (Long-Lived Application Master). When the resource requests are granted, Impala starts the query and places all relevant execution threads into the CGroup containers and sets up the memory limit on each node. If sufficient resources are not available, the Impala query waits until other jobs complete and the resources are freed.

After a query is finished, Llama caches the resources (for example, leaving memory allocated) in case they are needed for subsequent Impala queries. This caching mechanism avoids the latency involved in making a whole new set of resource requests for each query. If the resources are needed by YARN for other types of jobs, Llama returns them.

Although waiting for resources can make individual queries seem less responsive on a heavily loaded cluster, the resource management feature makes the overall performance of the cluster smoother and more predictable, avoiding sudden spikes in utilization caused by memory paging, CPUs pegged at 100%, and so on.

  Warning: In CDH 5.0.0, the Llama component is in beta. It is intended for evaluation of resource management in test environments, in combination with Impala and YARN. It is currently not recommended for production deployment.

The Llama Daemon

Llama is a system that mediates resource management between Cloudera Impala and Hadoop YARN. Llama enables Impala to reserve, use, and release resource allocations in a Hadoop cluster. Llama is only required if resource management is enabled in Impala.

By default, YARN allocates resources incrementally as MapReduce jobs need them. Impala needs all of its resources to be available at the same time, so that intermediate results can be exchanged between cluster nodes and queries do not stall partway through while waiting for new resources to be allocated. Llama is the intermediary process that ensures all requested resources are available before each Impala query actually begins.

For Llama installation instructions, see Llama installation.

For management through Cloudera Manager, see Adding the Llama Role.

Checking Resource Estimates and Actual Usage

To make resource usage easier to verify, the output of the EXPLAIN SQL statement now includes information about estimated memory usage, whether table and column statistics are available for each table, and the number of virtual cores that a query will use. You can get this information through the EXPLAIN statement without actually running the query. The extra information requires setting the query option EXPLAIN_LEVEL=verbose; see EXPLAIN Statement for details. The same extended information is shown at the start of the output from the PROFILE statement in impala-shell. The detailed profile information is only available after running the query. You can take appropriate actions (gathering statistics, adjusting query options) if you find that queries fail or run with suboptimal performance when resource management is enabled.
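
As a minimal sketch of the workflow described above, the following shell snippet writes a small query file that sets EXPLAIN_LEVEL=verbose and then issues an EXPLAIN statement, and runs the file with impala-shell -f if impala-shell is found on the PATH. The table name sales, the column region, and the file location are hypothetical placeholders, not names taken from this documentation.

```shell
#!/bin/sh
# Hypothetical scratch file for the statements; any writable path works.
QUERY_FILE="${TMPDIR:-/tmp}/explain_verbose.sql"

# SET EXPLAIN_LEVEL=verbose turns on the extended EXPLAIN output.
# The table "sales" and column "region" are placeholders for illustration.
cat > "$QUERY_FILE" <<'EOF'
SET EXPLAIN_LEVEL=verbose;
EXPLAIN SELECT region, COUNT(*) FROM sales GROUP BY region;
EOF

if command -v impala-shell >/dev/null 2>&1; then
  # -f runs the statements in the file; EXPLAIN does not execute the query.
  impala-shell -f "$QUERY_FILE"
else
  # Without impala-shell available, just show what would be submitted.
  cat "$QUERY_FILE"
fi
```

To see actual usage rather than estimates, run the query itself in an interactive impala-shell session and then issue the PROFILE command.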

How Resource Limits Are Enforced

  • CPU limits are enforced by the Linux CGroups mechanism. YARN grants resources in the form of containers that correspond to CGroups on the respective machines.
  • Memory limits are enforced by Impala's query memory limits. Once a reservation request has been granted, Impala sets the query memory limit to the granted amount of memory before executing the query.

Enabling Resource Management for Impala

To enable resource management for Impala, first set up the YARN and Llama services for your CDH cluster. Then add startup options and customize resource management settings for the Impala services.

Required CDH Setup for Resource Management with Impala

YARN is the general-purpose service that manages resources for many Hadoop components within a CDH cluster. Llama is a specialized service that acts as an intermediary between Impala and YARN, translating Impala resource requests to YARN and coordinating with Impala so that queries only begin executing when all needed resources have been granted by YARN.

For information about setting up the YARN and Llama services, see the instructions for YARN and Llama in the CDH 5 Installation Guide.

impalad Startup Options for Resource Management

The following startup options for impalad enable resource management and customize its parameters for your cluster configuration:
  • -enable_rm: Whether to enable resource management, either true or false. The default is false. None of the other resource management options has any effect unless -enable_rm is set to true.
  • -llama_host: Hostname or IP address of the Llama service that Impala should connect to. The default is 127.0.0.1.
  • -llama_port: Port of the Llama service that Impala should connect to. The default is 15000.
  • -llama_callback_port: Port that Impala should start its Llama callback service on. Llama reports when resources are granted or preempted through that service.
  • -cgroup_hierarchy_path: Path where YARN and Llama will create CGroups for granted resources. Impala assumes that the CGroup for an allocated container is created in the path 'cgroup_hierarchy_path + container_id'.
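
As an illustration of how the options above fit together, the following shell sketch assembles them into a single argument string and prints it. Only the flag names and the Llama port default of 15000 come from the list above. The host name, callback port, and CGroup path are assumed example values, and the -flag=value form is one common way of writing such options.

```shell
#!/bin/sh
# Placeholder values: adjust for your cluster. Only the flag names and the
# port 15000 come from the option list; everything else is an assumption.
LLAMA_HOST="llama01.example.com"      # hypothetical Llama host
LLAMA_PORT=15000                      # documented default
LLAMA_CALLBACK_PORT=28000             # assumed example value
CGROUP_HIERARCHY_PATH="/cgroup/cpu"   # hypothetical CGroup hierarchy path

# -enable_rm must be true, otherwise the remaining options have no effect.
IMPALAD_RM_ARGS="-enable_rm=true \
-llama_host=${LLAMA_HOST} \
-llama_port=${LLAMA_PORT} \
-llama_callback_port=${LLAMA_CALLBACK_PORT} \
-cgroup_hierarchy_path=${CGROUP_HIERARCHY_PATH}"

# Print the assembled options so they can be reviewed before use.
echo "$IMPALAD_RM_ARGS"
```

How these arguments reach impalad depends on how the daemon is started in your deployment; the sketch only builds and prints the string.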

impala-shell Query Options for Resource Management

Before issuing SQL statements through the impala-shell interpreter, you can use the SET command to configure query options related to resource management, such as the EXPLAIN_LEVEL and MEM_LIMIT options described elsewhere on this page.

Limitations of Resource Management for Impala

Currently, resource management for Impala queries in CDH 5 has the following limitations:

  • Table statistics are required, and column statistics are highly valuable, for Impala to produce accurate estimates of how much memory to request from YARN. See Table Statistics and Column Statistics for instructions on gathering both kinds of statistics, and EXPLAIN Statement for the extended EXPLAIN output where you can check that statistics are available for a specific table and set of columns.
  • If Impala estimates less memory than a query actually requires, Impala cancels the query when it exceeds the requested memory size. This can happen with some complex queries, even when table and column statistics are available. You can see the actual memory usage after a failed query by issuing a PROFILE command in impala-shell. Specify a larger memory figure with the MEM_LIMIT query option and retry the query.

    Currently, there are known bugs that could cause the maximum memory usage reported by the PROFILE command to be lower than the actual value.

  • The MEM_LIMIT query option, and the other resource-related query options, are not currently settable through the ODBC or JDBC interfaces.
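
The retry step described in the list above can be sketched as follows. This is a minimal example that assumes the MEM_LIMIT value is given in bytes; the 2 GiB figure, the table sales, the column region, and the file location are hypothetical. In practice, choose the new figure based on the memory usage that PROFILE reports for the failed query.

```shell
#!/bin/sh
# Hypothetical new limit: 2 GiB, expressed in bytes (an assumed unit).
MEM_LIMIT_BYTES=$((2 * 1024 * 1024 * 1024))
RETRY_FILE="${TMPDIR:-/tmp}/retry_with_mem_limit.sql"

# Raise the limit with SET, then reissue the query that was cancelled.
# The table "sales" and column "region" are placeholders for illustration.
cat > "$RETRY_FILE" <<EOF
SET MEM_LIMIT=${MEM_LIMIT_BYTES};
SELECT region, COUNT(*) FROM sales GROUP BY region;
EOF

if command -v impala-shell >/dev/null 2>&1; then
  # -f runs the statements in the file in order.
  impala-shell -f "$RETRY_FILE"
else
  # Without impala-shell available, just show what would be submitted.
  cat "$RETRY_FILE"
fi
```

Because these options cannot currently be set through ODBC or JDBC, the sketch goes through impala-shell.
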
Page generated September 3, 2015.