In this multipart series, fully explore the tangled ball of thread that is YARN.
YARN (Yet Another Resource Negotiator) is the resource management layer for the Apache Hadoop ecosystem. YARN has been available for several releases, but many users still have fundamental questions about what YARN is, what it’s for, and how it works. This new series of blog posts is designed with the following goals in mind:
The series comprises the following parts:
In this initial post, we’ll cover the fundamentals of YARN, which runs processes on a cluster similarly to the way an operating system runs processes on a standalone computer. Subsequent parts will be released every few weeks.
A host is the Hadoop term for a computer (also called a node, in YARN terminology). A cluster is two or more hosts connected by a high-speed local network. From the standpoint of Hadoop, there can be several thousand hosts in a cluster.
In Hadoop, there are two types of hosts in the cluster.
Figure 1: Master host and Worker hosts
Conceptually, a master host is the communication point for a client program. A master host sends the work to the rest of the cluster, which consists of worker hosts. (In Hadoop, a cluster can technically be a single host. Such a setup is typically used for debugging or simple testing, and is not recommended for a typical Hadoop workload.)
In a YARN cluster, there are two types of hosts:
- The ResourceManager is the master daemon that communicates with the client, tracks resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
- A NodeManager is a worker daemon that launches and tracks processes spawned on worker hosts.
Figure 2: Master host with ResourceManager and Worker hosts with NodeManager
The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml. The basic properties in this file used to configure YARN are covered in the later sections.
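For a concrete picture of how those properties are consumed, here is a minimal Java sketch using Hadoop's public YarnConfiguration class, which loads yarn-site.xml from the classpath. The properties read below (yarn.resourcemanager.hostname, yarn.nodemanager.resource.cpu-vcores, yarn.nodemanager.resource.memory-mb) are standard YARN properties; the values printed depend entirely on your cluster's configuration.

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ShowYarnConfig {
  public static void main(String[] args) {
    // YarnConfiguration picks up yarn-site.xml from the classpath
    // (the "well-known location" varies by distribution, e.g. /etc/hadoop/conf).
    YarnConfiguration conf = new YarnConfiguration();

    // yarn.resourcemanager.hostname: where NodeManagers and clients find the ResourceManager.
    System.out.println("ResourceManager host:    "
        + conf.get(YarnConfiguration.RM_HOSTNAME, "0.0.0.0"));

    // yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb:
    // the resources a NodeManager advertises to the ResourceManager.
    System.out.println("NodeManager vcores:      "
        + conf.getInt(YarnConfiguration.NM_VCORES, YarnConfiguration.DEFAULT_NM_VCORES));
    System.out.println("NodeManager memory (MB): "
        + conf.getInt(YarnConfiguration.NM_PMEM_MB, YarnConfiguration.DEFAULT_NM_PMEM_MB));
  }
}
```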
YARN currently defines two resources, vcores and memory. Each NodeManager tracks its own local resources and communicates its resource configuration to the ResourceManager, which keeps a running total of the cluster’s available resources. By keeping track of the total, the ResourceManager knows how to allocate resources as they are requested. (Vcore has a special meaning in YARN. You can think of it simply as a “usage share of a CPU core.” If you expect your tasks to be less CPU-intensive (sometimes called I/O-intensive), you can set the ratio of vcores to physical cores higher than 1 to maximize your use of hardware resources.)
Figure 3: ResourceManager global view of the cluster
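As a rough illustration of that vcore-to-physical-core ratio, the hypothetical helper below (not part of YARN; the method name and numbers are made up for this example) computes a value you might set for yarn.nodemanager.resource.cpu-vcores on each worker host.

```java
// Hypothetical sizing helper -- not a YARN API. It only illustrates the arithmetic
// behind over-subscribing CPU for I/O-heavy workloads.
public final class VcoreSizing {

  static int vcoresFor(int physicalCores, double vcoreToCoreRatio) {
    // A ratio above 1.0 advertises more vcores than physical cores,
    // which can keep the CPU busy when tasks spend much of their time on I/O.
    return (int) Math.floor(physicalCores * vcoreToCoreRatio);
  }

  public static void main(String[] args) {
    System.out.println(vcoresFor(16, 1.0)); // 16: CPU-bound workload, 1 vcore per core
    System.out.println(vcoresFor(16, 2.0)); // 32: I/O-heavy workload, 2 vcores per core
  }
}
```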
Containers are an important YARN concept. You can think of a container as a request to hold resources on the YARN cluster. Currently, a container hold request consists of vcore and memory, as shown in Figure 4 (left).
Figure 4: Container as a hold (left), and container as a running process (right)
Once a hold has been granted on a host, the NodeManager launches a process called a task. The right side of Figure 4 shows the task running as a process inside a container. (Part 3 will cover, in more detail, how YARN schedules a container on a particular host.)
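In code, a hold really is just those two numbers plus optional placement hints. The sketch below builds such a request with YARN's public AMRMClient.ContainerRequest class (normally done by an ApplicationMaster, a term introduced in the next section). The 2048 MB / 2 vcore sizes are arbitrary examples, and the request is only constructed here, not submitted to a ResourceManager.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// A container "hold" is a bundle of numbers: how much memory and how many vcores
// to reserve, plus optional placement preferences.
public class ContainerHoldExample {
  public static void main(String[] args) {
    Resource capability = Resource.newInstance(2048 /* MB of memory */, 2 /* vcores */);

    ContainerRequest hold = new ContainerRequest(
        capability,
        null,                     // no preferred hosts -- any worker will do
        null,                     // no preferred racks
        Priority.newInstance(0)); // relative priority among this application's requests

    System.out.println("Requesting hold: " + hold.getCapability());
  }
}
```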
For the next section, two new YARN terms need to be defined:
- An application is a client program, consisting of one or more tasks, that is run on the YARN cluster.
- For each running application, a special piece of code called an ApplicationMaster helps coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run after the application starts.
An application running tasks on a YARN cluster consists of the following steps (a minimal client-side code sketch follows the list):
1. The application starts and talks to the ResourceManager for the cluster:
Figure 5: Application starting up before tasks are assigned to the cluster
2. The ResourceManager makes a single container request on behalf of the application:
Figure 6: Application + allocated container on a cluster
3. The ApplicationMaster starts running within that container:
Figure 7: Application + ApplicationMaster running in the container on the cluster
4. The ApplicationMaster requests subsequent containers from the ResourceManager, which are allocated to run tasks for the application. Those tasks do most of their status communication with the ApplicationMaster started in step 3:
Figure 8: Application + ApplicationMaster + tasks running in multiple containers on the cluster
5. Once all tasks are finished, the ApplicationMaster exits. The last container is de-allocated from the cluster.
6. The application client exits. (The ApplicationMaster launched in a container is more specifically called a managed AM. Unmanaged ApplicationMasters run outside of YARN’s control. Llama is an example of an unmanaged AM.)
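As promised above, here is a minimal, hedged sketch of steps 1 through 3 from the client's side, using YARN's public YarnClient API. The application name, container sizes, and launch command are placeholders; a real submission would also set up local resources, environment variables, and security tokens. Step 4 is then the ApplicationMaster's job, using AMRMClient.addContainerRequest() as sketched in the containers section above.

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

import java.util.Collections;

// Sketch of steps 1-3 only; the launch command and sizes are placeholders, not a real AM.
public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration()); // finds the ResourceManager via yarn-site.xml
    yarnClient.start();

    // Step 1: the application talks to the ResourceManager and gets a new application ID.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
    context.setApplicationName("example-app");

    // Step 2: describe the single container that will host the ApplicationMaster.
    context.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcore */));
    context.setAMContainerSpec(ContainerLaunchContext.newInstance(
        null, null,
        Collections.singletonList("/path/to/launch_am.sh"), // placeholder AM launch command
        null, null, null));

    // Step 3: after submission, the ResourceManager allocates that container and the
    // NodeManager on the chosen host starts the ApplicationMaster inside it.
    yarnClient.submitApplication(context);
  }
}
```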
In the MapReduce paradigm, an application consists of Map tasks and Reduce tasks. Map tasks and Reduce tasks align very cleanly with YARN tasks.
Figure 9: Application + Map tasks + Reduce tasks
Figure 10 illustrates how the map tasks and the reduce tasks map cleanly to the YARN concept of tasks running in a cluster.
Figure 10: Merged MapReduce/YARN Application Running on a Cluster
In a MapReduce application, there are multiple map tasks, each running in a container on a worker host somewhere in the cluster. Similarly, there are multiple reduce tasks, also each running in a container on a worker host.
Simultaneously on the YARN side, the ResourceManager, NodeManager, and ApplicationMaster work together to manage the cluster’s resources and ensure that the tasks, as well as the corresponding application, finish cleanly.
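For MapReduce specifically, none of this plumbing is exposed to the job author. The sketch below (mapper, reducer, and input/output settings omitted as placeholders) shows that setting mapreduce.framework.name to yarn is enough to route a standard Job through the ResourceManager, with MRAppMaster acting as the ApplicationMaster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch only: job details (mapper, reducer, paths) are placeholders. The point is that
// a plain MapReduce job becomes a YARN application, with MRAppMaster as its ApplicationMaster.
public class MapReduceOnYarn {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.framework.name", "yarn"); // run via YARN rather than the local runner

    Job job = Job.getInstance(conf, "example-mr-job");
    // job.setMapperClass(...), job.setReducerClass(...), input/output paths, etc. omitted

    // Each map task and each reduce task ends up as a YARN task in its own container.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```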
Summarizing the important concepts presented in this section:
- A cluster is a set of hosts (nodes), typically one or more master hosts plus many worker hosts, connected by a high-speed local network.
- YARN currently defines two resources: vcores and memory. Each NodeManager tracks its own local resources, and the ResourceManager keeps a running total for the whole cluster.
- A container is a request to hold vcores and memory on a particular host; once the hold is granted, the NodeManager on that host launches a task as a process inside the container.
- Every application is coordinated by an ApplicationMaster, which itself runs in the first container allocated to the application and requests the containers for the application's remaining tasks.
Part 2 will cover calculating YARN properties for cluster configuration. In the meantime, consider this further reading:
Ray Chiang is a Software Engineer at Cloudera.
Dennis Dawson is a Senior Technical Writer at Cloudera.