Managing YARN

For an overview of computation frameworks, insight into their usage and restrictions, and examples of common tasks they perform, see Managing YARN (MRv2) and MapReduce (MRv1).

Continue reading:

Adding the YARN Service
Configuring Directories
Importing MapReduce Configurations to YARN
Configuring the YARN Scheduler
Dynamic Resource Management
Enable GPU Using Cloudera Manager
Create Custom Resource Using Cloudera Manager
Configuring YARN for Long-running Applications
Task Process Exit Codes

Adding the YARN Service

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

On the Home > Status tab, click to the right of the cluster name and select Add a Service. A list of service types display. You can add one type of service at a time.
Select YARN (MR2 Included) and click Continue.
Select the services on which the new service should depend. All services must depend on the same ZooKeeper service. Click Continue.

Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of the hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of hosts to which the HDFS DataNode role is assigned. You can reassign role instances.

Click a field below a role to display a dialog box containing a list of hosts. If you click a field containing multiple hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the hosts dialog box.

The following shortcuts for specifying hostname patterns are supported:

Range of hostnames (without the domain portion)

Range Definition	Matching Hosts
10.1.1.[1-4]	10.1.1.1, 10.1.1.2, 10.1.1.3, 10.1.1.4
host[1-3].company.com	host1.company.com, host2.company.com, host3.company.com
host[07-10].company.com	host07.company.com, host08.company.com, host09.company.com, host10.company.com

IP addresses
Rack name

Click the View By Host button for an overview of the role assignment by hostname ranges.

Configuring Memory Settings for YARN and MRv2

The memory configuration for YARN and MRv2 memory is important to get the best performance from your cluster. Several different settings are involved. The table below shows the default settings, as well as the settings that Cloudera recommends, for each configuration option. See Managing YARN (MRv2) and MapReduce (MRv1) for more configuration specifics; and, for detailed tuning advice with sample configurations, see Tuning YARN.

YARN and MRv2 Memory Configuration
Cloudera Manager Property Name	CDH Property Name	Default Configuration	Cloudera Tuning Guidelines
Container Memory Minimum	yarn.scheduler. minimum-allocation-mb	1 GB	0
Container Memory Maximum	yarn.scheduler. maximum-allocation-mb	64 GB	amount of memory on largest host
Container Memory Increment	yarn.scheduler. increment-allocation-mb	512 MB	Use a fairly large value, such as 128 MB
Container Memory	yarn.nodemanager. resource.memory-mb	8 GB	8 GB
Map Task Memory	`mapreduce.map.memory.mb`	1 GB	1 GB
Reduce Task Memory	`mapreduce.reduce.memory.mb`	1 GB	1 GB
Map Task Java Opts Base	`mapreduce.map.java.opts`	-Djava.net.preferIPv4Stack=true	-Djava.net.preferIPv4Stack=true -Xmx768m
Reduce Task Java Opts Base	`mapreduce.reduce.java.opts`	-Djava.net.preferIPv4Stack=true	-Djava.net.preferIPv4Stack=true -Xmx768m
ApplicationMaster Memory	yarn.app.mapreduce. am.resource.mb	1 GB	1 GB
ApplicationMaster Java Opts Base	yarn.app.mapreduce. am.command-opts	-Djava.net.preferIPv4Stack=true	-Djava.net.preferIPv4Stack=true -Xmx768m

Configuring Directories

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

Creating the Job History Directory

When adding the YARN service, the Add Service wizard automatically creates a job history directory. If you quit the Add Service wizard or it does not finish, you can create the directory outside the wizard:

Go to the YARN service.
Select Actions > Create Job History Dir.
Click Create Job History Dir again to confirm.

Creating the NodeManager Remote Application Log Directory

When adding the YARN service, the Add Service wizard automatically creates a remote application log directory. If you quit the Add Service wizard or it does not finish, you can create the directory outside the wizard:

Go to the YARN service.
Select Actions > Create NodeManager Remote Application Log Directory.
Click Create NodeManager Remote Application Log Directory again to confirm.

Importing MapReduce Configurations to YARN

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

You can import MapReduce configurations to YARN as part of the upgrade wizard. If you do not import configurations during upgrade, you can manually import the configurations at a later time:

Go to the YARN service page.
Stop the YARN service.
Select Actions > Import MapReduce Configuration. The import wizard presents a warning letting you know that it will import your configuration, restart the YARN service and its dependent services, and update the client configuration.
Click Continue to proceed. The next page indicates some additional configuration required by YARN.
Verify or modify the configurations and click Continue. The Switch Cluster to MR2 step proceeds.
When all steps have been completed, click Finish.
(Optional) Remove the MapReduce service.
1. Click the Cloudera Manager logo to return to the Home page.
2. In the MapReduce row, right-click and select Delete. Click Delete to confirm.
Recompile JARs used in MapReduce applications. For further information, see For MapReduce Programmers: Writing and Running Jobs.

Configuring the YARN Scheduler

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

The YARN service is configured by default to use the Fair Scheduler. You can change the scheduler type to FIFO or Capacity Scheduler. You can also modify the Fair Scheduler and Capacity Scheduler configuration. For further information on schedulers, see YARN (MRv2) and MapReduce (MRv1) Schedulers.

Configuring the Scheduler Type

Go to the YARN service.
Click the Configuration tab.
Select Scope > ResourceManager.
Select Category > Main.
Select a scheduler class.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.
Enter a Reason for change, and then click Save Changes to commit the changes.
Restart the YARN service.

Modifying the Scheduler Configuration

Go to the YARN service.
Click the Configuration tab.
Click the ResourceManager Default Group category.
Select Scope > ResourceManager.
Type Scheduler in the Search box.
Locate a property and modify the configuration.
To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.
Enter a Reason for change, and then click Save Changes to commit the changes.
Restart the YARN service.

Dynamic Resource Management

In addition to the static resource management available to all services, the YARN service also supports dynamic management of its static allocation. See Dynamic Resource Pools.

Enable GPU Using Cloudera Manager

The Fair Scheduler does not support Dominant Resource Calculator. The fairshare policy that the Fair Scheduler uses considers only the memory for fairShare and minShare calculation, therefore GPU devices will be allocated from a common pool.

Install NVIDIA GPU with Ubuntu Linux

The following is an example of how to setup NVIDIA drivers + NVIDIA-smi. This example is specific for Ubuntu 16.04 or later.

Add repository for the driver and the toolkit:

sudo apt-add-repository ppa:graphics-drivers/ppa -y

Update repositories:
```
sudo apt update
```
Install the NVIDIA GPU driver:
```
sudo apt install -y nvidia-415
```
Install the NVIDIA toolkit which contains the GPU management tools like NVIDIA-smi:
```
sudo apt install -y nvidia-cuda-toolkit
```
Restart the machine:
```
sudo reboot
```

Enable GPU

In Cloudera Manager, select Hosts / All Host.
Choose the hosts on which you want to enable GPU.
Click the Configuration button.
Search for cgroup.
Find and select Enable Cgroup-based Resource Management.
Click Save Changes.
Return to the Home page by clicking the Cloudera Manager logo.
Select the YARN service.
Click the Configuration tab.
Search for cgroup.
Find and enable the following options:
- Use CGroups for Resource Management
  An error message is displayed if you enable this option but have not enabled Enable Cgroup-based Resource Management in step 5.
- Always Use Linux Container Executor
Click Save Changes.
Search for gpu.
Find Enable GPU Usage and select the role groups on which you want to enable GPU as a resource.
For more information about how to create role groups, see Role Groups.
Find NodeManager GPU Devices Allowed and define the GPU devices that are managed by YARN. There are two ways to perform this step:
- Use the default value, auto, for auto detection of all GPU devices. In this case all GPU devices are managed by YARN.
- Manually define the GPU devices that are managed by YARN.
  For more information about how to define these GPU devices manually, see the YARN documentation on Using GPU on YARN.
Find NodeManager GPU Detection Executable and define the location of nvidia-smi.
By default this property has no value, which means that YARN will check the following paths to find nvidia-smi:
- /usr/bin
- /bin
- /usr/local/nvidia/bin
Click Save Changes.
Select Hosts / All Hosts.
Find the the host with the ResourceManager role and then click on the ResourceManager role.
Click the Process tab.
Open the fair-scheduler.xml and copy its content.
Return to the Home page by clicking the Cloudera Manager logo.
Select the YARN service.
Click the Configuration tab.
Search for fair scheduler.
Find Fair Scheduler XML Advanced Configuration Snippet (Safety Valves).
Paste the content you copied from the fair-scheduler.xml.
Configure the queues:
- If you want to disable GPU allocation for a queue, set maxResources to 0 for that queue :
```
<maxResources>vcores=20, memory-mb=20480, yarn.io/gpu=0</maxResources>
```
- In order to achieve a more fair GPU allocation, set schedulingPolicy to drf (Dominant Resource Fairness) for queues to which you want to allocate GPUs:
```
<schedulingPolicy>drf</schedulingPolicy>
```
Important: When Fair Scheduler XML Advanced Configuration Snippet (Safety Valves) is used, it overwrites the Dynamic Resource Pool UI.

For more information about the Allocation file format, see YARN documentation on Fair Scheduler.
Click Save changes.

Click the Stale Service Restart icon that is next to the service to invoke the cluster restart wizard.

As a result the changes in the stale configuration are displayed.

In container-executor.cfg:

[gpu]
module.enabled = true
[cgroups]
root = /var/lib/yarn-ce/cgroups
yarn-hierarchy = /hadoop-yarn

In yarn-site.xml:

<name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
<property>

 <name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
<property>
<name>yarn.nodemanager.resource-plugins</name>
<value>yarn.io/gpu</value>
</property>
<property>
<name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
<value>auto</value>
</property>
<property>

-    <value>org.apache.hadoop.yarn.server.nodemanager.util.DefaultLCEResourcesHandler</value>

+    <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>

<name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
<value>{{CGROUP_GROUP_CPU}}/hadoop-yarn</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.cgroups.mount-path</name>
<value>/var/lib/yarn-ce/cgroups</value>
</property>
<property>

In System Resources:

@@ -1,4 +1,7 @@
{"dynamic": true, "directory": null, "file": null, "tcp_listen": null, "cpu": {"shares": 1024}, "named_cpu": null, "io": null, "memory": null, "rlimits": null, "contents": null, "install": null, "named_resource": null}
{"dynamic": true, "directory": null, "file": null, "tcp_listen": null, "cpu": null, "named_cpu": null, "io": {"weight": 500}, "memory": null, "rlimits": null, "contents": null, "install": null, "named_resource": null}
{"dynamic": false, "directory": null, "file": null, "tcp_listen": null, "cpu": null, "named_cpu": null, "io": null, "memory": {"soft_limit": -1, "hard_limit": -1}, "rlimits": null, "contents": null, "install": null, "named_resource": null}
...
...

@@ -6,4 +9,7 @@
{"dynamic": false, "directory": null, "file": null, "tcp_listen": null, "cpu": {"shares": 1024}, "named_cpu": null, "io": null, "memory": null, "rlimits": null, "contents": null, "install": null, "named_resource": {"type": "cpu", "name": "hadoop-yarn", "user": "yarn", "group": "hadoop", "mode": 509, "cpu": null, "blkio": null, "memory": null}}
{"dynamic": false, "directory": null, "file": null, "tcp_listen": null, "cpu": null, "named_cpu": null, "io": null, "memory": null, "rlimits": null, "contents": null, "install": null, "named_resource": {"type": "devices", "name": "hadoop-yarn", "user": "yarn", "group": "hadoop", "mode": 509, "cpu": null, "blkio": null, "memory": null}}
{"dynamic": false, "directory": {"path": "/var/lib/yarn-ce/cgroups", "user": "yarn", "group": "hadoop", "mode": 509, "bytes_free_warning_threshhold_bytes": 0}, "file": null, "tcp_listen": null, "cpu": null, "named_cpu": null, "io": null, "memory": null, "rlimits": null, "contents": null, "install": null, "named_resource": null}

Click Restart Stale Services.
Select Re-deploy client configuration.
Click Restart Now.

Create Custom Resource Using Cloudera Manager

This section describes how to define a Resource Type other than memory, CPU or GPU. For instructions on how to enable GPU as a resource, see Enable GPU Using Cloudera Manager.

Select the YARN service in Cloudera Manager.
Click the Configuration tab.
Search for resource types.
Find Resource Types.
Click the plus icon on the right side to create a new Resource Type.
Add the following properties:
- Name: Add the name of the resource type. For more information about valid resource name, see YARN Resource Configuration.
- Description: Add description of the resource type. Optional property.
- Minimum value: Define the minimum allocation size of the resource type for a container. Its default value is 0.
- Maximum value: Define the maximum allocation size of the resource type for a container. Its default value is Long.MAX_VALUE.
- Unit: Define the default unit of the resource type.
Find Resource Allocation.
Allocate the created resource type to a Role Group by clicking the plus icon on the right side of the row of the particular Role Group.
Add value to define the size of the allocated resource to that Role Group.
Click Save Changes.
If you want to constraint the queues for the defined Resource Type, do the followings:
1. Select the Hosts / All hosts tab.
2. Find the the host with the ResourceManager role and then click on the ResourceManager role.
3. Click the Process tab.
4. Open the fair-scheduler.xml and copy its content.
5. Return to the Home page by clicking the Cloudera Manager logo.
6. Select the YARN service.
7. Click the Configuration tab.
8. Search for fair scheduler.
9. Find Fair Scheduler XML Advanced Configuration Snippet (Safety Valves).
10. Paste the content you copied from the fair-scheduler.xml.
11. Specify the custom resources using the vcores=X, memory-mb=Y format.
  
  Important: When Fair Scheduler XML Advanced Configuration Snippet (Safety Valves) is used it overwrites the Dynamic Resource Pool UI.
  
  For more information about the Allocation file format, see YARN documentation on FairScheduler.
Click Save Changes.

Click the Stale Service Restart icon that is next to the service to invoke the cluster restart wizard.

As a result the changes in the stale configuration are displays.

In yarn-conf/yarn-site.xml:

<name>yarn.resource-types</name>
<value>fpga</value>
</property>
<property>
<name>yarn.resource-types.fpga.minimum-allocation</name>
<value>0</value>
</property>
<property>
<name>yarn.resource-types.fpga.maximum-allocation</name>
<value>10</value>
</property>
<property>

In yarn-site.xml:

<name>yarn.resource-types</name>
<value>fpga</value>
</property>
<property>
<name>yarn.resource-types.fpga.minimum-allocation</name>
<value>0</value>
</property>
<property>
<name>yarn.resource-types.fpga.maximum-allocation</name>
<value>10</value>
</property>
<property>

<name>yarn.resource-types</name>
<value>fpga</value>
</property>
<property>
<name>yarn.nodemanager.resource-type.fpga</name>
<value>2</value>
</property>
<property>

Click Restart Stale Services.
Select the Re-deploy client configuration.
Click Restart Now.

Configuring YARN for Long-running Applications

On a secure cluster, long-running applications such as Spark Streaming jobs will need additional configuration since the default settings only allow the hdfs user's delegation tokens a maximum lifetime of 7 days, which is not always sufficient.

Task Process Exit Codes

All YARN tasks on the NodeManager are run in a JVM. When a task runs successfully, the exit code is 0. Exit codes of 0 are not logged, as they are the expected result. Any non-zero exit code is logged as an error. The non-zero exit code is reported by the NodeManager as an error in the child process. The NodeManager itself is not affected by the error.

The task JVM might exit with a non-zero code for multiple reasons, though there is no exhaustive list. Exit codes can be split into two categories:

Set by the JVM based on the OS signal received by the JVM
Directly set in the code

Signal-Related Exit Codes

When the OS sends a signal to the JVM, the JVM handles the signal, which could cause the JVM to exit. Not all signals cause the JVM to exit. Exit codes for OS signals have a value between 128 and 160. Logs show non-zero status codes without further explanation.

Two exit values that typically do not require investigation are 137 and 143. These values are logged when the JVM is killed by the NodeManager or the OS. The NodeManager might kill a JVM due to task preemption (if that is configured) or a speculative run. The OS might kill the JVM when the JVM exceeds system limits like CPU time. You should investigate these codes if they appear frequently, as they might indicate a misconfiguration or a structural problem with regard to resources.

Exit code 154 is used in RecoveredContainerLaunch#call to indicate containers that were lost between NodeManager restarts without an exit code being recorded. This is usually a bug, and requires investigation.

Other Exit Codes

The JVM might exit if there is an unrecoverable error while running a task. The exit code and the message logged should provide more detail. A Java stack trace might also be logged as part of the exit. These exits should be investigated further to discover a root cause.

In the case of a streaming MapReduce job, the exit code of the JVM is the same as the mapper or reducer in use. The mapper or reducer can be a shell script or Python script. This means that the underlying script dictates the exit code: in streaming jobs, you should take this into account during your investigation.

YARN (MRv2) and MapReduce (MRv1)

Managing YARN ACLs