Work Preserving Recovery for YARN Components

CDH 5.2 introduces work preserving recovery for the YARN ResourceManager and NodeManager. With work preserving recovery enabled, if a ResourceManager or NodeManager restarts, no in-flight work is lost. You can configure work preserving recovery separately for a ResourceManager or NodeManager.

Prerequisites

To use work preserving recovery for the ResourceManager, you need to first enable High Availability for the ResourceManager. See YARN (MRv2) ResourceManager High Availability for more information.

Configuring Work Preserving Recovery Using Cloudera Manager

Use this procedure if you manage your CDH cluster using Cloudera Manager. Otherwise, see Configuring Work Preserving Recovery Using the Command Line.

If you use Cloudera Manager and you enable YARN (MRv2) ResourceManager High Availability, work preserving recovery is enabled by default for the ResourceManager . To disable the feature for the ResourceManager, change the value of yarn.resourcemanager.work-preserving-recovery.enabled to false in the yarn-site.xml using an advanced configuration snippet.

To enable the feature for a given NodeManager, edit the advanced configuration snippet for yarn-site.xml on that NodeManager, and set the value of yarn.nodemanager.recovery.enabled to true. For a given NodeManager, you can configure the directory on the local filesystem where state information is stored when work preserving recovery is enabled, by setting the value of yarn.nodemanager.recovery.dir to a local filesystem directory, using the same advanced configuration snippet. The default value is ${hadoop.tmp.dir}/yarn-nm-recovery. This location usually points to the /tmp directory on the local filesystem. Because many operating systems do not preserve the contents of the /tmp directory across a reboot, Cloudera strongly recommends changing the location of yarn.nodemanager.recover.dir to a different directory on the local filesystem. The example below uses /home/cloudera/recovery.

Configuring Work Preserving Recovery Using the Command Line

After enabling ResourceManager High Availability, edit yarn-site.xml on the ResourceManager and all NodeManagers.
  1. Set the value of yarn.resourcemanager.work-preserving-recovery.enabled to true to enable work preserving recovery for the ResourceManager, and set the value of yarn.nodemanager.recovery.enabled to true for the NodeManager.
  2. For each NodeManager, configure the directory on the local filesystem where state information is stored when work preserving recovery is enabled, by setting the value of yarn.nodemanager.recovery.dir to a local filesystem directory. The default value is ${hadoop.tmp.dir}/yarn-nm-recovery. This location usually points to the /tmp directory on the local filesystem. Because many operating systems do not preserve the contents of the /tmp directory across a reboot, Cloudera strongly recommends changing the location of yarn.nodemanager.recover.dir to a different directory on the local filesystem. The example below uses /home/cloudera/recovery.
  3. Configure a valid RPC address for the NodeManager, by setting yarn.nodemanager.address to an address with a specific port number (such as 0.0.0.0:45454). Ephemeral ports (port 0, which is default) cannot be used for the NodeManager's RPC server, because this could cause the NodeManager to use different ports before and after a restart, preventing clients from connecting to the NodeManager after a restart. The NodeManager RPC address is also important for auxiliary services which run in a YARN cluster.

Auxiliary services should also be designed to support recoverability by reloading the previous state after a NodeManager restarts. An example auxiliary service, the ShuffleHandler service for MapReduce, follows the correct pattern for an auxiliary service which supports work preserving recovery of the NodeManager.

Example Configuration for Work Preserving Recovery

The following example configuration can be used with a Cloudera Manager advanced configuration snippet or added to yarn-site.xml directly if you do not use Cloudera Manager. Adjust the configuration to suit your environment.
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
  <description>Whether to enable work preserving recovery for the Resource Manager</description>
</property>
<property>
  <name>yarn.nodemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
  <description>Whether to enable work preserving recovery for the Node Manager</description>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/home/cloudera/recovery</value>
  <description>The location for stored state on the Node Manager, if work preserving recovery 
    is enabled.</description>
</property>