This is the documentation for CDH 4.7.1.

Deploying MapReduce v2 (YARN) on a Cluster

This section describes configuration tasks for YARN clusters only, and is specifically tailored for administrators who have installed YARN from packages.

Do these tasks after you have configured HDFS:

  1. Configure properties for YARN clusters
  2. Configure YARN daemons
  3. Configure the History Server
  4. Configure the Staging Directory
  5. Deploy your custom configuration to your entire cluster
  6. Start HDFS
  7. Create the HDFS /tmp directory
  8. Create the History Directory and Set Permissions
  9. Create Log Directories
  10. Verify the HDFS File Structure
  11. Start YARN and the MapReduce JobHistory Server
  12. Create a home directory for each MapReduce user
  13. Set HADOOP_MAPRED_HOME
  14. Configure the Hadoop daemons to start at boot time
  Note: Running Services

When starting, stopping and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM), so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d directly, any environment variables you have set remain in force and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)

About MapReduce v2 (YARN)

MapReduce has undergone a complete overhaul and CDH4 now includes MapReduce 2.0 (MRv2). The fundamental idea of MRv2's YARN architecture is to split up the two primary responsibilities of the JobTracker — resource management and job scheduling/monitoring — into separate daemons: a global ResourceManager (RM) and per-application ApplicationMasters (AM). With MRv2, the ResourceManager (RM) and per-node NodeManagers (NM) form the data-computation framework. The ResourceManager service effectively replaces the functions of the JobTracker, and NodeManagers run on slave nodes instead of TaskTracker daemons. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the tasks. For details of the new architecture, see Apache Hadoop NextGen MapReduce (YARN).

  Note: Cloudera does not consider the current upstream MRv2 release stable yet, and it could change in non-backwards-compatible ways. MRv2 should not be considered production-ready; Cloudera recommends that you use MRv1 unless you have particular reasons for using MRv2.

For more information about the two implementations (MRv1 and MRv2) see the discussion under Apache Hadoop MapReduce in the "What's New in Beta 1" section of New Features.

See also Selecting Appropriate JAR files for your MRv1 and YARN Jobs.

  Important:

Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade performance and may result in an unstable cluster deployment.

Step 1: Configure Properties for YARN Clusters

  Note:

Edit these files in the custom directory you created when you copied the Hadoop configuration. When you have finished, you will push this configuration to all the nodes in the cluster; see Step 4.

mapreduce.framework.name
    Configuration file: conf/mapred-site.xml
    Description: To run YARN, you must set this property to yarn.

Sample Configuration:

mapred-site.xml:

<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
</property>

Step 2: Configure YARN daemons

If you have decided to run YARN, you must configure the following services: ResourceManager (on a dedicated host) and NodeManager (on every host where you plan to run MapReduce v2 jobs).

The following table shows the most important properties that you must configure for your cluster in yarn-site.xml:

yarn.nodemanager.aux-services
    Recommended value: mapreduce.shuffle
    Description: Shuffle service that needs to be set for MapReduce applications.

yarn.nodemanager.aux-services.mapreduce.shuffle.class
    Recommended value: org.apache.hadoop.mapred.ShuffleHandler
    Description: The exact name of the class for the shuffle service.

yarn.resourcemanager.address
    Recommended value: resourcemanager.company.com:8032
    Description: The address of the applications manager interface in the ResourceManager.

yarn.resourcemanager.scheduler.address
    Recommended value: resourcemanager.company.com:8030
    Description: The address of the scheduler interface.

yarn.resourcemanager.resource-tracker.address
    Recommended value: resourcemanager.company.com:8031
    Description: The address of the resource tracker interface.

yarn.resourcemanager.admin.address
    Recommended value: resourcemanager.company.com:8033
    Description: The address of the ResourceManager admin interface.

yarn.resourcemanager.webapp.address
    Recommended value: resourcemanager.company.com:8088
    Description: The address of the ResourceManager web application.

yarn.application.classpath
    Recommended value: $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/*, $HADOOP_COMMON_HOME/lib/*, $HADOOP_HDFS_HOME/*, $HADOOP_HDFS_HOME/lib/*, $HADOOP_MAPRED_HOME/*, $HADOOP_MAPRED_HOME/lib/*, $YARN_HOME/*, $YARN_HOME/lib/*, $HADOOP_YARN_HOME/*, $HADOOP_YARN_HOME/lib/*
    Description: Classpath for typical applications.

Next, you need to specify, create, and assign the correct permissions to the local directories where you want the YARN daemons to store data.

You specify the directories by configuring the following properties in the yarn-site.xml file on all cluster nodes:

yarn.nodemanager.local-dirs
    Specifies the URIs of the directories where the NodeManager stores its localized files. All of the files required for running a particular YARN application will be put here for the duration of the application run. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, file:///data/1/yarn/local through file:///data/N/yarn/local.

yarn.nodemanager.log-dirs
    Specifies the URIs of the directories where the NodeManager stores container log files. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, file:///data/1/yarn/logs through file:///data/N/yarn/logs.

yarn.nodemanager.remote-app-log-dir
    Specifies the URI of the directory where logs are aggregated. Set the value to either hdfs://namenode-host.company.com:8020/var/log/hadoop-yarn/apps, using the fully-qualified domain name of your NameNode host, or hdfs:/var/log/hadoop-yarn/apps. See also Step 9.

Here is an example configuration:

yarn-site.xml:

  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>resourcemanager.company.com:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager.company.com:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>resourcemanager.company.com:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>resourcemanager.company.com:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>resourcemanager.company.com:8088</value>
  </property>
  <property>
    <description>Classpath for typical applications.</description>
    <name>yarn.application.classpath</name>
    <value>
        $HADOOP_CONF_DIR,
        $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
        $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
        $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
        $YARN_HOME/*,$YARN_HOME/lib/*
    </value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
  </property>
  <property>
    <description>Where to aggregate logs</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>hdfs://namenode-host.company.com:8020/var/log/hadoop-yarn/apps</value>
  </property>

After specifying these directories in the yarn-site.xml file, you must create the directories and assign the correct file permissions to them on each node in your cluster.

In the following instructions, example local paths stand in for the values of these properties; change them to match your configuration.

To configure local storage directories for use by YARN:

  1. Create the yarn.nodemanager.local-dirs local directories:
    $ sudo mkdir -p /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
  2. Create the yarn.nodemanager.log-dirs local directories:
    $ sudo mkdir -p /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs
  3. Configure the owner of the yarn.nodemanager.local-dirs directory to be the yarn user:
    $ sudo chown -R yarn:yarn /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
  4. Configure the owner of the yarn.nodemanager.log-dirs directory to be the yarn user:
    $ sudo chown -R yarn:yarn /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs
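If your mount points follow the /data/<N> pattern shown above, the four commands can also be written as a single loop. This is only a convenience sketch and assumes exactly four mounts; adjust the numbers to your layout:

$ for i in 1 2 3 4; do sudo mkdir -p /data/$i/yarn/local /data/$i/yarn/logs; sudo chown -R yarn:yarn /data/$i/yarn/local /data/$i/yarn/logs; done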

Here is a summary of the correct owner and permissions of the local directories:

Directory                      Owner        Permissions
yarn.nodemanager.local-dirs    yarn:yarn    drwxr-xr-x
yarn.nodemanager.log-dirs      yarn:yarn    drwxr-xr-x
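To spot-check a node against this table, you can list the directories and compare the owner and mode in the output; for example:

$ ls -ld /data/1/yarn/local /data/1/yarn/logs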

Step 3: Configure the History Server

If you have decided to run YARN on your cluster instead of MRv1, you should also run the MapReduce JobHistory Server. The following table shows the most important properties that you must configure in mapred-site.xml:

mapreduce.jobhistory.address
    Recommended value: historyserver.company.com:10020
    Description: The address (host:port) of the JobHistory Server.

mapreduce.jobhistory.webapp.address
    Recommended value: historyserver.company.com:19888
    Description: The address (host:port) of the JobHistory Server web application.
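For example, based on the values above, the mapred-site.xml entries would look like the following; historyserver.company.com is a placeholder for the fully-qualified domain name of your JobHistory Server host:

<property>
 <name>mapreduce.jobhistory.address</name>
 <value>historyserver.company.com:10020</value>
</property>
<property>
 <name>mapreduce.jobhistory.webapp.address</name>
 <value>historyserver.company.com:19888</value>
</property>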

In addition, make sure proxying is enabled for the mapred user; configure the following properties in core-site.xml:

hadoop.proxyuser.mapred.groups
    Recommended value: *
    Description: Allows the mapred user to move files belonging to users in these groups.

hadoop.proxyuser.mapred.hosts
    Recommended value: *
    Description: Allows the mapred user to move files belonging to users on these hosts.
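A sample core-site.xml snippet based on the table above:

<property>
 <name>hadoop.proxyuser.mapred.groups</name>
 <value>*</value>
</property>
<property>
 <name>hadoop.proxyuser.mapred.hosts</name>
 <value>*</value>
</property>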

Step 4: Configure the Staging Directory

YARN requires a staging directory for temporary files created by running jobs. By default it creates /tmp/hadoop-yarn/staging with restrictive permissions that may prevent your users from running jobs. To forestall this, you should configure and create the staging directory yourself; in the example that follows we use /user:

  1. Configure yarn.app.mapreduce.am.staging-dir in mapred-site.xml:
    <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>/user</value>
    </property>
  2. Once HDFS is up and running, you will create this directory and a history subdirectory under it (see Step 8).

Alternatively, you can do the following:

  1. Configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir in mapred-site.xml.
  2. Create these two directories.
  3. Set permissions on mapreduce.jobhistory.intermediate-done-dir to 1777.
  4. Set permissions on mapreduce.jobhistory.done-dir to 750.

If you configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir as above, you can skip Step 8.
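A rough sketch of this alternative follows; the /user/history/intermediate and /user/history/done paths are examples only, so substitute whatever HDFS paths you want to use:

<property>
 <name>mapreduce.jobhistory.intermediate-done-dir</name>
 <!-- Example path; choose your own HDFS location -->
 <value>/user/history/intermediate</value>
</property>
<property>
 <name>mapreduce.jobhistory.done-dir</name>
 <!-- Example path; choose your own HDFS location -->
 <value>/user/history/done</value>
</property>

Then, once HDFS is up and running, create the directories and set the permissions listed above:

$ sudo -u hdfs hadoop fs -mkdir -p /user/history/intermediate /user/history/done
$ sudo -u hdfs hadoop fs -chmod 1777 /user/history/intermediate
$ sudo -u hdfs hadoop fs -chmod 750 /user/history/done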

Step 5: Deploy Your Custom Configuration to Your Entire Cluster

To deploy your configuration to your entire cluster:

  1. Push your custom directory (for example /etc/hadoop/conf.my_cluster) to each node in your cluster; for example:
    $ sudo scp -r /etc/hadoop/conf.my_cluster myuser@myCDHnode-<n>.mycompany.com:/etc/hadoop/conf.my_cluster
  2. Manually set alternatives on each node to point to that directory, as follows.

    To manually set the configuration on Red Hat-compatible systems:

    $ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50 
    $ sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

    To manually set the configuration on Ubuntu and SLES systems:

    $ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
    $ sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

    For more information on alternatives, see the update-alternatives(8) man page on Ubuntu and SLES systems, or the alternatives(8) man page on Red Hat-compatible systems.
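To confirm which configuration directory is currently active on a node, you can display the hadoop-conf alternative (use update-alternatives on Ubuntu and SLES systems); for example:

$ alternatives --display hadoop-conf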

Step 6: Start HDFS on Every Node in the Cluster

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

Step 7: Create the HDFS /tmp Directory

  Important:

If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it.

Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
  Note:

If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, authenticate first with $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab), and then run each command this user needs to execute in the form $ <command>.

Step 8: Create the history Directory and Set Permissions and Owner

This is a subdirectory of the staging directory you configured in Step 4. In this example we're using /user/history. Create it and set permissions as follows:

sudo -u hdfs hadoop fs -mkdir /user/history
sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history

Step 9: Create Log Directories

  Note:

See also Step 2.

Create the /var/log/hadoop-yarn directory and set ownership:

sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps, which is explicitly configured in yarn-site.xml.

Step 10: Verify the HDFS File Structure

$ sudo -u hdfs hadoop fs -ls -R /

You should see:

drwxrwxrwt   - hdfs supergroup          0 2012-04-19 14:31 /tmp
drwxr-xr-x   - hdfs supergroup          0 2012-05-31 10:26 /user
drwxrwxrwt   - mapred hadoop            0 2012-04-19 14:31 /user/history
drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /var
drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /var/log
drwxr-xr-x   - yarn   mapred            0 2012-05-31 15:31 /var/log/hadoop-yarn

Step 11: Start YARN and the MapReduce JobHistory Server

To start YARN, start the ResourceManager and NodeManager services:

  Note:

Make sure you always start ResourceManager before starting NodeManager services.

On the ResourceManager system:

$ sudo service hadoop-yarn-resourcemanager start

On each NodeManager system (typically the same hosts where the DataNode service runs):

$ sudo service hadoop-yarn-nodemanager start

To start the MapReduce JobHistory Server

On the MapReduce JobHistory Server system:

$ sudo service hadoop-mapreduce-historyserver start
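To confirm that the daemons came up, one simple check is the JDK's jps tool, which lists running Java processes by class name; on the appropriate hosts you should see ResourceManager, NodeManager, and JobHistoryServer respectively:

$ sudo jps

You can also browse to the ResourceManager web UI at the yarn.resourcemanager.webapp.address you configured in Step 2 (port 8088 in the examples above).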

Step 12: Create a Home Directory for each MapReduce User

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir  /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
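If you have many users, a short shell loop over their Linux usernames does the same thing; the usernames below are examples only:

# joe and jane are example usernames; substitute your own list
$ for u in joe jane; do sudo -u hdfs hadoop fs -mkdir /user/$u; sudo -u hdfs hadoop fs -chown $u /user/$u; done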

Step 13: Set HADOOP_MAPRED_HOME

For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop in a YARN installation, set the HADOOP_MAPRED_HOME environment variable as follows:

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
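Note that export only affects the current shell session. One common way to make the setting persistent is to append it to each user's shell startup file; ~/.bashrc is used here purely as an example, and your environment may use a different profile file:

$ echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce' >> ~/.bashrc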

Step 14: Configure the Hadoop Daemons to Start at Boot Time