
Upgrading to the Latest Version of CDH4

Use the instructions that follow to upgrade to the latest version of CDH4.

Step 1: Prepare the cluster for the upgrade.

  1. Shut down Hadoop services across your entire cluster by running the following command on every host in your cluster:
    $ for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done
  2. Check each host to make sure that there are no processes running as the hdfs, yarn, mapred, or httpfs users. As root, run the following command (a sketch that combines this check with the metadata backup in step 3 appears after this list):
    # ps -aef | grep java
      Important:

    When you are sure that all Hadoop services have been shut down, do the following step. It is particularly important that the NameNode service is not running so that you can make a consistent backup.

  3. Back up the HDFS metadata on the NameNode machine, as follows.
      Note:
    • Cloudera recommends backing up HDFS metadata on a regular basis, as well as before a major upgrade.
    • dfs.name.dir is deprecated but still works; dfs.namenode.name.dir is preferred. This example uses dfs.name.dir.
    1. Find the location of your dfs.name.dir (or dfs.namenode.name.dir); for example:
      $ grep -C1 dfs.name.dir /etc/hadoop/conf/hdfs-site.xml 
      <property>
        <name>dfs.name.dir</name>
        <value>/mnt/hadoop/hdfs/name</value>
      </property>
    2. Back up the directory. The path inside the <value> XML element is the path to your HDFS metadata. If you see a comma-separated list of paths, there is no need to back up all of them; they store the same data. Back up the first directory, for example, by using the following commands:
      # cd /mnt/hadoop/hdfs/name
      # tar -cvf /root/nn_backup_data.tar .
      ./ 
      ./current/
      ./current/fsimage 
      ./current/fstime 
      ./current/VERSION 
      ./current/edits 
      ./image/ 
      ./image/fsimage
        Warning:

      If you see a file containing the word lock, the NameNode is probably still running. Repeat the preceding steps from the beginning; start at Step 1 and shut down the Hadoop services.
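
The checks and backup in this step can also be combined into one small script run as root on the NameNode (the per-host process check in step 2 still needs to be run on every other host separately). The following is only a sketch, not part of the official procedure; it assumes the first dfs.name.dir is /mnt/hadoop/hdfs/name and that the backup tarball goes to /root, so adjust both paths for your cluster:

  # Run as root on the NameNode after all Hadoop services are stopped.
  # Abort if any Java process is still running as a Hadoop service user.
  if ps -aef | grep java | egrep -q '^(hdfs|yarn|mapred|httpfs) '; then
      echo "Hadoop processes are still running; stop them first." >&2
      exit 1
  fi

  NAME_DIR=/mnt/hadoop/hdfs/name   # first value of dfs.name.dir (adjust as needed)

  # A lock file (typically in_use.lock) means the NameNode is probably still running.
  if [ -e "$NAME_DIR/in_use.lock" ]; then
      echo "Found a lock file in $NAME_DIR; the NameNode may still be running." >&2
      exit 1
  fi

  # Back up the metadata directory.
  cd "$NAME_DIR" && tar -cvf /root/nn_backup_data.tar .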

Step 2: Download the CDH4 package on each of the hosts in your cluster.

Before you begin: Check whether you have the CDH4 "1-click" repository installed.

  • On Red Hat/CentOS-compatible and SLES systems:
rpm -q cdh4-repository

If you are upgrading from CDH4 Beta 1 or later, you should see:

cdh4-repository-1-0

In this case, skip to Step 3. If instead you see:

package cdh4-repository is not installed

proceed with this step.

  • On Ubuntu and Debian systems:
 dpkg -l | grep cdh4-repository

If the repository is installed, skip to Step 3; otherwise proceed with this step.
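
If you manage many hosts, you can run this check from one machine with a small ssh loop. This is only a sketch; it assumes passwordless ssh to each host and a hosts.txt file listing every host in the cluster, neither of which is part of the standard procedure:

  # Print the repository package status on each host (rpm output on Red Hat/SLES,
  # dpkg output on Ubuntu/Debian; errors from the tool the host lacks are suppressed).
  for host in $(cat hosts.txt); do
      echo "== $host =="
      ssh "$host" 'rpm -q cdh4-repository 2>/dev/null; dpkg -l 2>/dev/null | grep cdh4-repository'
  done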

If the CDH4 "1-click" repository is not already installed on each host in the cluster, follow the instructions below for that host's operating system:

Instructions for Red Hat-compatible systems

Instructions for SLES systems

Instructions for Ubuntu and Debian systems

On Red Hat-compatible systems:

  1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).

    For OS Version                       Click this Link
    Red Hat/CentOS/Oracle 5              Red Hat/CentOS/Oracle 5 link
    Red Hat/CentOS 6 (32-bit)            Red Hat/CentOS 6 link (32-bit)
    Red Hat/CentOS/Oracle 6 (64-bit)     Red Hat/CentOS/Oracle 6 link (64-bit)

  2. Install the RPM.
    For Red Hat/CentOS/Oracle 5:
    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
    For Red Hat/CentOS 6 (32-bit):
    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.i386.rpm

    For Red Hat/CentOS/Oracle 6 (64-bit):

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
  Note:

For instructions on how to add a CDH4 yum repository or build your own CDH4 yum repository, see Installing CDH4 On Red Hat-compatible systems.

On SLES systems:

  1. Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).
  2. Install the RPM:
$ sudo rpm -i cloudera-cdh-4-0.x86_64.rpm
  Note:

For instructions on how to add a repository or build your own repository, see Installing CDH4 on SLES Systems.

Now update your system package index by running:

$ sudo zypper refresh

On Ubuntu and Debian systems:

  1. Click the link for your system: this link for a Squeeze system, this link for a Lucid system, or this link for a Precise system.
  2. Install the package. Do one of the following:

    Choose Open with in the download window to use the package manager, or

    Choose Save File, save the package to a directory to which you have write access (it can be your home directory) and install it from the command line, for example:
    sudo dpkg -i cdh4-repository_1.0_all.deb
  Note:

For instructions on how to add a repository or build your own repository, see Installing CDH4 on Ubuntu Systems.

Step 3: Upgrade the packages on the appropriate hosts.

Upgrade MRv1, YARN, or both, depending on what you intend to use.

  Note:

Before installing MRv1 or YARN: (Optionally) add a repository key on each system in the cluster, if you have not already done so. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:

  • For Red Hat/CentOS/Oracle 5 systems:
$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
  • For Red Hat/CentOS/Oracle 6 systems:
$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
  • For all SLES systems:
$ sudo rpm --import http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
  • For Ubuntu Lucid systems:
$ curl -s http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
  • For Ubuntu Precise systems:
$ curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
  • For Debian Squeeze systems:
$ curl -s http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
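
After importing the key, you can optionally confirm that it is present. The following commands are a hedged example; the exact output (and the key's description) varies by distribution and key version:

  # On Red Hat/CentOS/Oracle and SLES systems: list imported GPG keys and look for Cloudera's.
  $ rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE} %{SUMMARY}\n' | grep -i cloudera

  # On Ubuntu and Debian systems: list APT keys and look for the Cloudera entry.
  $ apt-key list | grep -i cloudera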

Step 3a: If you are using MRv1, upgrade the MRv1 packages on the appropriate hosts.

Skip this step if you are using YARN exclusively. Otherwise upgrade each type of daemon package on the appropriate hosts as follows:

  1. Install and deploy ZooKeeper:
      Important:

    Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.

    Follow instructions under ZooKeeper Installation.

  2. Install each type of daemon package on the appropriate system(s), as follows. (A sketch for running the worker-host installs over ssh follows the table.)

    JobTracker host running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-jobtracker
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-0.20-mapreduce-jobtracker
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-0.20-mapreduce-jobtracker

    NameNode host running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-hdfs-namenode
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-hdfs-namenode
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-hdfs-namenode

    Secondary NameNode host (if used) running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-hdfs-secondarynamenode
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-hdfs-secondarynamenode

    All cluster hosts except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts, running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode

    All client hosts, running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-client
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-client
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-client
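
On a large cluster you may prefer to drive the worker-host installs from a single machine instead of logging in to each one. The loop below is only a sketch; it assumes passwordless ssh, sudo rights on every host, Red Hat/CentOS-compatible workers, and a workers.txt file listing the TaskTracker/DataNode hosts:

  # Upgrade the TaskTracker and DataNode packages on every worker host.
  for host in $(cat workers.txt); do
      echo "== upgrading $host =="
      ssh -t "$host" 'sudo yum clean all; sudo yum install -y hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode'
  done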

Step 3b: If you are using YARN, upgrade the YARN packages on the appropriate hosts.

Skip this step if you are using MRv1 exclusively. Otherwise upgrade each type of daemon package on the appropriate hosts as follows:

  1. Install and deploy ZooKeeper:
      Important:

    Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.

    Follow instructions under ZooKeeper Installation.

  2. Install each type of daemon package on the appropriate system(s), as follows. (A package-verification sketch follows the table and the note below.)

    Resource Manager host (analogous to MRv1 JobTracker) running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-yarn-resourcemanager
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-yarn-resourcemanager
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-yarn-resourcemanager

    NameNode host running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-hdfs-namenode
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-hdfs-namenode
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-hdfs-namenode

    Secondary NameNode host (if used) running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-hdfs-secondarynamenode
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-hdfs-secondarynamenode

    All cluster hosts except the Resource Manager (analogous to MRv1 TaskTrackers) running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce

    One host in the cluster running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver

    All client hosts, running:
      Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-client
      SLES:
        sudo zypper clean --all; sudo zypper install hadoop-client
      Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-client

      Note:

    The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically as dependencies of the other packages.
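
Once the installs finish, it can be useful to confirm which Hadoop packages (and versions) ended up on a given host. These are ordinary package-manager queries, shown here only as an example:

  # On Red Hat/CentOS/Oracle and SLES hosts:
  $ rpm -qa | grep -E '^(hadoop|zookeeper)' | sort

  # On Ubuntu and Debian hosts:
  $ dpkg -l | grep -E 'hadoop|zookeeper'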

Step 4: Upgrade the HDFS Metadata (Beta 1 or earlier)

  Note: If you are already running CDH4 Beta 2 or later, skip this step; you do not need to upgrade the HDFS metadata. Proceed with starting HDFS (Step 5).

To upgrade to the latest version of CDH4 from CDH4 Beta 1 or any earlier version of CDH, you must now upgrade the HDFS metadata on the NameNode.

  Note: Before you start

If you are using a high-availability (HA) configuration on CDH4 Beta 1, you must unconfigure HA before you can upgrade HDFS. See Upgrading an HA Configuration to the Latest Release.

  1. To upgrade the HDFS metadata, run the following command on the NameNode:
    $ sudo service hadoop-hdfs-namenode upgrade
      Note:

    The NameNode upgrade process can take a while depending on how many files you have.

    You can watch the progress of the upgrade by running:

    $ sudo tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log

    Look for a line that confirms the upgrade is complete, such as: /var/lib/hadoop-hdfs/cache/hadoop/dfs/<name> is complete

  2. Start up the DataNodes:

    On each DataNode:

    $ sudo service hadoop-hdfs-datanode start
  3. Wait for NameNode to exit safe mode, and then start the Secondary NameNode (if used) and complete the cluster upgrade.
    1. To check that the NameNode has exited safe mode, look for messages in the log file, or the NameNode's web interface, that say "...no longer in safe mode."
    2. To start the Secondary NameNode (if used), enter the following command on the Secondary NameNode host:
      $ sudo service hadoop-hdfs-secondarynamenode start
    3. To complete the cluster upgrade, follow the remaining steps below.
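
As an alternative to watching the log file or the web interface in sub-step 1, you can query safe mode directly with dfsadmin; these are standard HDFS administration commands, run on the NameNode (or any host with a configured Hadoop client):

  $ sudo -u hdfs hdfs dfsadmin -safemode get     # report whether the NameNode is in safe mode
  $ sudo -u hdfs hdfs dfsadmin -safemode wait    # block until the NameNode leaves safe mode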

Step 5: Start HDFS (Beta 2 or later)

  Note:

If you are upgrading the HDFS metadata, skip this step, and start the NameNode, DataNodes, and Secondary NameNode (if used) individually as described in the previous step.

Otherwise, run the following command on each host in the cluster:

$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

Step 5a: Verify that /tmp Exists and Has the Right Permissions

  Important:

If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it.

Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
  Note:

If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>
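
To confirm that /tmp was created with the expected mode and ownership, list the HDFS root:

  $ sudo -u hdfs hadoop fs -ls /

The /tmp entry should show mode drwxrwxrwt and owner hdfs, as in the example listing under Step 6b.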

Step 6: Start MapReduce (MRv1) or YARN

You are now ready to start and test MRv1 or YARN.

Step 6a: Start MapReduce (MRv1)

  Important:

Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade your performance and may result in an unstable MapReduce cluster deployment. Steps 6a and 6b are mutually exclusive.

After you have verified HDFS is operating correctly, you are ready to start MapReduce. On each TaskTracker system:

$ sudo service hadoop-0.20-mapreduce-tasktracker start

On the JobTracker system:

$ sudo service hadoop-0.20-mapreduce-jobtracker start

Verify that the JobTracker and TaskTracker started properly.

$ sudo jps | grep Tracker

If the permissions of directories are not configured correctly, the JobTracker and TaskTracker processes start and immediately fail. If this happens, check the JobTracker and TaskTracker logs and set the permissions correctly.
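
A quick way to see why a daemon exited is to tail its most recent log file. The paths below assume a standard package install that writes MRv1 daemon logs under /var/log/hadoop-0.20-mapreduce; adjust them if your logs are stored elsewhere:

  # On the JobTracker host:
  $ sudo tail -n 100 /var/log/hadoop-0.20-mapreduce/*jobtracker*.log

  # On a TaskTracker host:
  $ sudo tail -n 100 /var/log/hadoop-0.20-mapreduce/*tasktracker*.log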

Verify basic cluster operation for MRv1.

At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site.

Before you proceed, make sure the HADOOP_HOME environment variable is unset:

$ unset HADOOP_HOME
  Note:

To submit MapReduce jobs using MRv1 in CDH4 Beta 1, you needed either to set the HADOOP_HOME environment variable or run a launcher script.

This is no longer true in later CDH4 releases; HADOOP_HOME is now fully deprecated, and it is good practice to unset it.

  Note:

For important configuration information, see Deploying MapReduce v1 (MRv1) on a Cluster.

  1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
    $ sudo -u hdfs hadoop fs -mkdir /user/joe
    $ sudo -u hdfs hadoop fs -chown joe /user/joe

    Do the following steps as the user joe.

  2. Make a directory in HDFS called input and copy some XML files into it by running the following commands:
    $ hadoop fs -mkdir input 
    $ hadoop fs -put /etc/hadoop/conf/*.xml input 
    $ hadoop fs -ls input 
    Found 3 items: 
    -rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
    -rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml 
    -rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml
  3. Run an example Hadoop job to grep with a regular expression in your input data.
    $ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
  4. After the job completes, you can find the output in the HDFS directory named output because you specified that output directory to Hadoop.
    $ hadoop fs -ls 
    Found 2 items 
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input 
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output

    You can see that there is a new directory called output.

  5. List the output files.
    $ hadoop fs -ls output 
    Found 3 items 
    drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output/_logs 
    -rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output/part-00000 
    -rw-r--r-- 1 joe supergroup 0 2009-02-25 10:33 /user/joe/output/_SUCCESS
  6. Read the results in the output file; for example:
    $ hadoop fs -cat output/part-00000 | head 
    1 dfs.datanode.data.dir 
    1 dfs.namenode.checkpoint.dir 
    1 dfs.namenode.name.dir 
    1 dfs.replication 
    1 dfs.safemode.extension 
    1 dfs.safemode.min.datanodes

You have now confirmed your cluster is successfully running CDH4.

  Important:

If you have client hosts, make sure you also update them to CDH4, and upgrade the components running on those clients as well.

Step 6b: Start MapReduce with YARN

  Important:

Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade your performance and may result in an unstable MapReduce cluster deployment. Steps 6a and 6b are mutually exclusive.

Before deciding to deploy YARN, make sure you read the discussion under New Features.

After you have verified HDFS is operating correctly, you are ready to start YARN. First, if you have not already done so, create directories and set the correct permissions.

  Note: For more information see Deploying MapReduce v2 (YARN) on a Cluster.
Create a history directory and set permissions; for example:
$ sudo -u hdfs hadoop fs -mkdir /user/history 
$ sudo -u hdfs hadoop fs -chmod -R 1777 /user/history  
 $ sudo -u hdfs hadoop fs -chown yarn /user/history 
Create the /var/log/hadoop-yarn directory and set ownership:
$ sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn  
$ sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn 
  Note: You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps, which is explicitly configured in yarn-site.xml.

Verify the directory structure, ownership, and permissions:

$ sudo -u hdfs hadoop fs -ls -R / 
You should see:
drwxrwxrwt - hdfs supergroup 0 2012-04-19 14:31 /tmp  
drwxr-xr-x - hdfs supergroup 0 2012-05-31 10:26 /user  
drwxrwxrwt - yarn supergroup 0 2012-04-19 14:31 /user/history  
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var  
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log  
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn 

To start YARN, start the ResourceManager and NodeManager services:

  Note:

Make sure you always start ResourceManager before starting NodeManager services.

On the ResourceManager system:

$ sudo service hadoop-yarn-resourcemanager start 

On each NodeManager system (typically the same ones where DataNode service runs):

$ sudo service hadoop-yarn-nodemanager start 

To start the MapReduce JobHistory Server

On the MapReduce JobHistory Server system:

$ sudo service hadoop-mapreduce-historyserver start 

For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop in a YARN installation, set the HADOOP_MAPRED_HOME environment variable as follows:

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce 
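
The export above lasts only for the current shell session. One common way to make it persistent (an assumption, not part of the official steps; it presumes the user's shell is bash) is to append it to that user's shell profile:

  $ echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce' >> ~/.bash_profile
  $ source ~/.bash_profile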

Verify basic cluster operation for YARN.

At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site.

  Note:

For important configuration information, see Deploying MapReduce v2 (YARN) on a Cluster.

  1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
    $ sudo -u hdfs hadoop fs -mkdir /user/joe
    $ sudo -u hdfs hadoop fs -chown joe /user/joe

    Do the following steps as the user joe.

  2. Make a directory in HDFS called input and copy some XML files into it by running the following commands:
    $ hadoop fs -mkdir input 
    $ hadoop fs -put /etc/hadoop/conf/*.xml input 
    $ hadoop fs -ls input 
    Found 3 items: 
    -rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml 
    -rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml 
    -rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml
  3. Set HADOOP_MAPRED_HOME for user joe:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  4. Run an example Hadoop job to grep with a regular expression in your input data.
    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
  5. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
    $ hadoop fs -ls 
    Found 2 items 
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input 
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23

    You can see that there is a new directory called output23.

  6. List the output files:
    $ hadoop fs -ls output23 
    Found 2 items 
    -rw-r--r-- 1 joe supergroup 0 2009-02-25 10:33 /user/joe/output23/_SUCCESS 
    -rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output23/part-r-00000
  7. Read the results in the output file:
    $ hadoop fs -cat output23/part-r-00000 | head 
    1 dfs.safemode.min.datanodes 
    1 dfs.safemode.extension 
    1 dfs.replication 
    1 dfs.permissions.enabled 
    1 dfs.namenode.name.dir 
    1 dfs.namenode.checkpoint.dir 
    1 dfs.datanode.data.dir

You have now confirmed your cluster is successfully running CDH4.

  Important:

If you have client hosts, make sure you also update them to CDH4, and upgrade the components running on those clients as well.

Step 7: Set the Sticky Bit

For security reasons Cloudera strongly recommends you set the sticky bit on directories if you have not already done so.

The sticky bit prevents anyone except the superuser, directory owner, or file owner from deleting or moving the files within a directory. (Setting the sticky bit for a file has no effect.) Do this for directories such as /tmp. (For instructions on creating /tmp and setting its permissions, see these instructions).
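
For example, to check whether the sticky bit is already set on /tmp and set it if it is not (this uses the same 1777 mode as Step 5a):

  # A trailing "t" in the mode (drwxrwxrwt) means the sticky bit is already set.
  $ sudo -u hdfs hadoop fs -ls / | grep tmp

  # Set the sticky bit if it is missing.
  $ sudo -u hdfs hadoop fs -chmod 1777 /tmp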

Step 8: Upgrade Components to CDH4

  Note:
  • For important information on new and changed components, see the Release Notes. To see whether there is a new version of a particular component in CDH4, check the Version and Packaging Information.
  • Cloudera recommends that you regularly update the software on each system in the cluster (for example, on a RHEL-compatible system, regularly run yum update) to ensure that all the dependencies for any given component are up to date. (If you have not been in the habit of doing this, be aware that the command may take a while to run the first time you use it.)

To upgrade or add CDH components, see the following sections:

  • Flume. For more information, see "Upgrading Flume in CDH4" under "Flume Installation" in this guide.
  • Sqoop. For more information, see "Upgrading Sqoop to CDH4" under "Sqoop Installation" in this guide.
  • Sqoop 2. For more information, see "Sqoop 2 Installation" in this guide.
  • HCatalog. For more information, see "Installing and Using HCatalog" in this guide.
  • Hue. For more information, see "Upgrading Hue in CDH4" under "Hue Installation" in this guide.
  • Pig. For more information, see "Upgrading Pig to CDH4" under "Pig Installation" in this guide.
  • Hive. For more information, see "Upgrading Hive to CDH4" under "Hive Installation" in this guide.
  • HBase. For more information, see "Upgrading HBase to CDH4" under "HBase Installation" in this guide.
  • ZooKeeper. For more information, see "Upgrading ZooKeeper to CDH4" under "ZooKeeper Installation" in this guide.
  • Oozie. For more information, see "Upgrading Oozie to CDH4" under "Oozie Installation" in this guide.
  • Whirr. For more information, see "Upgrading Whirr to CDH4" under "Whirr Installation" in this guide.
  • Snappy. For more information, see "Upgrading Snappy to CDH4" under "Snappy Installation" in this guide.
  • Mahout. For more information, see "Upgrading Mahout to CDH4" under "Mahout Installation" in this guide.

Step 9: Apply Configuration File Changes

  Important:

During package upgrade, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave, and creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original configuration file to the new configuration file. In the case of Ubuntu and Debian upgrades, you will be prompted if you have made changes to a file for which there is a new version; for details, see Automatic handling of configuration files by dpkg.

For example, if you have modified your zoo.cfg configuration file (/etc/zookeeper/zoo.cfg), the upgrade renames and preserves a copy of your modified zoo.cfg as /etc/zookeeper/zoo.cfg.rpmsave. If you have not already done so, you should now compare this to the new /etc/zookeeper/conf/zoo.cfg, resolve differences, and make any changes that should be carried forward (typically where you have changed property value defaults). Do this for each component you upgrade.
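
On RPM-based systems, a convenient way to locate and review every preserved configuration file after the upgrade is to search for the .rpmsave copies and diff each one against its replacement. This is only a sketch; it assumes the new file was created at the same path as the old one, so where the packaging moved a file (as in the zoo.cfg example above), diff against the new location instead:

  # Find each preserved configuration file and show how it differs from the new default.
  find /etc -name '*.rpmsave' | while read saved; do
      echo "=== $saved ==="
      diff -u "${saved%.rpmsave}" "$saved"
  done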

Step 10: Finalize the HDFS Metadata Upgrade (Beta 1 or earlier)

  Note: Skip this step if you are upgrading from CDH4 Beta 2 or later.

To finalize the HDFS metadata upgrade you began earlier in this procedure, proceed as follows:

  1. Make sure you are satisfied that the CDH4 upgrade has succeeded and everything is running smoothly. This could take a matter of days, or even weeks.
      Warning:

    Do not proceed until you are sure you are satisfied with the new deployment. Once you have finalized the HDFS metadata, you cannot revert to an earlier version of HDFS.

      Note:

    If you need to restart the NameNode during this period (after having begun the upgrade process, but before you've run finalizeUpgrade) simply restart your NameNode without the -upgrade option.

  2. Finalize the HDFS metadata upgrade: use one of the following commands, depending on whether Kerberos is enabled (see Configuring Hadoop Security in CDH4).
    • If Kerberos is enabled:
      $ kinit -kt /path/to/hdfs.keytab hdfs/<fully.qualified.domain.name@YOUR-REALM.COM> && hdfs dfsadmin -finalizeUpgrade
    • If Kerberos is not enabled:
      $ sudo -u hdfs hdfs dfsadmin -finalizeUpgrade
  Note:

After the metadata upgrade completes, the previous/ and blocksBeingWritten/ directories in the DataNodes' data directories aren't cleared until the DataNodes are restarted.
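
To reclaim that space after finalizing, restart the DataNode service on each DataNode host (one at a time, so the cluster stays available):

$ sudo service hadoop-hdfs-datanode restart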