Upgrading Host Operating Systems in a CDH Cluster (version 5.12)

This topic describes the steps to upgrade the operating system (OS) on hosts in an active CDH cluster.

Prerequisites

  1. If you are upgrading CDH or Cloudera Manager as well as the OS, upgrade the OS first.
  2. Read the release notes for your specific CDH version to confirm that the new OS version is supported. In most cases, upgrading to a minor version of the OS is supported. Also read the release notes for the new OS release to check for new default settings or changes in behavior (for example, Transparent Huge Pages is on by default in some versions of RHEL 6).
  3. If any files have a replication factor lower than the global default, verify that bringing down any single node does not make those files unavailable.

Upgrading Hosts

For each host in the cluster, do the following:

  1. If the host runs a DataNode service, determine whether it needs to be decommissioned. See Deciding Whether to Decommission DataNodes.
  2. Stop all Hadoop-related services on the host.
  3. Take the host offline (for example, switch to single-user mode or restart the host to boot off the network).
  4. Upgrade the OS partition, leaving the data partitions (for example, dfs.data.dir) unchanged.
  5. Bring the host back online.
  6. If you decommissioned the host, recommission it.
  7. Verify in the NameNode UI (or Cloudera Manager) that the host is healthy and all services are running.
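
The ordering of the steps above can be sketched as a small planning function. This is only an illustration: the `Host` type, its fields, and the step labels are hypothetical names introduced here, not part of CDH or Cloudera Manager, and the function returns a checklist rather than performing any action.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    # Hypothetical description of a cluster host, for illustration only.
    name: str
    runs_datanode: bool = False
    decommission: bool = False  # outcome of the decommission decision


def upgrade_plan(host: Host) -> List[str]:
    """Return the ordered actions for one host, mirroring the numbered steps."""
    needs_decommission = host.runs_datanode and host.decommission
    steps: List[str] = []
    if needs_decommission:
        steps.append("decommission")
    steps += [
        "stop services",
        "take offline",
        "upgrade OS partition",  # data partitions are left unchanged
        "bring online",
    ]
    if needs_decommission:
        steps.append("recommission")
    steps.append("verify health")
    return steps
```

A host that does not run a DataNode, or whose DataNode is not being decommissioned, skips the first and second-to-last actions.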

Upgrading Hosts With High Availability Enabled

If you have enabled high availability for the NameNode or JobTracker, follow this procedure:

  1. Stop the backup NameNode or JobTracker.
  2. Upgrade the OS partition on that host.
  3. Start the services and ensure that they are running properly.
  4. Fail over to the backup NameNode or JobTracker.
  5. Upgrade the OS partition on the primary NameNode or JobTracker host.
  6. Bring the primary NameNode or JobTracker back online.
  7. Reverse the failover.

Upgrading Hosts Without High Availability Enabled

If you have not enabled high availability, upgrading a primary host causes an outage for that service. The upgrade procedure is the same as for a secondary host, except that you must decommission and recommission the host.

When upgrading hosts that are part of a ZooKeeper quorum, ensure that a majority of the quorum remains available. Cloudera recommends that you upgrade only one host at a time.
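
The quorum requirement comes down to simple arithmetic: a strict majority of the ensemble must stay up. The following is a minimal sketch of that check; the function name and parameters are hypothetical and nothing here queries a live ZooKeeper ensemble.

```python
def quorum_survives(ensemble_size: int, hosts_down: int) -> bool:
    """Return True if a strict majority of ZooKeeper servers remains available."""
    if ensemble_size <= 0 or hosts_down < 0:
        raise ValueError("ensemble_size must be positive and hosts_down non-negative")
    majority = ensemble_size // 2 + 1
    return ensemble_size - hosts_down >= majority
```

For example, a three-server ensemble tolerates one host being offline but not two, which is why upgrading one host at a time is the safe default.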

Deciding Whether to Decommission DataNodes

When a DataNode is decommissioned, the NameNode ensures that every block from the DataNode is still available across the cluster as specified by the replication factor. This process involves copying blocks off the DataNode in small batches. When a DataNode has several thousand blocks, decommissioning can take several hours.

When a DataNode is turned off without being decommissioned:
  • The NameNode marks the DataNode as dead after a default of 10 minutes 30 seconds (controlled by dfs.heartbeat.interval and dfs.heartbeat.recheck.interval).
  • The NameNode schedules the missing replicas to be placed on other DataNodes.
  • When the DataNode comes back online and reports to the NameNode, the NameNode schedules blocks to be copied to it while other nodes are decommissioned or when new files are written to HDFS.
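
The 10 minute 30 second default follows from how HDFS combines the two intervals. The sketch below assumes the commonly documented formula (twice the recheck interval plus ten heartbeat intervals) and default values of 3 seconds and 300,000 milliseconds; neither the formula nor those defaults is stated in this topic, so confirm them against your cluster's configuration.

```python
def dead_node_timeout_seconds(
    heartbeat_interval_s: int = 3,       # assumed default heartbeat interval
    recheck_interval_ms: int = 300_000,  # assumed default recheck interval
) -> int:
    """Seconds of missed heartbeats before the NameNode marks a DataNode dead."""
    timeout_ms = 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s
    return timeout_ms // 1000


minutes, seconds = divmod(dead_node_timeout_seconds(), 60)
print(f"{minutes}m {seconds}s")  # prints "10m 30s"
```

If the host is back online before this timeout elapses, the NameNode never marks the DataNode as dead and does not schedule replacement replicas.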

If the OS upgrade procedure is quick (for example, under 30 minutes per node), do not decommission the DataNode.

Speed up the decommissioning of a DataNode by increasing values for these properties:
  • dfs.max-repl-streams: The number of simultaneous streams used to copy data.
  • dfs.balance.bandwidthPerSec: The maximum amount of bandwidth that each DataNode can utilize for balancing, in bytes per second.
  • dfs.namenode.replication.work.multiplier.per.iteration: A NameNode setting that requires a restart to take effect. It defaults to 2 but can be raised to 10 or higher.

    This determines the total number of block transfers to begin in parallel at a DataNode for replication when the NameNode sends such a command list over a DataNode heartbeat. The actual number is obtained by multiplying this value by the total number of live nodes in the cluster. The resulting number is the number of blocks to transfer immediately, per DataNode heartbeat.
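
The multiplication described above can be worked through with a small helper. This is only an arithmetic sketch; the function name and the 50-node cluster size are hypothetical.

```python
def blocks_scheduled_per_heartbeat(multiplier: int, live_datanodes: int) -> int:
    """Number of block transfers started per heartbeat: multiplier x live nodes."""
    if multiplier < 1 or live_datanodes < 0:
        raise ValueError("multiplier must be >= 1 and live_datanodes >= 0")
    return multiplier * live_datanodes


# Hypothetical 50-node cluster: default multiplier versus a raised value.
default_rate = blocks_scheduled_per_heartbeat(2, 50)   # 100
raised_rate = blocks_scheduled_per_heartbeat(10, 50)   # 500
```

Raising the multiplier from 2 to 10 in this example increases the number of transfers started per heartbeat fivefold, at the cost of more replication traffic on the cluster.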

For more information, see Decommissioning and Recommissioning Hosts.