Upgrading Host Operating Systems in a CDH Cluster
This topic describes the steps to upgrade the operating system (OS) on hosts in an active CDH cluster.
- If you are upgrading the OS as well as CDH/CM, then upgrade the OS first.
- Read the release notes for your specific CDH version to confirm that the new OS version is supported. An upgrade to a minor OS version is normally fine. Also read the release notes for the new OS release to check for new defaults or changes in behavior (for example, Transparent Huge Pages is on by default in some versions of Red Hat 6).
- Check whether any files have a replication factor lower than the global default, and verify that bringing down any single node does not make any file unavailable.
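The availability check in the last item can be sketched as follows. This is a minimal illustration, not a Cloudera tool: it assumes you have already collected a mapping from block ID to the hosts holding a replica (for example, by parsing an fsck report), and the function name, block IDs, and host names are all hypothetical.

```python
from typing import Dict, List, Set


def blocks_lost_if_down(block_hosts: Dict[str, Set[str]], host: str) -> List[str]:
    """Return the blocks whose only replicas live on `host`.

    `block_hosts` is a hypothetical mapping of block ID -> hosts holding a
    replica. How it is collected is out of scope for this sketch.
    """
    return sorted(
        block
        for block, hosts in block_hosts.items()
        if hosts and hosts <= {host}  # every replica is on the given host
    )


# Illustrative data only: blk_2 has a single replica, on host-a.
replicas = {
    "blk_1": {"host-a", "host-b", "host-c"},
    "blk_2": {"host-a"},
    "blk_3": {"host-b", "host-c"},
}
at_risk = blocks_lost_if_down(replicas, "host-a")
```

Any block returned here belongs to a file that would become unavailable while that host is down, so raise the file's replication factor or decommission the host before upgrading it.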
For each host in the cluster, do the following:
- If the host runs a DataNode service, consider whether it needs to be decommissioned. See Deciding Whether to Decommission DataNodes.
- Stop all Hadoop related services on the host.
- Take the host offline (for example, switch to single user mode or restart the host to boot off the network).
- Upgrade the OS partition, leaving the data partitions (for example, dfs.data.dir) alone.
- Bring the host back online.
- If the host is decommissioned, recommission it.
- Verify in the NameNode UI (or Cloudera Manager) that the host is healthy and all services are running.
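The per-host procedure above amounts to a rolling loop: finish and verify one host before touching the next. The sketch below shows only that control flow. The step functions and the health check are placeholders you would supply yourself; none of the names are part of CDH or Cloudera Manager.

```python
from typing import Callable, Iterable, List

Step = Callable[[str], None]


def upgrade_hosts_one_at_a_time(
    hosts: Iterable[str],
    steps: List[Step],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Run every step on one host, verify it, then move to the next host.

    Stops at the first host that fails verification so that a problem is
    not repeated across the cluster. Returns the hosts that completed.
    """
    done: List[str] = []
    for host in hosts:
        for step in steps:
            step(host)
        if not is_healthy(host):
            raise RuntimeError(f"{host} is unhealthy after upgrade; stopping")
        done.append(host)
    return done


# Usage with placeholder steps that only record what they would do.
log: List[str] = []
placeholder_steps: List[Step] = [
    lambda h: log.append(f"stop services on {h}"),
    lambda h: log.append(f"upgrade OS partition on {h}"),
    lambda h: log.append(f"bring {h} back online"),
]
completed = upgrade_hosts_one_at_a_time(
    ["dn1", "dn2"], placeholder_steps, is_healthy=lambda h: True
)
```

Stopping on the first unhealthy host is a design choice for this sketch: it keeps a bad OS image or configuration change from spreading to the rest of the cluster.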
Upgrading Hosts With High Availability Enabled
If you have enabled high availability for the NameNode or JobTracker, follow this procedure:
- Stop the backup NameNode or JobTracker.
- Upgrade the OS partition.
- Start the services and ensure they are running properly.
- Fail over to the backup NameNode or JobTracker.
- Upgrade the primary NameNode or JobTracker.
- Bring the primary NameNode or JobTracker back online.
- Reverse the failover.
Upgrading Hosts Without High Availability Enabled
If you have not enabled high availability, upgrading a primary host causes an outage for that service. The procedure to upgrade is the same as upgrading a secondary host, except that you must decommission and recommission the host. When upgrading hosts that are part of a ZooKeeper quorum, ensure that a majority of the quorum remains available. Cloudera recommends that you upgrade only one host at a time.
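The quorum requirement is simple arithmetic: more than half of the ZooKeeper servers must stay up. A minimal sketch, with a hypothetical function name:

```python
def quorum_survives(ensemble_size: int, servers_down: int) -> bool:
    """Return True if a strict majority of the ensemble is still up."""
    if ensemble_size < 1 or not 0 <= servers_down <= ensemble_size:
        raise ValueError("servers_down must be between 0 and ensemble_size")
    servers_up = ensemble_size - servers_down
    return 2 * servers_up > ensemble_size


# A 3-server ensemble tolerates one server being offline, but not two.
one_down_ok = quorum_survives(3, 1)
two_down_ok = quorum_survives(3, 2)
```

With three servers, taking one host offline for an upgrade keeps the quorum, which is one reason to upgrade a single host at a time.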
Deciding Whether to Decommission DataNodes
When a DataNode is decommissioned, the NameNode ensures that every block from the DataNode is still available across the cluster as dictated by the replication factor. This procedure involves copying blocks off the DataNode in small batches. If a DataNode holds several thousand blocks, decommissioning can take several hours.
If you instead take a DataNode offline without decommissioning it, the following happens:
- The NameNode marks the DataNode as dead after a default of 10 minutes 30 seconds (controlled by dfs.heartbeat.interval and dfs.heartbeat.recheck.interval).
- Soon after, the NameNode schedules the missing replicas to be placed on other DataNodes. This cannot be avoided.
- When the DataNode comes back online and reports to the NameNode, the NameNode schedules blocks to be copied to it while other nodes are decommissioned or when new files are written to HDFS.
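The 10m30s default mentioned above can be reproduced from the two heartbeat settings. The sketch below assumes the usual HDFS rule of twice the recheck interval plus ten heartbeat intervals, with stock defaults of 3 seconds and 5 minutes. Those defaults come from general HDFS behavior rather than from this topic, so confirm them against your release.

```python
def dead_node_timeout_seconds(
    heartbeat_interval_s: float = 3.0,   # assumed default heartbeat interval
    recheck_interval_ms: int = 300_000,  # assumed default recheck interval (5 min)
) -> float:
    """Seconds without a heartbeat before the NameNode marks a DataNode dead."""
    return 2 * (recheck_interval_ms / 1000) + 10 * heartbeat_interval_s


timeout = dead_node_timeout_seconds()
minutes, seconds = divmod(int(timeout), 60)  # 630 s -> 10 min 30 s
```

If a host can be upgraded and back online within this window, the NameNode never marks it dead and no re-replication is triggered.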
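If the OS upgrade procedure is quick (for example, under 30 minutes per node), do not decommission the DataNode.
If you do decommission, you can speed up the process by increasing the values of the following properties: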
- dfs.max-repl-streams: The number of simultaneous streams to copy data.
- dfs.balance.bandwidthPerSec: The maximum bandwidth, in bytes per second, that each DataNode can use for balancing.
- dfs.namenode.replication.work.multiplier.per.iteration: A NameNode property that requires a restart to take effect. It defaults to 2 but can be raised to 10 or higher.
For more information, see Decommissioning and Recommissioning Hosts.