This is the documentation for CDH 4.6.0.
Documentation for other versions is available at Cloudera Documentation.

Introduction to HDFS High Availability

Overview

This guide provides an overview of the HDFS High Availability (HA) feature and how to configure and manage an HA HDFS cluster.

This document assumes that the reader has a general understanding of components and node types in an HDFS cluster. For details, see the Apache HDFS Architecture Guide.

Background

Before CDH4, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine. The Secondary NameNode did not provide failover capability.

This reduced the total availability of the HDFS cluster in two major ways:

  1. In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
  2. Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in periods of cluster downtime.

The HDFS HA feature addresses the above problems by providing the option of running two NameNodes in the same cluster, in an Active/Passive configuration. These are referred to as the Active NameNode and the Standby NameNode. Unlike the Secondary NameNode, the Standby NameNode is hot standby, allowing a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance. You cannot have more than two NameNodes.

Architecture

In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

CDH4 supports two major implementations:

  Important: Recommendation

Cloudera recommends Quorum-based Storage as the more robust solution. Shared storage using NFS is supported only in CDH4, not in CDH5. For instructions on switching to Quorum-based storage, see Switching from Shared Storage using NFS to Quorum-based Storage.

Quorum-based Storage

Quorum-based Storage refers to the HA implementation that uses Quorum Journal Manager (QJM).

In order for the Standby node to keep its state synchronized with the Active node in this implementation, both nodes communicate with a group of separate daemons called JournalNodes. When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JournalNodes. The Standby node is capable of reading the edits from the JournalNodes, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

In order to provide a fast failover, it is also necessary that the Standby node has up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and they send block location information and heartbeats to both.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active NameNode to safely proceed with failover.

  Note: Because of this, fencing is not required in a Quorum-based Storage deployment, as it is with Shared Storage using NFS. But fencing is still useful with Quorum-based Storage; see Software Configuration for Quorum-based Storage.

Shared Storage Using NFS

In order for the Standby node to keep its state synchronized with the Active node, this implementation requires that the two nodes both have access to a directory on a shared storage device (for example, an NFS mount from a NAS).

When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby node constantly watches this directory for edits, and when edits occur, the Standby node applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

In order to provide a fast failover, it is also necessary that the Standby node has up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and they send block location information and heartbeats to both.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this and prevent the so-called "split-brain scenario," the administrator must configure at least one fencing method for the shared storage. During a failover, if it cannot be verified that the previous Active NameNode has relinquished its Active state, the fencing process is responsible for cutting off the previous Active NameNode's access to the shared edits storage. This prevents it from making any further edits to the namespace, allowing the new Active NameNode to safely proceed with failover.