Introduction to HDFS High Availability
In a standard configuration, the NameNode is a single point of failure (SPOF) in an HDFS cluster. Each cluster has a single NameNode, and if that host or process became unavailable, the cluster as a whole is unavailable until the NameNode is either restarted or brought up on a new host. The Secondary NameNode does not provide failover capability.
- In the case of an unplanned event such as a host crash, the cluster is unavailable until an operator restarts the NameNode.
- Planned maintenance events such as software or hardware upgrades on the NameNode machine result in periods of cluster downtime.
HDFS HA addresses the above problems by providing the option of running two NameNodes in the same cluster, in an active/passive configuration. These are referred to as the active NameNode and the standby NameNode. Unlike the Secondary NameNode, the standby NameNode is hot standby, allowing a fast automatic failover to a new NameNode in the case that a host crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance. You cannot have more than two NameNodes.
Cloudera Manager 5 and CDH 5 support Quorum-based Storage as the only HA implementation. In contrast, CDH 4 supports both Quorum-based Storage and shared storage using NFS. For instructions on switching to Quorum-based storage, see Converting From an NFS-mounted Shared Edits Directory to Quorum-based Storage.
Quorum-based Storage refers to the HA implementation that uses a Quorum Journal Manager (QJM).
In order for the standby NameNode to keep its state synchronized with the active NameNode in this implementation, both nodes communicate with a group of separate daemons called JournalNodes. When any namespace modification is performed by the active NameNode, it durably logs a record of the modification to a majority of the JournalNodes. The standby NameNode is capable of reading the edits from the JournalNodes, and is constantly watching them for changes to the edit log. As the standby Node sees the edits, it applies them to its own namespace. In the event of a failover, the standby ensures that it has read all of the edits from the JournalNodes before promoting itself to the active state. This ensures that the namespace state is fully synchronized before a failover occurs.
In order to provide a fast failover, it is also necessary that the standby NameNode has up-to-date information regarding the location of blocks in the cluster. In order to achieve this, DataNodes are configured with the location of both NameNodes, and they send block location information and heartbeats to both.
Shared Storage Using NFS
In order for the standby NameNode to keep its state synchronized with the active NameNode, this implementation requires that the two nodes both have access to a directory on a shared storage device (for example, an NFS mount from a NAS).
When any namespace modification is performed by the active NameNode, it durably logs a record of the modification to an edit log file stored in the shared directory. The standby NameNode constantly watches this directory for edits, and when edits occur, the standby NameNode applies them to its own namespace. In the event of a failover, the standby will ensure that it has read all of the edits from the shared storage before promoting itself to the active state. This ensures that the namespace state is fully synchronized before a failover occurs.
In order to prevent the "split-brain scenario" in which the namespace state is diverged between the two NameNodes, an administrator must configure at least one fencing method for the shared storage. If, during a failover, it cannot be verified that the previous active NameNode has relinquished its active state, the fencing process is responsible for cutting off the previous active NameNode's access to the shared edits storage. This prevents it from making any further edits to the namespace, allowing the new active NameNode to safely proceed with failover.
Automatic Failover Implementation
Automatic failover relies on two additional components in an HDFS: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC). In Cloudera Manager, the ZKFC process maps to the HDFS Failover Controller role.
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of HDFS automatic failover relies on ZooKeeper for the following functions:
- Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.
- Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node can take a special exclusive lock in ZooKeeper indicating that it should become the next active NameNode.
The ZKFailoverController (ZKFC) is a ZooKeeper client that also monitors and manages the state of the NameNode. Each of the hosts that run a NameNode also run a ZKFC. The ZKFC is responsible for:
- Health monitoring - the ZKFC contacts its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds promptly with a healthy status, the ZKFC considers the NameNode healthy. If the NameNode has crashed, frozen, or otherwise entered an unhealthy state, the health monitor marks it as unhealthy.
- ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special lock znode. This lock uses ZooKeeper's support for "ephemeral" nodes; if the session expires, the lock node is automatically deleted.
- ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other NameNode currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has "won the election", and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.