Minimum Required Role: Operator (also provided by Configurator, Cluster Administrator, Full Administrator)
How NameNode Manages Blocks on a Failed DataNode
If the NameNode receives no heartbeats from a DataNode for a set period (10.5 minutes by default), it assumes that the DataNode has failed. The following describes how the NameNode manages block replication in such cases.
- NameNode determines which blocks were on the failed DataNode.
- NameNode locates other DataNodes with copies of these blocks.
- The DataNodes with block copies are instructed to copy those blocks to other DataNodes to maintain the configured replication factor.
To bring a repaired DataNode back online, follow the procedure in Replacing a Disk on a DataNode Host.
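The failure handling described above can be illustrated with a small Python model. This is a simplified sketch, not the NameNode's actual implementation. The 10.5-minute default is derived from the stock HDFS settings dfs.namenode.heartbeat.recheck-interval (300000 ms) and dfs.heartbeat.interval (3 s). The function names, the node and block names, and the least-loaded target selection are assumptions made for illustration; real HDFS also takes rack placement into account.

```python
from collections import defaultdict

# Stock HDFS defaults: dfs.namenode.heartbeat.recheck-interval = 300000 ms,
# dfs.heartbeat.interval = 3 s.
RECHECK_INTERVAL_MS = 300_000
HEARTBEAT_INTERVAL_S = 3


def heartbeat_expiry_ms(recheck_ms=RECHECK_INTERVAL_MS,
                        heartbeat_s=HEARTBEAT_INTERVAL_S):
    """Time without heartbeats after which a DataNode is considered failed."""
    return 2 * recheck_ms + 10 * 1000 * heartbeat_s


def plan_rereplication(block_locations, failed_node, live_nodes, replication=3):
    """Return (block, source, target) copy instructions (hypothetical helper).

    block_locations maps a block ID to the set of DataNodes holding a replica.
    Targets are chosen by lowest block count, an illustrative policy only.
    """
    load = defaultdict(int)
    for holders in block_locations.values():
        for node in holders:
            load[node] += 1

    plan = []
    for block, holders in sorted(block_locations.items()):
        if failed_node not in holders:
            continue  # Step 1: only blocks that were on the failed DataNode.
        # Step 2: locate live DataNodes that still hold a copy.
        survivors = sorted((holders - {failed_node}) & set(live_nodes))
        if not survivors:
            continue  # No surviving replica to copy from.
        missing = max(0, replication - len(survivors))
        candidates = sorted(
            (n for n in live_nodes if n not in holders),
            key=lambda n: (load[n], n),
        )
        # Step 3: instruct a surviving holder to copy the block elsewhere.
        for target in candidates[:missing]:
            plan.append((block, survivors[0], target))
            load[target] += 1
    return plan


if __name__ == "__main__":
    print(heartbeat_expiry_ms() / 60_000, "minutes")
    blocks = {
        "blk_1": {"dn1", "dn2", "dn3"},
        "blk_2": {"dn2", "dn3", "dn4"},
        "blk_3": {"dn1", "dn3", "dn4"},
    }
    for step in plan_rereplication(blocks, "dn1", ["dn2", "dn3", "dn4", "dn5"]):
        print(step)
```

In this model, only blk_1 and blk_3 had replicas on the failed node dn1, so only those two blocks receive copy instructions. blk_2 already has three live replicas and is left alone.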
Replacing a Disk on a DataNode Host
If one of your DataNode hosts experiences a disk failure, follow this process to replace the disk:
- Stop managed services.
- Decommission the DataNode role instance.
- Replace the failed disk.
- Recommission the DataNode role instance.
- Run the HDFS fsck utility to validate the health of HDFS. The utility normally reports over-replicated blocks immediately after a DataNode is reintroduced to the cluster; HDFS corrects these automatically over time.
- Start managed services.
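The fsck validation step in the procedure above can be scripted. The following Python sketch runs `hdfs fsck` and extracts a few summary fields from its report. The helper names and the sample report text are hypothetical, and the label strings it searches for are assumed from typical fsck summaries, so the exact layout may differ between Hadoop versions.

```python
import re
import subprocess


def run_fsck(path="/"):
    """Run `hdfs fsck <path>` and return its text output.

    Requires an HDFS client on the PATH and permission to check the path.
    """
    result = subprocess.run(
        ["hdfs", "fsck", path], capture_output=True, text=True
    )
    return result.stdout


def summarize_fsck(report):
    """Extract selected summary fields from fsck output (hypothetical helper)."""
    summary = {"healthy": bool(re.search(r"is HEALTHY", report))}
    labels = (
        "Over-replicated blocks",
        "Under-replicated blocks",
        "Corrupt blocks",
    )
    for label in labels:
        match = re.search(rf"{re.escape(label)}:\s+(\d+)", report)
        summary[label] = int(match.group(1)) if match else None
    return summary


# Illustrative report fragment, not output captured from a real cluster.
SAMPLE_REPORT = """\
 Over-replicated blocks:\t12 (3.1 %)
 Under-replicated blocks:\t0 (0.0 %)
 Corrupt blocks:\t\t0
The filesystem under path '/' is HEALTHY
"""

if __name__ == "__main__":
    print(summarize_fsck(SAMPLE_REPORT))
```

On a live cluster, `summarize_fsck(run_fsck("/"))` could be run periodically after recommissioning. A nonzero over-replicated count that falls toward zero on later runs matches the behavior described in the procedure.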