DataNodes store data in a Hadoop cluster; DataNode is also the name of the daemon that manages that data. File data is replicated on multiple DataNodes for reliability and so that computation can run close to the data. DataNodes within a cluster should be uniform; if they are not, problems can occur. For example, DataNodes with less storage capacity fill up more quickly than DataNodes with more capacity, which can result in job failures.

How NameNode Manages Blocks on a Failed DataNode

A DataNode is considered dead after a set period without any heartbeats (10.5 minutes by default). When this happens, the NameNode performs the following actions to maintain the configured replication factor (3x replication by default):
  1. The NameNode determines which blocks were on the failed DataNode.
  2. The NameNode locates other DataNodes with copies of these blocks.
  3. The DataNodes with block copies are instructed to copy those blocks to other DataNodes to maintain the configured replication factor.
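
The three steps above can be sketched as a toy model. This is a minimal illustration, not the NameNode's actual algorithm: the function name plan_rereplication, the dictionary-based block map, and the "first survivor, first candidate" selection are assumptions made for the example. A real NameNode also accounts for rack placement, node load, and available space.

```python
def plan_rereplication(block_locations, failed_node, live_nodes, replication=3):
    """Return (block, source, target) copy instructions after a DataNode failure.

    block_locations: dict mapping a block ID to the set of DataNodes holding a replica.
    All names here are hypothetical; this only mirrors the three documented steps.
    """
    plan = []
    for block, holders in sorted(block_locations.items()):
        # Step 1: only blocks that had a replica on the failed DataNode are affected.
        if failed_node not in holders:
            continue
        # Step 2: locate live DataNodes that still hold a copy of the block.
        survivors = sorted(n for n in holders if n != failed_node and n in live_nodes)
        if not survivors:
            continue  # no live replica remains, so there is nothing to copy from
        # Step 3: copy to live DataNodes without the block until the factor is met.
        candidates = sorted(n for n in live_nodes if n not in holders)
        missing = max(replication - len(survivors), 0)
        for target in candidates[:missing]:
            plan.append((block, survivors[0], target))
    return plan


blocks = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
plan = plan_rereplication(blocks, failed_node="dn1", live_nodes={"dn2", "dn3", "dn4"})
print(plan)  # [('blk_1', 'dn2', 'dn4')]
```

Only blk_1 is re-replicated in this example, because blk_2 never had a replica on the failed DataNode.
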
Keep the following in mind when working with dead DataNodes:
  • If the DataNode failed due to a disk failure, follow the procedure in Replacing a Disk on a DataNode Host or Performing Disk Hot Swap for DataNodes to bring the repaired DataNode back online. If the DataNode stopped sending heartbeats for other reasons, it must be recommissioned before it can be added back to the cluster. For more information, see Recommissioning Hosts.
  • If a DataNode rejoins the cluster, blocks that were on that DataNode may have surplus replicas. The NameNode removes the excess replicas at random, subject to rack-awareness policies.
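
The 10.5-minute default mentioned in this section follows from two HDFS settings: dfs.namenode.heartbeat.recheck-interval (300000 ms by default) and dfs.heartbeat.interval (3 seconds by default). The sketch below reproduces that arithmetic; the function name is hypothetical, and the defaults should be checked against the hdfs-default.xml of your release.

```python
def dead_node_interval_seconds(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Seconds without heartbeats before a DataNode is considered dead.

    Mirrors the commonly documented formula:
    2 * heartbeat recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


seconds = dead_node_interval_seconds()
print(seconds, seconds / 60)  # 630.0 10.5
```

Lowering either setting shortens the time before re-replication starts, at the cost of more false positives during brief network interruptions.
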

Replacing a Disk on a DataNode Host

Minimum Required Role: Operator (also provided by Configurator, Cluster Administrator, Full Administrator)

For CDH 5.3 and higher, see Performing Disk Hot Swap for DataNodes.

If one of your DataNode hosts experiences a disk failure, follow this process to replace the disk:
  1. Stop managed services.
  2. Decommission the DataNode role instance.
  3. Replace the failed disk.
  4. Recommission the DataNode role instance.
  5. Run the HDFS fsck utility to validate the health of HDFS. The utility normally reports over-replicated blocks immediately after a DataNode is reintroduced to the cluster; HDFS corrects these automatically over time.
  6. Start managed services.

Removing a DataNode

Minimum Required Role: Operator (also provided by Configurator, Cluster Administrator, Full Administrator)

  1. The number of DataNodes in your cluster must be greater than or equal to the replication factor you have configured for HDFS. (This value is typically 3.) To satisfy this requirement, add DataNode roles on other hosts as required and start the role instances before removing any DataNodes.
  2. Ensure that the DataNode to be removed is running.
  3. Decommission the DataNode role. When asked to select the role instance to decommission, select the DataNode role instance.
  4. The decommissioning process moves the data blocks to the other available DataNodes.
  5. Once decommissioning is completed, stop the DataNode role. When asked to select the role instance to stop, select the DataNode role instance.
  6. Verify the integrity of the HDFS service:
    1. Run the following command to identify any problems in the HDFS file system:
      hdfs fsck /
    2. Fix any errors reported by the fsck command. If required, create a Cloudera support case.
  7. After all errors are resolved:
    1. Remove the DataNode role.
    2. Manually remove the DataNode data directories. You can determine the location of these directories by examining the DataNode Data Directory property in the HDFS configuration. In Cloudera Manager, go to the HDFS service, select the Configuration tab and search for the property.
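
The integrity check in step 6 can be scripted. The following is a minimal sketch, assuming the hdfs CLI is on the PATH and that the fsck summary ends with a line of the form "The filesystem under path '/' is HEALTHY". The helper names and the sample text are hypothetical, and the exact output can vary between releases.

```python
import subprocess


def fsck_reports_healthy(fsck_output):
    """Return True if the fsck output declares the checked path HEALTHY."""
    for line in fsck_output.splitlines():
        text = line.strip()
        if text.startswith("The filesystem under path") and text.endswith("is HEALTHY"):
            return True
    return False


def run_fsck(path="/"):
    """Run `hdfs fsck <path>` and return its standard output (needs the hdfs CLI)."""
    result = subprocess.run(
        ["hdfs", "fsck", path], capture_output=True, text=True, check=False
    )
    return result.stdout


# Illustrative sample only, not captured from a real cluster.
SAMPLE_OUTPUT = """\
 Total blocks (validated):      120
 Corrupt blocks:                0
The filesystem under path '/' is HEALTHY
"""
healthy = fsck_reports_healthy(SAMPLE_OUTPUT)
print(healthy)  # True

# On a cluster host, the live check would be:
# healthy = fsck_reports_healthy(run_fsck("/"))
```

Keeping the parsing separate from the subprocess call makes the check easy to test against saved fsck output before running it on a cluster.
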

Fixing Block Inconsistencies

You can use the output of the hdfs fsck or hdfs dfsadmin -report commands to find inconsistencies in the HDFS data blocks, such as missing, misreplicated, or underreplicated blocks. Each type of inconsistency calls for a different approach.

Missing blocks: Ensure that the DataNodes in your cluster and the disks on them are healthy. Doing so helps recover blocks that are reported as missing but still have recoverable replicas. If a file contains corrupt or missing blocks that cannot be recovered, the file is missing data, and all data from the missing block onward becomes inaccessible through the CLI tools and the FileSystem API. In most cases, the only solution is to delete the file (by using the hdfs fsck <path> -delete command) and recover the data from another source.

Underreplicated blocks: HDFS automatically attempts to fix this issue by replicating the underreplicated blocks to other DataNodes until the replication factor is met. If the automatic replication does not work, you can run the HDFS Balancer to address the issue.

Misreplicated blocks: Run the hdfs fsck <path> -replicate command to trigger the replication of misreplicated blocks. This ensures that the blocks are correctly replicated across racks in the cluster.
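
The guidance in this section can be tied to fsck output with a small triage helper. This is a sketch under assumptions: the summary labels ("Missing blocks", "Under-replicated blocks", "Mis-replicated blocks") follow the format commonly printed by hdfs fsck but may differ between releases. The function names, the ACTIONS mapping, and the sample text are hypothetical.

```python
import re

# Action text paraphrases this section; the label spelling is an assumed fsck format.
ACTIONS = {
    "Missing blocks": "Check DataNode and disk health; if unrecoverable, "
                      "delete the file and restore it from another source.",
    "Under-replicated blocks": "Wait for automatic re-replication; "
                               "if it does not complete, run the HDFS Balancer.",
    "Mis-replicated blocks": "Run hdfs fsck <path> -replicate.",
}


def parse_counts(fsck_output):
    """Extract the block counters named in ACTIONS; absent labels count as 0."""
    counts = {}
    for label in ACTIONS:
        match = re.search(re.escape(label) + r":\s+(\d+)", fsck_output)
        counts[label] = int(match.group(1)) if match else 0
    return counts


def suggest_actions(fsck_output):
    """Return (label, action) pairs for every counter greater than zero."""
    counts = parse_counts(fsck_output)
    return [(label, ACTIONS[label]) for label, n in counts.items() if n > 0]


# Illustrative sample only, not captured from a real cluster.
SAMPLE_SUMMARY = """\
 Under-replicated blocks:       4 (2.0 %)
 Mis-replicated blocks:         1 (0.5 %)
 Missing blocks:                0
"""
counts = parse_counts(SAMPLE_SUMMARY)
actions = suggest_actions(SAMPLE_SUMMARY)
for label, action in actions:
    print(label, "->", action)
```

With the sample summary, only the underreplicated and misreplicated entries produce a suggested action, because the missing-block count is zero.
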