Tuning and Troubleshooting Host Decommissioning

Decommissioning a host decommissions and stops all roles on the host without requiring you to individually decommission the roles on each service. The decommissioning process can take a long time and uses a great deal of cluster resources, including network bandwidth. You can tune the decommissioning process to improve performance and mitigate the performance impact on the cluster.

You can use the Decommission and Recommission features to perform minor maintenance on cluster hosts using Cloudera Manager to manage the process. See Performing Maintenance on a Cluster Host.

Continue reading:

Tuning HDFS Prior to Decommissioning DataNodes
Tuning HBase Prior to Decommissioning DataNodes
Performance Considerations
- Troubleshooting Performance of Decommissioning

Tuning HDFS Prior to Decommissioning DataNodes

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

When a DataNode is decommissioned, the NameNode ensures that every block from the DataNode will still be available across the cluster as dictated by the replication factor. This procedure involves copying blocks from the DataNode in small batches. If a DataNode has thousands of blocks, decommissioning can take several hours. Before decommissioning hosts with DataNodes, you should first tune HDFS:

Run the following command to identify any problems in the HDFS file system:

hdfs fsck / -list-corruptfileblocks -openforwrite -files -blocks -locations 2>&1 > /tmp/hdfs-fsck.txt

Fix any issues reported by the fsck command. If the command output lists corrupted files, use the fsck command to move them to the lost+found directory or delete them:
```
hdfs fsck file_name -move
```
or
```
hdfs fsck file_name -delete
```
Raise the heap size of the DataNodes. DataNodes should be configured with at least 4 GB heap size to allow for the increase in iterations and max streams.
1. Go to the HDFS service page.
2. Click the Configuration tab.
3. Select Scope > DataNode.
4. Select Category > Resource Management.
5. Set the Java Heap Size of DataNode in Bytes property as recommended.
  To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.
6. Enter a Reason for change, and then click Save Changes to commit the changes.
Increase the replication work multiplier per iteration to a larger number (the default is 2, however 10 is recommended):
1. Select Scope > NameNode.
2. Expand the Category > Advanced category.
3. Configure the Replication Work Multiplier Per Iteration property to a value such as 10.
  To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.
4. Enter a Reason for change, and then click Save Changes to commit the changes.
Increase the replication maximum threads and maximum replication thread hard limits:
1. Select Scope > NameNode.
2. Expand the Category > Advanced category.
3. Configure the Maximum number of replication threads on a DataNode and Hard limit on the number of replication threads on a DataNode properties to 50 and 100 respectively. You can decrease the number of threads (or use the default values) to minimize the impact of decommissioning on the cluster, but the trade off is that decommissioning will take longer.
  To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.
4. Enter a Reason for change, and then click Save Changes to commit the changes.
Restart the HDFS service.

For additional tuning recommendations, see Performance Considerations.

Tuning HBase Prior to Decommissioning DataNodes

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

To increase the speed of a rolling restart of the HBase service, set the Region Mover Threads property to a higher value. This increases the number of regions that can be moved in parallel, but places additional strain on the HMaster. In most cases, Region Mover Threads should be set to 5 or lower.

Performance Considerations

Decommissioning a DataNode does not happen instantly because the process requires replication of a potentially large number of blocks. During decommissioning, the performance of your cluster may be impacted. This section describes the decommissioning process and suggests solutions for several common performance issues.

Decommissioning occurs in two steps:

The Commission State of the DataNode is marked as Decommissioning and the data is replicated from this node to other available nodes. Until all blocks are replicated, the node remains in a Decommissioning state. You can view this state from the NameNode Web UI. (Go to the HDFS service and select Web UI > NameNode Web UI.)
When all data blocks are replicated to other nodes, the node is marked as Decommissioned.

Decommissioning can impact performance in the following ways:

There must be enough disk space on the other active DataNodes for the data to be replicated. After decommissioning, the remaining active DataNodes have more blocks and therefore decommissioning these DataNodes in the future may take more time.
There will be increased network traffic and disk I/O while the data blocks are replicated.
Data balance and data locality can be affected, which can lead to a decrease in performance of any running or submitted jobs.
Decommissioning a large numbers of DataNodes at the same time can decrease performance.
If you are decommissioning a minority of the DataNodes, the speed of data reads from these nodes limits the performance of decommissioning because decommissioning maxes out network bandwidth when reading data blocks from the DataNode and spreads the bandwidth used to replicate the blocks among other DataNodes in the cluster. To avoid performance impacts in the cluster, Cloudera recommends that you only decommission a minority of the DataNodes at the same time.
You can decrease the number of replication threads to decrease the performance impact of the replications, but this will cause the decommissioning process to take longer to complete. See Tuning HDFS Prior to Decommissioning DataNodes.

Cloudera recommends that you add DataNodes and decommission DataNodes in parallel, in smaller groups. For example, if the replication factor is 3, then you should add two DataNodes and decommission two DataNodes at the same time.

Troubleshooting Performance of Decommissioning

The following conditions can also impact performance when decommissioning DataNodes:

Open Files

Write operations on the DataNode do not involve the NameNode. If there are blocks associated with open files located on a DataNode, they are not relocated until the file is closed. This commonly occurs with:

Clusters using HBase
Open Flume files
Long running tasks

To find open files, run the following command:

hdfs dfsadmin -listOpenFiles -blockingDecommission

The command returns output similar to the following example:

Client Host         Client Name         Open File Path
172.26.12.77        DFSClient_NONMAPREDUCE_-698274460_1 /hbase/oldWALs/dn3.cloudera.com%2C22101%2C1540973344249.dn3.cloudera.com%2C22101%2C1540973344249.regiongroup-0.154099857098

After you find the open files, perform the appropriate action to restart process to close the file. For example, major compaction closes all files in a region for HBase.

Alternatively, you may evict writers to those decommissioning DataNodes with the following command:

hdfs dfsadmin -evictWriters <datanode_host:ipc_port>

For example:

hdfs dfsadmin -evictWriters datanode1:20001

A block cannot be relocated because there are not enough DataNodes to satisfy the block placement policy.

For example, for a 10 node cluster, if the mapred.submit.replication is set to the default of 10 while attempting to decommission one DataNode, there will be difficulties relocating blocks that are associated with map/reduce jobs. This condition will lead to errors in the NameNode logs similar to the following:

org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault: Not able to place enough replicas, still in need of 3 to reach 3

Use the following steps to find the number of files where the block replication policy is equal to or above your current cluster size:

Provide a listing of open files, their blocks, the locations of those blocks by running the following command:
```
hadoop fsck / -files -blocks -locations -openforwrite 2>&1 > openfiles.out
```
Run the following command to return a list of how many files have a given replication factor:
```
grep repl= openfiles.out | awk '{print $NF}' | sort | uniq -c
```
For example, when the replication factor is 10 , and decommissioning one:
```
egrep -B4 "repl=10" openfiles.out | grep -v '<dir>' | awk '/^\//{print $1}'
```
Examine the paths, and decide whether to reduce the replication factor of the files, or remove them from the cluster.

Performing Maintenance on a Cluster Host

Maintenance Mode