The HDFS Balancer

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

HDFS data might not always be placed uniformly across DataNodes. One common reason is the addition of new DataNodes to an existing cluster. HDFS provides a balancer utility that analyzes block placement and balances data across the DataNodes. It moves blocks until the cluster is deemed to be balanced. A cluster is balanced when the utilization of every DataNode (the ratio of used space on the node to the total capacity of the node) differs from the utilization of the cluster (the ratio of used space on the cluster to the total capacity of the cluster) by no more than a given threshold percentage. The balancer does not balance between individual volumes on a single DataNode.
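The balance criterion above can be expressed as a short check. The following Python sketch is only an illustration of the definition, not the Balancer's actual implementation; the `DataNode` class, the function names, and the sample figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataNode:
    """Hypothetical stand-in for a DataNode's storage figures."""
    name: str
    used: int       # used space, in any consistent unit
    capacity: int   # total capacity, in the same unit

    @property
    def utilization(self) -> float:
        # Ratio of used space to total capacity, as a percentage.
        return 100.0 * self.used / self.capacity


def cluster_utilization(nodes: List[DataNode]) -> float:
    # Ratio of used space on the cluster to its total capacity, as a percentage.
    total_used = sum(n.used for n in nodes)
    total_capacity = sum(n.capacity for n in nodes)
    return 100.0 * total_used / total_capacity


def is_balanced(nodes: List[DataNode], threshold: float = 10.0) -> bool:
    # Balanced when every node is within `threshold` percentage points
    # of the cluster-wide utilization.
    overall = cluster_utilization(nodes)
    return all(abs(n.utilization - overall) <= threshold for n in nodes)


# Two nearly full nodes plus a newly added, empty one.
before = [DataNode("dn1", 80, 100), DataNode("dn2", 80, 100), DataNode("dn3", 0, 100)]
# The same total data after blocks have been moved to the new node.
after = [DataNode("dn1", 55, 100), DataNode("dn2", 55, 100), DataNode("dn3", 50, 100)]

print(is_balanced(before))  # False
print(is_balanced(after))   # True
```

In the `before` case the cluster utilization is about 53%, so the empty node is far outside the 10-point band; after the move, every node is within a few points of the cluster figure.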

In Cloudera Manager, the HDFS balancer utility is implemented by the Balancer role. The Balancer role usually shows a health of None on the HDFS Instances tab because it does not run continuously.

Running the Balancer

  1. Go to the HDFS service.
  2. Select Actions > Rebalance.
  3. Click Rebalance on the confirmation screen. A Finished status indicates that the Balancer ran successfully.

Configuring the Balancer Threshold

The Balancer has a default threshold of 10%, which ensures that disk usage on each DataNode differs from the overall usage in the cluster by no more than 10 percentage points. For example, if overall usage across all the DataNodes in the cluster is 40% of the cluster's total disk-storage capacity, the Balancer ensures that disk usage on each DataNode is between 30% and 50% of that DataNode's disk-storage capacity.

To change the threshold:
  1. Go to the HDFS service.
  2. Click the Configuration tab.
  3. Expand the Balancer Default Group category.
  4. Set the Rebalancing Threshold property.
  5. Click Save Changes to commit the changes.
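
To see how a given Rebalancing Threshold value translates into an acceptable range of per-DataNode usage, the arithmetic from the example above can be sketched in Python. The function name is hypothetical, and clamping the result to 0-100% is an assumption made for illustration.

```python
from typing import Tuple


def allowed_utilization_range(cluster_pct: float, threshold_pct: float = 10.0) -> Tuple[float, float]:
    """Return the (low, high) DataNode usage, in percent, that counts as balanced."""
    # Assumption: the band is clamped to the valid 0-100% range.
    low = max(0.0, cluster_pct - threshold_pct)
    high = min(100.0, cluster_pct + threshold_pct)
    return low, high


# Cluster at 40% usage with the default 10% threshold.
print(allowed_utilization_range(40.0))        # (30.0, 50.0)
# A tighter threshold narrows the acceptable band.
print(allowed_utilization_range(40.0, 5.0))   # (35.0, 45.0)
```

A lower threshold gives a more evenly balanced cluster but generally requires the Balancer to move more blocks before it finishes.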