The HDFS Balancer
The HDFS balancer rebalances data across the DataNodes by moving blocks from over-utilized to under-utilized nodes. As the system administrator, you can run the balancer from the command line as necessary, for example after adding new DataNodes to the cluster.
- The balancer requires the capabilities of an HDFS superuser (for example, the hdfs user) to run.
- You can run the balancer without parameters, as follows:
sudo -u hdfs hdfs balancer
This runs the balancer with a default threshold of 10%, meaning that the balancer will ensure that disk usage on each DataNode differs from the overall usage in the cluster by no more than 10 percentage points. For example, if overall usage across all the DataNodes in the cluster is 40% of the cluster's total disk-storage capacity, the balancer ensures that each DataNode's disk usage is between 30% and 50% of that DataNode's disk-storage capacity.
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> hadoop <command>; they will fail with a security error. Instead, authenticate first with one of the following commands:
$ kinit <user> (if you are using a password)
$ kinit -kt <keytab> <principal> (if you are using a keytab)
and then, for each command executed by this user, run:
$ <command>
- You can run the balancer with a different threshold; for example:
sudo -u hdfs hdfs balancer -threshold 5
This specifies that each DataNode's disk usage must be (or will be adjusted to be) within 5% of the cluster's overall usage.
- You can adjust the network bandwidth used by the balancer by running the dfsadmin -setBalancerBandwidth command before you run the balancer; for example:
hdfs dfsadmin -setBalancerBandwidth newbandwidth
where newbandwidth is the maximum amount of network bandwidth, in bytes per second, that each DataNode can use during the balancing operation. For more information about the bandwidth command, see this page.
- The balancer can take a long time to run, especially if you are running it for the first time, or do not run it regularly.
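The threshold arithmetic described above can be sketched as a small shell helper. This is an illustration only: balancer_range is a hypothetical function, not part of Hadoop, and it assumes whole-number percentages. It prints the lower and upper bounds of per-DataNode usage that count as balanced for a given cluster-wide usage and threshold.

```shell
#!/bin/sh
# balancer_range CLUSTER_USAGE THRESHOLD
# Hypothetical helper (not a Hadoop command): prints the lowest and highest
# per-DataNode usage, in percent, that falls within THRESHOLD percentage
# points of CLUSTER_USAGE. Bounds are clamped to the range 0-100.
balancer_range() {
    cluster_usage=$1
    threshold=$2
    low=$((cluster_usage - threshold))
    high=$((cluster_usage + threshold))
    if [ "$low" -lt 0 ]; then
        low=0
    fi
    if [ "$high" -gt 100 ]; then
        high=100
    fi
    echo "$low $high"
}

balancer_range 40 10   # default threshold: prints "30 50"
balancer_range 40 5    # -threshold 5:      prints "35 45"
```

A smaller threshold narrows the acceptable band, so the balancer generally has more blocks to move before every DataNode falls inside it.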
For more information, see the Balancer Administrator Guide.
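Because the bandwidth limit is given in bytes per second, it can help to convert from a more familiar unit before setting it. The following sketch prints, without executing, the two commands for a chosen bandwidth and threshold. The helper names mb_per_sec_to_bytes and print_balancer_commands are hypothetical, the conversion assumes 1 MB = 1024 * 1024 bytes, and the values 10 and 5 are examples only. On a Kerberos-enabled cluster, the sudo -u hdfs prefix would be replaced by a prior kinit, as described in the note above.

```shell
#!/bin/sh
# Hypothetical helper: convert megabytes per second to bytes per second,
# assuming 1 MB = 1024 * 1024 bytes.
mb_per_sec_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# print_balancer_commands BANDWIDTH_MB_PER_SEC THRESHOLD_PERCENT
# Hypothetical helper: prints the commands rather than running them,
# so they can be reviewed before use on a real cluster.
print_balancer_commands() {
    bandwidth_bytes=$(mb_per_sec_to_bytes "$1")
    echo "sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth $bandwidth_bytes"
    echo "sudo -u hdfs hdfs balancer -threshold $2"
}

# Example: limit each DataNode to 10 MB/s, then balance to within 5%.
print_balancer_commands 10 5
```

The bandwidth is set first so that the limit is already in effect on the DataNodes when the balancer starts moving blocks.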