Configuring the Blocksize for HBase

The blocksize is an important configuration option for HBase. HBase data is stored in one (after a major compaction) or more (possibly before a major compaction) HFiles per column family per region. The blocksize for a given column family determines both of the following:
  • The smallest unit of data HBase can read from the column family's HFiles.
  • The basic unit of measure cached by a RegionServer in the BlockCache.

The default blocksize is 64 KB. The appropriate blocksize depends on your data and usage patterns. Use the following guidelines to tune the blocksize, in combination with testing and benchmarking as appropriate.

  • Consider the average key/value size for the column family when tuning the blocksize. You can find the average key/value size using the HFile utility:
    $ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /path/to/HFILE -m -v
    ...
    Block index size as per heapsize: 296
    reader=hdfs://srv1.example.com:9000/path/to/HFILE, \
    compression=none, inMemory=false, \
    firstKey=US6683275_20040127/mimetype:/1251853756871/Put, \
    lastKey=US6684814_20040203/mimetype:/1251864683374/Put, \
    avgKeyLen=37, avgValueLen=8, \
    entries=1554, length=84447
    ...
  • Consider the pattern of reads to the table or column family. For instance, if it is common to scan for 500 rows on various parts of the table, performance might improve if the blocksize is large enough to encompass 500-1000 rows, so that a single read operation on the HFile is often enough. If your typical scan covers only 3 rows, reading a block that holds 500-1000 rows would be overkill.

    It is difficult to predict the size of a row before it is written, because the data will be compressed when it is written to the HFile. Perform testing to determine the correct blocksize for your data.
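
The guidelines above can be turned into a rough back-of-the-envelope calculation. The following Python sketch is illustrative only: the function names are hypothetical, the 8-byte per-entry overhead (a 4-byte key length plus a 4-byte value length) is an assumption, and compression, data block encoding, tags, and index overhead are all ignored. It counts key/value entries, not rows; a row with several cells occupies several entries. Treat the result as a starting point for testing, not as a recommendation.

```python
import re

# Metadata fields copied from the HFile utility output shown above.
META_TEXT = "avgKeyLen=37, avgValueLen=8, entries=1554, length=84447"


def parse_hfile_meta(text):
    """Extract the integer fields of interest from the metadata text."""
    wanted = ("avgKeyLen", "avgValueLen", "entries", "length")
    found = {}
    for name in wanted:
        match = re.search(r"\b" + name + r"=(\d+)", text)
        if match:
            found[name] = int(match.group(1))
    return found


def entries_per_block(avg_key_len, avg_value_len, blocksize=64 * 1024, overhead=8):
    """Estimate how many key/value entries fit in one uncompressed block.

    overhead is an ASSUMED fixed cost per entry, in bytes.
    """
    return blocksize // (avg_key_len + avg_value_len + overhead)


def blocksize_for_entries(target_entries, avg_key_len, avg_value_len, overhead=8):
    """Estimate the blocksize (in bytes) needed to hold target_entries entries."""
    return target_entries * (avg_key_len + avg_value_len + overhead)


meta = parse_hfile_meta(META_TEXT)
per_block = entries_per_block(meta["avgKeyLen"], meta["avgValueLen"])
needed = blocksize_for_entries(1000, meta["avgKeyLen"], meta["avgValueLen"])

print("entries per 64 KB block (estimate):", per_block)   # 1236
print("bytes needed for 1000 entries (estimate):", needed)  # 53000
```

With the sample values, a default 64 KB block already holds more than 1000 of these small entries under the stated assumptions, so a larger blocksize would mainly help scans that cover more entries than that.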

Configuring the Blocksize for a Column Family

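You can configure the blocksize of a column family at table creation or by disabling and altering an existing table. The BLOCKSIZE value is specified in bytes: the following example creates a table with a 256 KB blocksize (262144 bytes), then changes it to 512 KB (524288 bytes). These instructions are valid whether or not you use Cloudera Manager to manage your cluster.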

hbase> create 'test_table', {NAME => 'test_cf', BLOCKSIZE => '262144'}
hbase> disable 'test_table'
hbase> alter 'test_table', {NAME => 'test_cf', BLOCKSIZE => '524288'}
hbase> enable 'test_table'
After you change the blocksize, the HFiles are rewritten with the new blocksize during the next major compaction. To trigger a major compaction, issue the following command in HBase Shell:
hbase> major_compact 'test_table'

Depending on the size of the table, the major compaction can take some time and have a performance impact while it is running.
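
Because BLOCKSIZE is given in bytes, it is easy to mistype a value when you think in kilobytes. The following Python sketch shows one way to generate the HBase Shell commands from a size in KB. The helper name blocksize_commands is hypothetical and not part of HBase; the script only prints the commands so that you can review them and paste them into HBase Shell yourself.

```python
def blocksize_commands(table, family, blocksize_kb):
    """Return the HBase Shell commands that change a column family's blocksize.

    blocksize_kb is converted to bytes, because BLOCKSIZE takes a byte count.
    """
    blocksize_bytes = blocksize_kb * 1024
    return [
        f"disable '{table}'",
        f"alter '{table}', {{NAME => '{family}', BLOCKSIZE => '{blocksize_bytes}'}}",
        f"enable '{table}'",
        # Rewrites the HFiles; can take time on a large table.
        f"major_compact '{table}'",
    ]


for command in blocksize_commands("test_table", "test_cf", 512):
    print("hbase> " + command)
```

Run with 512 KB, this prints the same disable, alter, enable, and major_compact sequence shown above.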

Monitoring Blocksize Metrics

You monitor the effect of the blocksize indirectly, through the metrics exposed for the BlockCache. See the block_cache* entries in RegionServer Metrics.