Storing Medium Objects (MOBs) in HBase

Data comes in many sizes, and saving all of your data in HBase, including binary data such as images and documents, is convenient. HBase can technically handle binary objects with cells that are up to 10 MB in size. However, HBase's normal read and write paths are optimized for values smaller than 100 KB. When HBase handles large numbers of values larger than 100 KB and up to 10 MB (medium objects, or MOBs), performance degrades because of write amplification caused by splits and compactions.

One way to solve this problem is to store objects larger than 100 KB directly in HDFS and store references to their locations in HBase. CDH 5.4 and higher includes optimizations for storing MOBs directly in HBase, based on HBASE-11339.

To use MOB, you must use HFile version 3. Optionally, you can configure the MOB file reader's cache settings Service-Wide and for each RegionServer, and then configure specific columns to hold MOB data. No change to client code is required for HBase MOB support.

Enabling HFile Version 3 Using Cloudera Manager

Minimum Required Role: Full Administrator

To enable HFile version 3 using Cloudera Manager, edit the HBase Service Advanced Configuration Snippet for HBase Service-Wide.
  1. Go to the HBase service.
  2. Click the Configuration tab.
  3. Search for the property HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml.
  4. Paste the following XML into the Value field and save your changes.
    <property>
      <name>hfile.format.version</name>
      <value>3</value>
    </property>
Changes will take effect after the next major compaction.

Enabling HFile Version 3 Using the Command Line

Paste the following XML into hbase-site.xml.
<property>
  <name>hfile.format.version</name>
  <value>3</value>
</property>

Restart HBase. Changes will take effect for a given region during its next major compaction.
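
Before restarting, it can be useful to confirm that the setting is actually present in the file. The following is a minimal sketch of such a check, using only the JDK's XML parser. The class HFileVersionCheck and its methods are hypothetical helpers written for this illustration and are not part of HBase or Cloudera Manager. The sketch inspects the XML content it is given and does not consult any defaults built into HBase.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of HBase): looks up a property in hbase-site.xml content.
final class HFileVersionCheck {

    /** Returns the value of the named property, or null if it is not set. */
    static String propertyValue(String siteXml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(siteXml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            // Malformed XML or parser setup failure.
            throw new IllegalArgumentException("Could not parse site XML", e);
        }
    }

    /** True only when the given content explicitly sets hfile.format.version to 3. */
    static boolean isHFileV3(String siteXml) {
        return "3".equals(propertyValue(siteXml, "hfile.format.version"));
    }

    public static void main(String[] args) {
        String siteXml = "<configuration>"
                + "<property><name>hfile.format.version</name><value>3</value></property>"
                + "</configuration>";
        System.out.println("HFile v3 configured: " + isHFileV3(siteXml)); // HFile v3 configured: true
    }
}
```

In practice you would read the content of the RegionServer's hbase-site.xml from disk and pass it to isHFileV3.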

Configuring Columns to Store MOBs

Use the following options to configure a column to store MOBs:
  • IS_MOB is a Boolean option, which specifies whether or not the column can store MOBs.
  • MOB_THRESHOLD configures the number of bytes at which an object is considered to be a MOB. If you do not specify a value for MOB_THRESHOLD, the default is 100 KB. If you write a value larger than this threshold, it is treated as a MOB.
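
The interaction between the two options can be summarized in a few lines of code. The following sketch is an illustration of the rule described above, not HBase source code. The class MobThresholdSketch and the method storedAsMob are hypothetical names. Following the wording "larger than this threshold", the sketch assumes that a value exactly equal to the threshold is not treated as a MOB.

```java
// Hypothetical illustration (not HBase code) of the IS_MOB / MOB_THRESHOLD rule.
final class MobThresholdSketch {

    /** Default threshold from the text: 100 KB, expressed in bytes. */
    static final long DEFAULT_MOB_THRESHOLD = 100L * 1024L; // 102400

    /**
     * A value is treated as a MOB only if the column is MOB-enabled
     * and the value is strictly larger than the threshold.
     */
    static boolean storedAsMob(boolean isMob, long thresholdBytes, long valueLengthBytes) {
        return isMob && valueLengthBytes > thresholdBytes;
    }

    public static void main(String[] args) {
        System.out.println(storedAsMob(true, DEFAULT_MOB_THRESHOLD, 50_000L));   // false
        System.out.println(storedAsMob(true, DEFAULT_MOB_THRESHOLD, 500_000L));  // true
        System.out.println(storedAsMob(false, DEFAULT_MOB_THRESHOLD, 500_000L)); // false
    }
}
```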

You can configure a column to store MOBs using the HBase Shell or the Java API.

Using HBase Shell:

hbase> create 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400}
hbase> alter 't1', {NAME => 'f1', IS_MOB => true, MOB_THRESHOLD => 102400}

Using the Java API:

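import org.apache.hadoop.hbase.HColumnDescriptor;

// Enable MOB storage on column family "f" with a 100 KB threshold.
HColumnDescriptor hcd = new HColumnDescriptor("f");
hcd.setMobEnabled(true);
hcd.setMobThreshold(102400L); // threshold in bytes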

HBase MOB Cache Properties

Because there can be a large number of MOB files at any time, as compared to the number of HFiles, MOB files are not always kept open. The MOB file reader cache is an LRU cache that keeps the most recently used MOB files open.

The following properties are available for tuning the HBase MOB cache.
HBase MOB Cache Properties
Property | Default | Description
hbase.mob.file.cache.size | 1000 | The number of opened file handlers to cache. A larger value benefits reads by providing more file handlers per MOB file cache and reduces frequent opening and closing of files. However, if the value is too high, errors such as "Too many opened file handlers" may be logged.
hbase.mob.cache.evict.period | 3600 | The amount of time in seconds after a file is opened before the MOB cache evicts cached files. The default value is 3600 seconds.
hbase.mob.cache.evict.remain.ratio | 0.5f | The ratio, expressed as a float between 0.0 and 1.0, that controls how many files remain cached after an eviction is triggered because the number of cached files exceeds hbase.mob.file.cache.size. The default value is 0.5f.
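
To make the relationship between hbase.mob.file.cache.size and hbase.mob.cache.evict.remain.ratio more concrete, the following is a simplified sketch of an LRU cache with size-triggered eviction. It is not the HBase implementation. The class MobFileCacheSketch, its methods, and the string "handles" are hypothetical stand-ins, and the sketch assumes that an eviction keeps roughly cache size multiplied by the remain ratio entries. Periodic eviction driven by hbase.mob.cache.evict.period is not modeled.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

// Hypothetical sketch (not the HBase implementation) of an LRU file-handle cache.
final class MobFileCacheSketch {
    private final int maxSize;       // plays the role of hbase.mob.file.cache.size
    private final float remainRatio; // plays the role of hbase.mob.cache.evict.remain.ratio
    // accessOrder=true keeps entries ordered from least to most recently used.
    private final LinkedHashMap<String, String> openFiles = new LinkedHashMap<>(16, 0.75f, true);

    MobFileCacheSketch(int maxSize, float remainRatio) {
        if (maxSize <= 0 || remainRatio < 0.0f || remainRatio > 1.0f) {
            throw new IllegalArgumentException("maxSize must be > 0 and remainRatio within [0.0, 1.0]");
        }
        this.maxSize = maxSize;
        this.remainRatio = remainRatio;
    }

    /** Returns the cached handle for a file, "opening" it on a cache miss. */
    String open(String fileName) {
        String handle = openFiles.get(fileName); // get() marks the entry as recently used
        if (handle == null) {
            handle = "handle:" + fileName;       // stand-in for a real file reader
            openFiles.put(fileName, handle);
            if (openFiles.size() > maxSize) {
                evict();
            }
        }
        return handle;
    }

    /** Closes least recently used entries until maxSize * remainRatio entries remain. */
    void evict() {
        int target = (int) (maxSize * remainRatio);
        Iterator<String> it = openFiles.keySet().iterator();
        while (openFiles.size() > target && it.hasNext()) {
            it.next();
            it.remove();
        }
    }

    int size() {
        return openFiles.size();
    }

    boolean isOpen(String fileName) {
        return openFiles.containsKey(fileName); // does not change the access order
    }

    public static void main(String[] args) {
        MobFileCacheSketch cache = new MobFileCacheSketch(4, 0.5f);
        for (String f : new String[] {"a", "b", "c", "d"}) {
            cache.open(f);
        }
        cache.open("a"); // hit: "a" becomes the most recently used entry
        cache.open("e"); // fifth file exceeds maxSize and triggers eviction
        System.out.println(cache.size());      // 2
        System.out.println(cache.isOpen("a")); // true
        System.out.println(cache.isOpen("b")); // false
    }
}
```

With a cache size of 4 and a ratio of 0.5, opening a fifth file leaves only the two most recently used files open. With the defaults of 1000 and 0.5f, the same logic would leave about 500 files open after an eviction.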

Configuring the MOB Cache Using Cloudera Manager

To configure the MOB cache within Cloudera Manager, edit the HBase Service advanced configuration snippet for the cluster. Cloudera recommends testing your configuration with the default settings first.
  1. Go to the HBase service.
  2. Click the Configuration tab.
  3. Search for the property HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml.
  4. Paste your configuration into the Value field and save your changes. The following example sets the hbase.mob.cache.evict.period property to 5000 seconds. See HBase MOB Cache Properties for a full list of configurable properties for HBase MOB.
    <property>
      <name>hbase.mob.cache.evict.period</name>
      <value>5000</value>
    </property>
  5. Restart your cluster for the changes to take effect.

Configuring the MOB Cache Using the Command Line

To customize the configuration of the MOB file reader's cache on each RegionServer, configure the MOB cache properties in the RegionServer's hbase-site.xml. Customize the configuration to suit your environment, and then restart the RegionServer or perform a rolling restart. Cloudera recommends testing your configuration with the default settings first. The following example sets the hbase.mob.cache.evict.period property to 5000 seconds. See HBase MOB Cache Properties for a full list of configurable properties for HBase MOB.
<property>
  <name>hbase.mob.cache.evict.period</name>
  <value>5000</value>
</property>
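
Because a restart is needed for cache changes to take effect, it may be worth sanity-checking the values before you deploy them. The following sketch checks only what the property descriptions state: the cache size and eviction period must be numeric, and the remain ratio must be a float between 0.0 and 1.0. The class MobCacheSettingsCheck is a hypothetical helper written for this illustration. It is not part of HBase or Cloudera Manager, and it does not reproduce any validation that HBase itself performs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check (not part of HBase) for the three MOB cache properties.
final class MobCacheSettingsCheck {

    /** Returns human-readable problems; an empty list means no problem was found. */
    static List<String> problems(Map<String, String> settings) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            String name = e.getKey();
            String value = e.getValue().trim();
            try {
                switch (name) {
                    case "hbase.mob.file.cache.size":
                    case "hbase.mob.cache.evict.period": {
                        Long.parseLong(value); // only checks that the value is an integer
                        break;
                    }
                    case "hbase.mob.cache.evict.remain.ratio": {
                        // Float.parseFloat accepts an optional trailing "f", as in "0.5f".
                        float ratio = Float.parseFloat(value);
                        if (ratio < 0.0f || ratio > 1.0f) {
                            out.add(name + " must be between 0.0 and 1.0, got " + value);
                        }
                        break;
                    }
                    default:
                        out.add("not a MOB cache property: " + name);
                }
            } catch (NumberFormatException ex) {
                out.add(name + " is not a number: " + value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("hbase.mob.cache.evict.period", "5000");
        settings.put("hbase.mob.cache.evict.remain.ratio", "1.5");
        // Reports one problem: the ratio is outside the documented range.
        problems(settings).forEach(System.out::println);
    }
}
```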

Testing MOB Storage and Retrieval Performance

HBase provides the Java utility org.apache.hadoop.hbase.IntegrationTestIngestMOB to assist with testing the MOB feature and deciding on appropriate configuration values for your situation. The utility is run as follows:
$ sudo -u hbase hbase org.apache.hadoop.hbase.IntegrationTestIngestMOB \
            -threshold 102400 \
            -minMobDataSize 512 \
            -maxMobDataSize 5120
  • threshold is the size, in bytes, above which cells are considered to be MOBs. The default is 1 KB.
  • minMobDataSize is the minimum size of MOB data, in bytes. The default is 512 bytes.
  • maxMobDataSize is the maximum size of MOB data, in bytes. The default is 5 KB.

Compacting MOB Files Manually

You can trigger compaction of MOB files manually, rather than waiting for it to be triggered by your configuration, using the HBase Shell commands compact_mob and major_compact_mob. Each command requires the table name as its first parameter and takes an optional column family name as its second. If the column family is provided, only that column family's files are compacted. Otherwise, the files of all MOB-enabled column families are compacted.
hbase> compact_mob 't1'
hbase> compact_mob 't1', 'f1'
hbase> major_compact_mob 't1'
hbase> major_compact_mob 't1', 'f1'

This functionality is also available through the API, using the Admin.compact and Admin.majorCompact methods.