This is the documentation for CDH 5.1.x.
Documentation for other versions is available at Cloudera Documentation.

New Features and Changes for HBase in CDH 5

CDH 5.0.x and 5.1.x each include major upgrades to HBase. Each of these upgrades provides exciting new features, as well as things to keep in mind when upgrading from a previous version.

For new features introduced in CDH 5.0.x, skip to CDH 5.0.x HBase Changes.

CDH 5.1 HBase Changes

CDH 5.1 introduces HBase 0.98, which represents a major upgrade to HBase. This upgrade introduces several new features, including a section of features which are considered experimental and should not be used in a production environment. This overview provides information about the most important features, how to use them, and where to find out more information. Cloudera appreciates your feedback about these features.

In addition to HBase 0.98, Cloudera has pulled in changes from HBASE-10883, HBASE-10964, HBASE-10823, HBASE-10916, and HBASE-11275. Implications of these changes are detailed below and in the Release Notes.

BucketCache Block Cache

A new offheap BlockCache implementation, BucketCache, was introduced as an experimental feature in CDH 5 Beta 1, and is now fully supported in CDH 5.1. BucketCache can be used in either of the following two configurations:
  • As a CombinedBlockCache with both onheap and offheap caches.
  • As an L2 cache for the default onheap LruBlockCache

BucketCache requires less garbage-collection than SlabCache, which is the other offheap cache implementation in HBase. It also has many optional configuration settings for fine-tuning. All available settings are documented in the API documentation for CombinedBlockCache. Following is a simple example configuration.

  1. First, edit hbase-env.xml and set -XX:MaxDirectMemorySize to the total size of the desired onheap plus offheap, in this case, 5 GB (but expressed as 5G).
  2. Next, add the following configuration to hbase-site.xml. This configuration uses 80% of the -XX:MaxDirectMemorySize (4 GB) for offheap, and the remainder (1 GB) for onheap.
  3. Restart or rolling restart your cluster for the configuration to take effect.

Access Control for EXEC Permissions

A new access control level has been added to check whether a given user has EXEC permission. This can be specified at the level of the cluster, table, row, or cell.

To use EXEC permissions, perform the following procedure.
  • Install the AccessController coprocessor either as a system coprocessor or on a table as a table coprocessor.
  • Set the configuration setting in hbase-site.xml to true.

For more information on setting and revoking security permissions, see the Access Control section of the Apache HBase Reference Guide.

Reverse Scan API

A reverse scan API has been introduced. This allows you to scan a table in reverse. Previously, if you wanted to be able to access your data in either direction, you needed to store the data in two separate tables, each ordered differently. This feature was implemented in HBASE-4811.

To use the reverse scan feature, use the new Scan.setReversed(boolean reversed) API. If you specify a startRow and stopRow, to scan in reverse, the startRow needs to be lexicographically after the stopRow. See the Scan API documentation for more information.

MapReduce Over Snapshots

You can now run a MapReduce job over a snapshot from HBase, rather than being limited to live data. This provides the ability to separate your client-side work load from your live cluster if you need to run resource-intensive MapReduce jobs and can tolerate using potentially-stale data. You can either run the MapReduce job on the snapshot within HBase, or export the snapshot and run the MapReduce job against the exported file.

Running a MapReduce job on an exported file outside of the scope of HBase relies on the permissions of the underlying filesystem and server, and bypasses ACLs, visibility labels, and encryption that may otherwise be provided by your HBase cluster.

A new API, TableSnapshotInputFormat, is provided. For more information, see TableSnapshotInputFormat.

MapReduce over snapshots was introduced in CDH 5.0.

Stateless Streaming Scanner over REST

A new stateless streaming scanner is available over the REST API. Using this scanner, clients do not need to restart a scan if the REST server experiences a transient failure. All query parameters are specified during the REST request. Query parameters include startrow, endrow, columns, starttime, endtime, maxversions, batchtime, and limit. Following are a few examples of using the stateless streaming scanner.

Scan the entire table, return the results in JSON.
curl -H "Accept: application/json" https://localhost:8080/ExampleScanner/*
Scan the entire table, return the results in XML.
curl -H "Content-Type: text/xml" https://localhost:8080/ExampleScanner/*
Scan only the first row.
curl -H "Content-Type: text/xml" \ 
Scan only specific columns.
curl -H "Content-Type: text/xml" \ 
Scan for rows between starttime and endtime.
curl -H "Content-Type: text/xml" \
Scan for a given row prefix.
curl -H "Content-Type: text/xml" https://localhost:8080/ExampleScanner/test*

For full details about the stateless streaming scanner, see the API documentation for this feature.

Delete Methods of Put Class Now Use Constructor Timestamps

The Delete() methods of the Put class of the HBase Client API previously ignored the constructor's timestamp, and used the value of HConstants.LATEST_TIMESTAMP. This behavior was different from the behavior of the add() methods. The Delete() methods now use the timestamp from the constructor, creating consistency in behavior across the Put class. See HBASE-10964.

Experimental Features

  Warning: These features are still considered experimental. Experimental features are not supported and Cloudera does not recommend using them in production environments or with important data.

Visibility Labels

You can now use visibility labels, such as CONFIDENTIAL, TOPSECRET, and PUBLIC, at the cell level. In addition, these labels can be grouped using logical operators &, |, and ! (AND, OR, NOT). A given user is associated with a set of visibility labels, and the policy for associating the labels is pluggable. The default plugin ( checks for visibility labels on cells that would be returned by a Get or Scan and drops the cells that a user is not authorized to see. You can specify custom implementations of ScanLabelGenerator by setting the property hbase.regionserver.scan.visibility.label.generator.class to a comma-separated list of classes.

No labels are configured by default. You can add a label to the system using either the VisibilityClient#addLabels() API or the add_label shell command. Similar APIs and shell commands are provided for deleting labels and assigning them to users. Only a user with superuser access (the hbase.superuser access level) can perform these operations.

To assign a visibility label to a cell, you can label the cell using the API method Mutation#setCellVisibility(new CellVisibility(<labelExp>));.

Visibility labels and request authorizations cannot contain the symbols &, |, !, ( and ) because they are reserved for constructing visibility expressions. See HBASE-10883.

For more information about visibility labels, see the Visibility Labels section of the Apache HBase Reference Guide.

If you use visibility labels along with access controls, you must ensure that the Access Controller is loaded before the Visibility Controller in the list of coprocessors. This is the default configuration. See HBASE-11275.

In order to use per-cell access controls or visibility labels, you must use HFile version 3. To enable HFile version 3, add the following to hbase-site.xml, using an advanced configuration snippet if you use Cloudera Manager, or directly to the file if your deployment is unmanaged. Changes will take effect after the next major compaction.

Visibility labels are an experimental feature introduced in CDH 5.1.

Per-Cell Access Controls

You can now specify access control levels at the per-cell level, as well as at the level of the cluster, table, or row.
A new parent class has been provided, which encompasses Get, Scan, and Query. This change also moves the getFilter and setFilter methods of Get and Scan to the common parent class. Client code may need to be recompiled to take advantage of per-cell ACLs. See the Access Control section of the Apache HBase Reference Guide for more information.
The ACLS for cells having timestamps in the future are not considered for authorizing the pending mutation operations. See HBASE-10823.
If you use visibility labels along with access controls, you must ensure that the Access Controller is loaded before the Visibility Controller in the list of coprocessors. This is the default configuration.
In order to use per-cell access controls or visibility labels, you must use HFile version 3. To enable HFile version 3, add the following to hbase-site.xml, using an advanced configuration snippet if you use Cloudera Manager, or directly to the file if your deployment is unmanaged.. Changes will take effect after the next major compaction.
Per-cell access controls are an experimental feature introduced in CDH 5.1.

Transparent Server-Side Encryption

Transparent server-side encryption can now be enabled for both HFiles and write-ahead logs (WALs), to protect their contents at rest. To configure transparent encryption, first create an encryption key, then configure the appropriate settings in hbase-site.xml. See the Transparent Encryption section in the Apache HBase Reference Guide for more information.
Transparent server-side encryption is an experimental feature introduced in CDH 5.1.

Stripe Compaction

Stripe compaction is a compaction scheme that segregates the data inside a region by row key, creating "stripes" of data which are visible within the region but transparent to normal operations. This striping improves read performance in common scenarios and greatly reduces variability by avoiding large and/or inefficient compactions.

Configuration guidelines and more information are available at Stripe Compaction.

To configure stripe compaction for a single table from within the HBase shell, use the following syntax.
alter <table>, CONFIGURATION => {<setting> => <value>}
	Example: alter 'orders', CONFIGURATION => {'' => 10}
To configure stripe compaction for a column family from within the HBase shell, use the following syntax.
alter <table>, {NAME => <column family>, CONFIGURATION => {<setting => <value>}}
	Example: alter 'logs', {NAME => 'blobs', CONFIGURATION => {'' => 10}}

Stripe compaction is an experimental feature in CDH 5.1.

Distributed Log Replay

After a region server fails, its failed region is assigned to another region server, which is marked as "recovering" in ZooKeeper. A SplitLogWorker directly replays edits from the WAL of the failed region server to the region at its new location. When a region is in "recovering" state, it can accept writes but no reads (including Append and Increment), region splits or merges. Distributed Log Replay extends the distributed log splitting framework. It works by directly replaying WAL edits to another region server instead of creating recovered.edits files.
Distributed log replay provides the following advantages over using the current distributed log splitting functionality on its own.
  • It eliminates the overhead of writing and reading a large number of recovered.edits files. It is not unusual for thousands of recovered.edits files to be created and written concurrently during a region server recovery. Many small random writes can degrade overall system performance.
  • It allows writes even when a region is in recovering state. It only takes seconds for a recovering region to accept writes again.
To enable distributed log replay, set hbase.master.distributed.log.replay to true. You must also enable HFile version 3. Distributed log replay is unsafe for rolling upgrades.

Distributed log replay is an experimental feature in CDH 5.1.

CDH 5.0.x HBase Changes

HBase in CDH 5.0.x is based on the Apache HBase 0.96 release. When upgrading to CDH 5.0.x, keep the following in mind.

Wire Compatibility

HBase in CDH 5.0.x (HBase 0.96) is not wire compatible with CDH 4 (based on 0.92 and 0.94 releases). Consequently, rolling upgrades from CDH 4 to CDH 5 are not possible because existing CDH 4 HBase clients cannot make requests to CDH 5 servers and CDH 5 HBase clients cannot make requests to CDH 4 servers. Clients of the Thrift and REST proxy servers, however, retain wire compatibility between CDH 4 and CDH 5.

Upgrade is Not Reversible

The upgrade from CDH 4 HBase to CDH 5 HBase is irreversible and requires HBase to be shut down completely. Executing the upgrade script reorganizes existing HBase data stored on HDFS into new directory structures, converts HBase 0.90 HFile v1 files to the improved and optimized HBase 0.96 HFile v2 file format, and rewrites the hbase.version file. This upgrade also removes transient data stored in ZooKeeper during the conversion to the new data format.

These changes were made to reduce the impact in future major upgrades. Previously HBase used brittle custom data formats and this move shifts HBase's RPC and persistent data to a more evolvable Protocol Buffer data format.

API Changes

The HBase User API (Get, Put, Result, Scanner etc; see Apache HBase API documentation) has evolved and attempts have been made to make sure the HBase Clients are source code compatible and thus should recompile without needing any source code modifications. This cannot be guaranteed however, since with the conversion to Protocol Buffers (ProtoBufs), some relatively obscure APIs have been removed. Rudimentary efforts have also been made to preserve recompile compatibility with advanced APIs such as Filters and Coprocessors. These advanced APIs are still evolving and our guarantees for API compatibility are weaker here.

For information about changes to custom filters, see Custom Filters.

As of 0.96, the User API has been marked and all attempts at compatibility in future versions will be made. A version of the javadoc that only contains the User API can be found here.

HBase Metrics Changes

HBase provides a metrics framework based on JMX beans. Between HBase 0.94 and 0.96, the metrics framework underwent many changes. Some beans were added and removed, some metrics were moved from one bean to another, and some metrics were renamed or removed. Click here to download the CSV spreadsheet which provides a mapping.

Custom Filters

If you used custom filters written for HBase 0.94, you need to recompile those filters forHBase 0.96. The custom filter must be altered to fit with the newer interface that uses protocol buffers. Specifically two new methods, toByteArray(…) and parseFrom(…), which are detailed in detailed in the Filter API. These should be used instead of the old methods write(…) and readFields(…), so that protocol buffer serialization is used. To see what changes were required to port one of HBase's own custom filters, see the Git commit that represented porting the SingleColumnValueFilter filter.


In CDH 4, HBase relied on HDFS checksums to protect against data corruption. When you upgrade to CDH 5, HBase checksums are now turned on by default. With this configuration, HBase reads data and then verifies the checksums. Checksum verification inside HDFS will be switched off. If the HBase-checksum verification fails, then the HDFS checksums are used instead for verifying data that is being read from storage. Once you turn on HBase checksums, you will not be able to roll back to an earlier HBase version.

You should see a modest performance gain after setting hbase.regionserver.checksum.verify to true for data that is not already present in the Region Server's block cache.

To enable or disable checksums, modify the following configuration properties in hbase-site.xml:

      If set to  true, HBase will read data and then verify checksums  for
      hfile blocks. Checksum verification inside HDFS will be switched off.
      If the hbase-checksum verification fails, then it will  switch back to
      using HDFS checksums.
The default value for the hbase.hstore.checksum.algorithm property has also changed to CRC32. Previously, Cloudera advised setting it to NULL due to performance issues which are no longer a problem.
     Name of an algorithm that is used to compute checksums. Possible values
     are NULL, CRC32, CRC32C.