Apache HBase Known Issues

Known Issues In CDH 5.7.0

Unsupported Features of Apache HBase 1.2

  • Although Apache HBase 1.2 allows replication of hbase:meta, this feature is not supported by Cloudera and should not be used on CDH clusters until further notice.
  • The FIFO compaction policy has not been thoroughly tested and is not supported in CDH 5.7.0.
  • Although Apache HBase 1.2 adds a new permissive mode to allow mixed secure and insecure clients, this feature is not supported by Cloudera and should not be used on CDH clusters until further notice.

The ReplicationCleaner process can abort if its connection to ZooKeeper is inconsistent.

Bug: HBASE-15234

If the connection with ZooKeeper is inconsistent, the ReplicationCleaner may abort, and the following event will be logged by the HMaster:
WARN org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner: Aborting ReplicationLogCleaner
because Failed to get list of replicators
Unprocessed WALs will accumulate.

Workaround: Restart the HMaster occasionally. The ReplicationCleaner will restart if necessary and process the unprocessed WALs.

IntegrationTestReplication fails if replication does not finish before the verify phase begins.

Bug: None.

During IntegrationTestReplication, if the verify phase starts before the replication phase finishes, the test fails because the target cluster does not contain all of the data. If the HBase services in the target cluster do not have enough memory, long garbage-collection pauses might occur.

Workaround: Use the -t flag to set the timeout value before starting verification.

Performance impact of HDFS encryption with HBase

Cloudera has tested the performance impact of using HDFS encryption with HBase. The overall overhead of HDFS encryption on HBase performance is in the range of 3 to 4% for both read and update workloads. Scan performance has not been thoroughly tested.

Known Issues In CDH 5.6.0

The ReplicationCleaner process can abort if its connection to ZooKeeper is inconsistent.

Bug: HBASE-15234

If the connection with ZooKeeper is inconsistent, the ReplicationCleaner may abort, and the following event will be logged by the HMaster:
WARN org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner: Aborting ReplicationLogCleaner
because Failed to get list of replicators
Unprocessed WALs will accumulate.

Workaround: Restart the HMaster occasionally. The ReplicationCleaner will restart if necessary and process the unprocessed WALs.

ExportSnapshot or DistCp operations may fail on the Amazon s3a:// protocol.

Bug: None.

ExportSnapshot or DistCp operations may fail on AWS when using certain JDK 8 versions, due to an incompatibility between AWS Java SDK 1.9.x and the joda-time date-parsing module.

Workaround: Use joda-time 2.8.1 or higher, which is included in AWS Java SDK 1.10.1 or higher.

Reverse scans do not work when Bloom blocks or leaf-level index blocks are present.

Bug: HBASE-14283

The seekBefore() method calculates the size of the previous data block by assuming that data blocks are contiguous. HFile v2 and higher store Bloom blocks and leaf-level index blocks inline with the data, so this assumption does not hold, and reverse scans fail whenever those inline blocks are present.
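
The flawed arithmetic can be illustrated with a minimal, self-contained Java sketch (the offsets and sizes are made up; this is not the actual HFile reader code):

public class SeekBeforeSketch {
  public static void main(String[] args) {
    // Hypothetical on-disk layout: data block A at offset 0, a 5,000-byte
    // inline Bloom (or leaf-level index) block, then data block B at 70,000.
    long prevDataBlockOffset = 0;
    long curDataBlockOffset = 70_000;
    long inlineBlockSize = 5_000;

    // seekBefore() assumes data blocks are contiguous, so it infers the
    // previous data block's size purely from the two offsets:
    long assumedSize = curDataBlockOffset - prevDataBlockOffset; // 70,000 bytes

    // The previous data block is really only 65,000 bytes; the inferred read
    // also spans the inline block, which is why the reverse scan breaks.
    long actualSize = assumedSize - inlineBlockSize;              // 65,000 bytes
    System.out.println("assumed=" + assumedSize + ", actual=" + actualSize);
  }
}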

Workaround: None.

Known Issues In CDH 5.5.1

The ReplicationCleaner process can abort if its connection to ZooKeeper is inconsistent.

Bug: HBASE-15234

If the connection with ZooKeeper is inconsistent, the ReplicationCleaner may abort, and the following event will be logged by the HMaster:
WARN org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner: Aborting ReplicationLogCleaner
because Failed to get list of replicators
Unprocessed WALs will accumulate.

Workaround: Restart the HMaster occasionally. The ReplicationCleaner will restart if necessary and process the unprocessed WALs.

Extra steps must be taken when upgrading from CDH 4.x to CDH 5.5.1.

The fix for TSB 2015-98 disables legacy object serialization. This causes direct upgrades of HBase clusters from CDH 4.x to CDH 5.5.1 to fail unless one of the workarounds below is used.

Bug: HBASE-14799

Workaround: Use one of the following workarounds to upgrade from CDH 4.x to CDH 5.5.1:
  • Upgrade to a CDH 5 version prior to CDH 5.5.1, and then upgrade from that version to CDH 5.5.1, or
  • Set the hbase.allow.legacy.object.serialization property to true in the Advanced Configuration Snippet for hbase-site.xml if using Cloudera Manager, or directly in hbase-site.xml on an unmanaged cluster. Upgrade your cluster to CDH 5.5.1. After the migration is complete, remove the hbase.allow.legacy.object.serialization property or set it to false. (A sketch for verifying the setting follows this list.)
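
As a quick sanity check during the upgrade window, the property can be read back through the standard HBase client API. A minimal sketch, assuming hbase-site.xml is on the classpath (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CheckLegacySerialization {
  public static void main(String[] args) {
    // HBaseConfiguration.create() loads hbase-site.xml from the classpath,
    // so the printed value reflects the snippet or the unmanaged file.
    Configuration conf = HBaseConfiguration.create();
    boolean allowed = conf.getBoolean("hbase.allow.legacy.object.serialization", false);
    System.out.println("hbase.allow.legacy.object.serialization = " + allowed);
  }
}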

Known Issues In CDH 5.5.0

Data may not be replicated to a worker cluster if multiwal multiplicity is set to greater than 1.

Bug: HBASE-13703, HBASE-6617, HBASE-14501

Workaround: Do not use a multiwal multiplicity greater than 1 on clusters that use replication.
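
For reference, a minimal sketch of checking a cluster's exposure, assuming the HBase 1.x keys hbase.wal.provider and hbase.wal.regiongrouping.numgroups are what set the multiwal multiplicity:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CheckMultiwal {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // "multiwal" selects the region-grouping WAL provider; numgroups is the
    // multiplicity this known issue warns about.
    String provider = conf.get("hbase.wal.provider", "defaultProvider");
    int groups = conf.getInt("hbase.wal.regiongrouping.numgroups", 2);
    if ("multiwal".equals(provider) && groups > 1) {
      System.out.println("multiwal multiplicity is " + groups
          + "; do not enable replication on this cluster");
    }
  }
}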

An operating-system-level tuning issue in RHEL 7 causes 30% latency regressions.

Bug: None

Severity: Medium

Workaround: Avoid using RHEL 7 if you have a latency-critical workload. For a cached workload, consider tuning the C-state (power-saving) behavior of your CPUs.

A RegionServer under extreme duress due to back-to-back garbage collection combined with heavy load on HDFS can lock up while attempting to append to the WAL.

The RegionServer appears operational to ZooKeeper, and continues to host regions, but cannot complete any writes. The most obvious symptom is that all writes to regions on this RegionServer time out, and the RegionServer log shows no progress other than queuing of flushes that never complete. Log messages such as the following may occur:
2015-11-14 05:54:48,659 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 42911ms instead of 3000ms,
  this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.#
2015-11-14 05:54:48,659 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or
  host machine (eg GC): pause of approximately 41110ms
2015-11-14 04:58:09,952 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog:
  Slow sync cost: 2734 ms, current pipeline: [DatanodeInfoWithStorage[10.17.198.17:20002,DS-56e2cf88-f267-43a8-b964-b29858#
2015-11-14 04:58:09,952 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog:
  Slow sync cost: 2963 ms, current pipeline: [DatanodeInfoWithStorage[10.17.198.17:20002,DS-56e2cf88-f267-43a8-b964-b29858#

Bug: HBASE-14374

Workaround: Restart the RegionServer. To avoid the problem, adjust garbage-collection settings, give the RegionServer more RAM, and reduce the load on HDFS.

Known Issues In CDH 5.4 and Higher

The ReplicationCleaner process can abort if its connection to ZooKeeper is inconsistent.

Bug: HBASE-15234

If the connection with ZooKeeper is inconsistent, the ReplicationCleaner may abort, and the following event will be logged by the HMaster:
WARN org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner: Aborting ReplicationLogCleaner
because Failed to get list of replicators
Unprocessed WALs will accumulate.

Workaround: Restart the HMaster occasionally. The ReplicationCleaner will restart if necessary and process the unprocessed WALs.

Increments and CheckAnd* operations are much slower in CDH 5.4 and higher (since HBase 1.0.0) than in CDH 5.3 and earlier.

This is due to the unification of mvcc and sequenceid done in HBASE-8763.

Bug: HBASE-14460

Workaround: None

Known Issues In CDH 5.3 and Higher

Export to Azure Blob Storage (the wasb:// or wasbs:// protocol) is not supported.

Bug: HADOOP-12717

CDH 5.3 and higher supports Azure Blob Storage for some applications. However, a null pointer exception occurs when you specify a wasb:// or wasbs:// location in the --copy-to option of the ExportSnapshot command or as the output directory (the second positional argument) of the Export command.

Workaround: None.

HBase in CDH 5 Dependent on Protobuf 2.5

HBase has a transitive dependency on protobuf 2.5. Applications that process protobuf 3 messages generate an error:
NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.util.ByteStringer

Workaround: In CDH 5.8 and higher, use the Apache Maven Shade Plugin to rename protobuf 3.0 packages in the byte code. The Java code looks the same and uses the original package name. However, the byte code contains a different name, so when the HBase client classes load protobuf 2.5, there are no conflicting classes.

Some HBase Features Not Supported in CDH 5

The following features, introduced upstream in HBase, are not supported in CDH 5:
  • Visibility labels
  • Transparent server-side encryption
  • Stripe compaction
  • Distributed log replay
For more information, see New Features and Changes for HBase in CDH 5.

Medium Object Blob (MOB) Data Loss After MOB Compaction

If you enable Medium Object Blobs (MOBs) on a table, data loss can occur after a MOB compaction.

When there are no outstanding scanners for HBase regions, HBase drops cell sequence IDs during normal region compaction of MOB-enabled tables as an optimization. If a file with no sequence IDs is compacted with an older file that has overlapping cells, the wrong cells may be returned on subsequent compactions. The result is incorrect MOB file references.

The problem manifests as an inability to Scan or Get values from these overlapping rows, and the following WARN-level messages appear in RegionServer logs:

WARN HStore Fail to read the cell, the mob file <file name> doesn't exist
java.io.FileNotFoundException: File does not exist:

This issue is fixed by HBASE-13922.

Releases affected:

CDH 5.3.0, 5.3.1, 5.3.2, 5.3.3, 5.3.4, 5.3.5, 5.3.7, 5.3.8, 5.3.9, 5.3.10

CDH 5.4.0, 5.4.1, 5.4.2, 5.4.3, 5.4.4, 5.4.5, 5.4.7, 5.4.8, 5.4.9, 5.4.10, 5.4.11

CDH 5.5.0, 5.5.1, 5.5.2, 5.5.4, 5.5.5

CDH 5.6.0, 5.6.1

Users affected: HBase users who enable MOB

Severity (Low/Medium/High): High

Impact: MOB data cannot be retrieved from HBase tables.

Immediate action required: Upgrade CDH to version 5.5.6, or to version 5.7.0 or higher.

UnknownScannerException Messages After Upgrade

HBase clients may throw exceptions like the following after an HBase upgrade:
org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 10092964, already closed?

In this upgrade scenario, these messages are caused by restarting the RegionServer during the upgrade. Restart the HBase client to stop seeing the exceptions. The log message has been improved in CDH 5.8.0 and higher.

HBase moves to Protoc 2.5.0

This change may cause JAR conflicts with applications that have older versions of protobuf in their Java classpath.

Bug: None

Workaround: Update applications to use Protoc 2.5.0.

Write performance may be a little slower in CDH 5 than in CDH 4

Bug: None

Workaround: None, but see Checksums in the HBase section of the Cloudera Installation and Upgrade guide.

Must explicitly add permissions for owner users before upgrading from CDH 4.1.x

In CDH 4.1.x, an HBase table could have an owner. The owner user had full administrative permissions on the table (RWXCA). These permissions were implicit (that is, they were not stored explicitly in the HBase acl table), but the code checked them when determining whether a user could perform an operation.

The owner construct was removed as of CDH 4.2.0, and the code now relies exclusively on entries in the acl table. Because table owners do not have an entry in this table, their permissions are removed on upgrade from CDH 4.1.x to CDH 4.2.0 or later.

Bug: None

Anticipated Resolution: None; use workaround

Workaround: Add permissions for owner users before upgrading from CDH 4.1.x. You can automate the task of making the owner users' implicit permissions explicit, using code similar to the following. This snippet is intended only to give you an idea of how to proceed; it may not compile and run as it stands.
# JRuby (HBase shell) sketch. Assumes the surrounding script provides:
#   tables   - the HTableDescriptor objects for the tables to fix
#   protocol - an AccessControllerProtocol handle for the acl table
#   LOG      - a logger instance
java_import org.apache.hadoop.hbase.security.access.UserPermission

PERMISSIONS = 'RWXCA'

tables.each do |t|
  table_name = t.getNameAsString
  owner = t.getOwnerString
  LOG.warn("Granting " + owner + " with " + PERMISSIONS + " for table " + table_name)
  user_permission = UserPermission.new(owner.to_java_bytes, table_name.to_java_bytes,
                                       nil, nil, PERMISSIONS.to_java_bytes)
  protocol.grant(user_permission)
end

Change in default splitting policy from ConstantSizeRegionSplitPolicy to IncreasingToUpperBoundRegionSplitPolicy may create too many splits

This affects you only if you are upgrading from CDH 4.1 or earlier.

Split size is the number of regions of the same table hosted on a given RegionServer, squared, times the region memstore flush size, or the maximum region split size, whichever is smaller. For example, if the flush size is 128MB, the first flush triggers a split, making two regions that will each split when their size reaches 2 * 2 * 128MB = 512MB. If one of those regions splits, there are three regions, and the split size becomes 3 * 3 * 128MB = 1152MB, and so on, until the configured maximum file size is reached; from then on, the maximum file size is used.

This new default policy could create many splits if you have many tables in your cluster.

The default memstore flush size (which sets the initial split size under the new policy) has also changed, from 64MB to 128MB, and the eventual region split size, hbase.hregion.max.filesize, is now 10GB (it was 1GB).
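
The split progression can be sketched in a few lines of Java, using the new defaults quoted above (a worked illustration of the rule, not the actual IncreasingToUpperBoundRegionSplitPolicy source):

public class SplitSizeSketch {
  static final long MB = 1024L * 1024L;
  static final long FLUSH_SIZE = 128 * MB;       // hbase.hregion.memstore.flush.size
  static final long MAX_FILE_SIZE = 10_240 * MB; // hbase.hregion.max.filesize (10GB)

  // Split size for a table with regionCount regions on one RegionServer:
  // regionCount squared, times the flush size, capped at the maximum.
  static long splitSize(long regionCount) {
    return Math.min(regionCount * regionCount * FLUSH_SIZE, MAX_FILE_SIZE);
  }

  public static void main(String[] args) {
    // Prints 128, 512, 1152, 2048, ... MB, reaching the 10240MB cap at 9 regions.
    for (long regions = 1; regions <= 10; regions++) {
      System.out.println(regions + " region(s): split at " + splitSize(regions) / MB + " MB");
    }
  }
}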

Bug: None

Anticipated Resolution: None; use workaround

Workaround: If you find you are getting too many splits, either switch back to the old split policy or increase hbase.hregion.memstore.flush.size.
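
A minimal sketch of pinning one table back to the old policy through an HBase 1.x client (the table name is hypothetical; the policy can also be set cluster-wide with hbase.regionserver.region.split.policy):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RevertSplitPolicy {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      TableName table = TableName.valueOf("mytable"); // hypothetical table
      HTableDescriptor desc = admin.getTableDescriptor(table);
      // Name the old default policy class in the table descriptor.
      desc.setValue(HTableDescriptor.SPLIT_POLICY,
          "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");
      admin.modifyTable(table, desc);
    }
  }
}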

In a cluster where the HBase directory in HDFS is encrypted, an IOException can occur if the BulkLoad staging directory is not in the same encryption zone as the HBase root directory.

If you have encrypted the HBase root directory (hbase.rootdir) and you attempt a BulkLoad where the staging directory is in a different encryption zone from the HBase root directory, you may encounter errors such as:
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
/tmp/output/f/5237a8430561409bb641507f0c531448 can't be moved into an encryption zone.
There are three different directories involved in BulkLoad operations, any of which will cause a similar error if it is not in the same encryption zone as the HBase root directory:
  • The location where the output of the HBase export is dumped
  • The HBase staging directory, which defaults to /tmp/hbase-staging, and is configured using the configuration key hbase.bulkload.staging.dir
  • The final destination, which is usually /hbase

Bug: None

Anticipated Resolution: None; use workaround

Workaround: Configure each of the three locations involved in a BulkLoad operation to be in the same encryption zone. It may be necessary to manually copy the exported files from the HBase export location to a directory within /hbase before attempting the LoadIncremental step of the BulkLoad procedure, and remove the copied files after the BulkLoad has completed. In the interim, extra storage space will be used.
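
A minimal sketch of the manual copy step using the Hadoop FileSystem API (both paths are hypothetical; the destination must be inside the same encryption zone as hbase.rootdir):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyIntoEncryptionZone {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path exported = new Path("/tmp/output");           // export dump, outside the zone
    Path staged = new Path("/hbase/bulkload-staging"); // inside the /hbase zone
    // Copy rather than rename: a rename across encryption zones is what fails
    // with "can't be moved into an encryption zone".
    FileUtil.copy(fs, exported, fs, staged, false /* deleteSource */, conf);
    // Run the LoadIncremental step against 'staged', then delete it afterward.
  }
}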

In a nonsecure cluster, MapReduce over HBase does not properly handle splits in the BulkLoad case

You may see errors because of:

  • Missing permissions on the directory that contains the files to bulk load
  • Missing ACL rights for the table/families

Bug: None

Anticipated Resolution: None; use workaround

Workaround: In a nonsecure cluster, execute BulkLoad as the hbase user.

User-provided coprocessors not supported

Cloudera does not provide support for user-provided custom coprocessors of any kind.

Bug: HBASE-6427

Workaround: None

Custom constraints coprocessors (HBASE-4605) not supported

The constraints coprocessor feature provides a framework for constraints and requires you to add your own custom code. Cloudera does not support user-provided custom code, and hence does not support this feature.

Bug: HBASE-4605

Workaround: None

Pluggable split key policy (HBASE-5304) not supported

Cloudera supports the two split policies that are supplied and tested: ConstantSizeRegionSplitPolicy and PrefixSplitKeyPolicy. The code also provides a mechanism for custom policies that are specified by adding a class name to the HTableDescriptor. Custom code added via this mechanism must be provided by the user. Cloudera does not support user-provided custom code, and hence does not support this feature.

Bug: HBASE-5304

Workaround: None

HBase may not tolerate HDFS root directory changes

While HBase is running, do not stop the HDFS instance running under it and restart it again with a different root directory for HBase.

Bug: None

Workaround: None

AccessController postOperation problems in asynchronous operations

When security and Access Control are enabled, the following problems occur:

  • If a Delete Table fails for a reason other than missing permissions, the access rights are removed but the table may still exist and may be used again.
  • If hbaseAdmin.modifyTable() is used to delete column families, the rights are not removed from the Access Control List (ACL) table. The postOperation is implemented only for postDeleteColumn().
  • If Create Table fails, full rights for that table persist for the user who attempted to create it. If another user later succeeds in creating the table, the user who made the failed attempt still has the full rights.

Bug: HBASE-6992

Workaround: None

Native library not included in tarballs

The native library that enables RegionServer page pinning on Linux is not included in tarballs. This could impair performance if you install HBase from tarballs.

Bug: None

Workaround: None