Managing HBase Snapshots

This page demonstrates how to manage HBase snapshots using either Cloudera Manager or the command line.

Managing HBase Snapshots Using Cloudera Manager

For HBase services, you can use the Table Browser tab to view the HBase tables associated with a service on your cluster. You can view the currently saved snapshots for your tables, and delete or restore them. From the HBase Table Browser tab, you can:

  • View the HBase tables for which you can take snapshots.
  • Initiate immediate (unscheduled) snapshots of a table.
  • View the list of saved snapshots currently maintained. These can include one-off immediate snapshots, as well as scheduled policy-based snapshots.
  • Delete a saved snapshot.
  • Restore from a saved snapshot.
  • Restore a table from a saved snapshot to a new table (Restore As).

Browsing HBase Tables

To browse the HBase tables to view snapshot activity:

  1. From the Clusters tab, select your HBase service.
  2. Go to the Table Browser tab.

Managing HBase Snapshots

Minimum Required Role: BDR Administrator (also provided by Full Administrator)

To take a snapshot:
  1. Click a table.
  2. Click Take Snapshot.
  3. Specify the name of the snapshot, and click Take Snapshot.

To delete a snapshot, open the menu associated with the snapshot and select Delete.

To restore a snapshot, open the menu associated with the snapshot and select Restore.

To restore a snapshot to a new table, select Restore As from the menu associated with the snapshot, and provide a name for the new table.

Storing HBase Snapshots on Amazon S3

HBase snapshots can be stored on the cloud storage service Amazon S3 instead of in HDFS. To configure HBase to store snapshots on Amazon S3, you must have the following information:
  • The access key ID for your Amazon S3 account.
  • The secret access key for your Amazon S3 account.
  • The path to the directory in Amazon S3 where you want your HBase snapshots to be stored.

Because EC2 throughput is limited on a per-node basis, you can speed up the transfer of large snapshots to Amazon S3 by increasing the number of nodes.

Configuring HBase in Cloudera Manager to Store Snapshots in Amazon S3

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

Perform the following steps in Cloudera Manager:

  1. Open the HBase service page.
  2. Select Scope > HBASE (Service-Wide).
  3. Select Category > Backup.
  4. Type AWS in the Search box.
  5. Enter your Amazon S3 access key ID in the field AWS S3 access key ID for remote snapshots.
  6. Enter your Amazon S3 secret access key in the field AWS S3 secret access key for remote snapshots.
  7. Enter the path to the location in Amazon S3 where your HBase snapshots will be stored in the field Amazon S3 Path for Remote Snapshots.
  8. In a terminal window, log in to your Cloudera Manager cluster at the command line and create a /user/hbase directory in HDFS. Change the owner of the directory to hbase. For example:
    hdfs dfs -mkdir /user/hbase
    hdfs dfs -chown hbase /user/hbase

Configuring the Dynamic Resource Pool Used for Exporting and Importing Snapshots in Amazon S3

Dynamic resource pools are used to control the resources available for MapReduce jobs created for HBase snapshots on Amazon S3. By default, MapReduce jobs run against the default dynamic resource pool. To choose a different dynamic resource pool for HBase snapshots stored on Amazon S3, follow these steps:
  1. Open the HBase service page.
  2. Select Scope > HBASE (Service-Wide).
  3. Select Category > Backup.
  4. Type Scheduler in the Search box.
  5. Enter the name of a dynamic resource pool in the Scheduler pool for remote snapshots in AWS S3 property.
  6. Click Save Changes.

HBase Snapshots on Amazon S3 with Kerberos Enabled

Starting with Cloudera Manager 5.8, YARN should by default allow the hbase user to run MapReduce jobs even when Kerberos is enabled. However, this change only applies to new Cloudera Manager deployments, and not if you have upgraded from a previous version to Cloudera Manager 5.8 (or higher).

If Kerberos is enabled on your cluster, and YARN does not allow the hbase user to run MapReduce jobs, perform the following steps:
  1. Open the YARN service page in Cloudera Manager.
  2. Select Scope > NodeManager.
  3. Select Category > Security.
  4. In the Allowed System Users property, click the + sign and add hbase to the list of allowed system users.
  5. Click Save Changes.
  6. Restart the YARN service.

Managing HBase Snapshots on Amazon S3 in Cloudera Manager

Minimum Required Role: BDR Administrator (also provided by Full Administrator)

To take HBase snapshots and store them on Amazon S3, perform the following steps:

  1. On the HBase service page in Cloudera Manager, click the Table Browser tab.
  2. Select a table in the Table Browser. If any recent local or remote snapshots already exist, they display on the right side.
  3. In the dropdown for the selected table, click Take Snapshot.
  4. Enter a name in the Snapshot Name field of the Take Snapshot dialog box.
  5. If Amazon S3 storage is configured as described above, the Take Snapshot dialog box Destination section shows a choice of Local or Remote S3. Select Remote S3.
  6. Click Take Snapshot.

    While the Take Snapshot command is running, a local copy of the snapshot with a name beginning with cm-tmp followed by an auto-generated filename is displayed in the Table Browser. This local copy is deleted as soon as the remote snapshot has been stored in Amazon S3. If the command fails without being completed, the temporary local snapshot might not be deleted. This copy can be manually deleted or kept as a valid local snapshot. To store a current snapshot in Amazon S3, run the Take Snapshot command again, selecting Remote S3 as the Destination, or use the HBase command-line tools to manually export the existing temporary local snapshot to Amazon S3.
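As a rough sketch of the manual export mentioned above, the following shell function wraps the ExportSnapshot tool (covered in the command-line part of this page) to copy an existing local snapshot to an Amazon S3 location. The function name, the example snapshot name, and the s3a:// path are hypothetical, and the sketch assumes that the hbase command is on the PATH and that S3 credentials are already available to the Hadoop configuration.

```shell
# Sketch: copy an existing local snapshot to an S3 location with ExportSnapshot.
# Assumption: S3 credentials are already configured for Hadoop on this host.
export_snapshot_to_s3() {
  snapshot="$1"   # local snapshot name, for example a leftover cm-tmp copy
  s3_path="$2"    # destination, for example s3a://mybucket/mysnapshot-dir (hypothetical)
  if [ -z "$snapshot" ] || [ -z "$s3_path" ]; then
    echo "Usage: export_snapshot_to_s3 <snapshot> <s3a-path>" >&2
    return 2
  fi
  hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot "$snapshot" -copy-to "$s3_path"
}
```

For example, export_snapshot_to_s3 cm-tmp-example s3a://mybucket/mysnapshot-dir would export a leftover temporary snapshot named cm-tmp-example.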

Deleting HBase Snapshots from Amazon S3

To delete a snapshot stored in Amazon S3:
  1. Select the snapshot in the Table Browser.
  2. Click the dropdown arrow for the snapshot.
  3. Click Delete.

Restoring an HBase Snapshot from Amazon S3

To restore an HBase snapshot that is stored in Amazon S3:
  1. Select the table in the Table Browser.
  2. Click Restore Table.
  3. Choose Remote S3 and select the table to restore.
  4. Click Restore.

    Cloudera Manager creates a local copy of the remote snapshot with a name beginning with cm-tmp followed by an auto-generated filename, and uses that local copy to restore the table in HBase. Cloudera Manager then automatically deletes the local copy. If the Restore command fails without completing, the temporary copy might not be deleted and can be seen in the Table Browser. In that case, delete the local temporary copy manually and re-run the Restore command to restore the table from Amazon S3.
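If you prefer the command line to the Table Browser for this cleanup, a minimal sketch might look like the following. It uses the list_snapshots and delete_snapshot shell commands described later on this page in HBase Shell noninteractive mode. The function names are hypothetical, and the sketch assumes the hbase command is on the PATH.

```shell
# List local snapshots whose names start with cm-tmp (regular expression match).
list_tmp_snapshots() {
  echo "list_snapshots 'cm-tmp.*'" | hbase shell -n
}

# Delete one local snapshot by name.
delete_local_snapshot() {
  if [ -z "$1" ]; then
    echo "Usage: delete_local_snapshot <snapshot>" >&2
    return 2
  fi
  echo "delete_snapshot '$1'" | hbase shell -n
}
```

Review the output of list_tmp_snapshots before deleting anything, because a restore that is still running also has a temporary copy.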

Restoring an HBase Snapshot from Amazon S3 with a New Name

By restoring an HBase snapshot stored in Amazon S3 with a new name, you clone the table without affecting the existing table in HBase. To do this, perform the following steps:

  1. Select the table in the Table Browser.
  2. Click Restore Table From Snapshot As.
  3. In the Restore As dialog box, enter a new name for the table in the Restore As field.
  4. Select Remote S3 and choose the snapshot in the list of available Amazon S3 snapshots.

Managing Policies for HBase Snapshots in Amazon S3

You can configure policies to automatically create snapshots of HBase tables on an hourly, daily, weekly, monthly or yearly basis. Snapshot policies for HBase snapshots stored in Amazon S3 are configured using the same procedures as for local HBase snapshots. These procedures are described in Cloudera Manager Snapshot Policies. For snapshots stored in Amazon S3, you must also choose Remote S3 in the Destination section of the policy management dialog boxes.

When you create a snapshot based on a snapshot policy, a local copy of the snapshot is created with a name beginning with cm-auto followed by an auto-generated filename. The temporary copy of the snapshot is displayed in the Table Browser and is deleted as soon as the remote snapshot has been stored in Amazon S3. If the snapshot procedure fails without being completed, the temporary local snapshot might not be deleted. This copy can be manually deleted or kept as a valid local snapshot. To export the HBase snapshot to Amazon S3, use the HBase command-line tools to manually export the existing temporary local snapshot to Amazon S3.

Managing HBase Snapshots Using the Command Line

About HBase Snapshots

In previous HBase releases, the only way to back up or clone a table was to use CopyTable or ExportTable, or to copy all the hfiles in HDFS after disabling the table. These methods have disadvantages:

  • CopyTable and ExportTable can degrade RegionServer performance.
  • Disabling the table means no reads or writes; this is usually unacceptable.

HBase snapshots allow you to clone a table without making data copies, and with minimal impact on RegionServers. Exporting the table to another cluster does not have any impact on the RegionServers.

Use Cases

  • Recovery from user or application errors
    • Useful because it may be some time before the database administrator notices the error.
    • The database administrator may want to save a snapshot before a major application upgrade or change.
    • Recovery cases:
      • Roll back to previous snapshot and merge in reverted data.
      • View previous snapshots and selectively merge them into production.
  • Backup
    • Capture a copy of the database and store it outside HBase for disaster recovery.
    • Capture previous versions of data for compliance, regulation, and archiving.
    • Export from a snapshot on a live system provides a more consistent view of HBase than CopyTable and ExportTable.
  • Audit or report view of data at a specific time
    • Capture monthly data for compliance.
    • Use for end-of-day/month/quarter reports.
  • Application testing
    • Test schema or application changes on similar production data from a snapshot and then discard.
      For example:
      1. Take a snapshot.
      2. Create a new table from the snapshot content (schema and data)
      3. Manipulate the new table by changing the schema, adding and removing rows, and so on. The original table, the snapshot, and the new table remain independent of each other.
  • Offload work
    • Capture, copy, and restore data to another site
    • Export data to another cluster
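The application-testing steps listed above (take a snapshot, then create a new table from it) can be sketched with the snapshot and clone_snapshot shell commands described later on this page. The function names are hypothetical, and the sketch assumes the hbase command is on the PATH.

```shell
# Emit the HBase shell statements: take a snapshot, then clone it to a new table.
clone_for_testing_stmts() {
  if [ "$#" -ne 3 ]; then
    echo "Usage: clone_for_testing_stmts <table> <snapshot> <new_table>" >&2
    return 2
  fi
  printf "snapshot '%s', '%s'\n" "$1" "$2"
  printf "clone_snapshot '%s', '%s'\n" "$2" "$3"
}

# Run the statements in HBase Shell noninteractive mode.
clone_for_testing() {
  stmts=$(clone_for_testing_stmts "$@") || return
  printf '%s\n' "$stmts" | hbase shell -n
}
```

Keeping statement generation separate from execution lets you inspect the statements before running them against a cluster.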

Where Snapshots Are Stored

Snapshot metadata is stored in the .hbase-snapshot directory under the hbase root directory (/hbase/.hbase-snapshot). Each snapshot has its own directory that includes all the references to the hfiles, logs, and metadata needed to restore the table.

hfiles required by the snapshot are in the /hbase/data/<namespace>/<tableName>/<regionName>/<familyName>/ location if the table is still using them; otherwise, they are in /hbase/.archive/<namespace>/<tableName>/<regionName>/<familyName>/.
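To look at the snapshot metadata directory directly, a small helper such as the following could be used. It is only a sketch: the function name is hypothetical, /hbase is assumed as the default root directory (pass a different value if hbase.rootdir differs on your cluster), and the hdfs client is assumed to be on the PATH.

```shell
# List the per-snapshot metadata directories under the HBase root directory.
list_snapshot_metadata() {
  root="${1:-/hbase}"   # assumed default; override to match hbase.rootdir
  hdfs dfs -ls "$root/.hbase-snapshot"
}
```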

Zero-Copy Restore and Clone Table

From a snapshot, you can create a new table (clone operation) or restore the original table. These two operations do not involve data copies; instead, a link is created to point to the original hfiles.

Changes to a cloned or restored table do not affect the snapshot or (in case of a clone) the original table.

To clone a table to another cluster, you export the snapshot to the other cluster and then run the clone operation; see Exporting a Snapshot to Another Cluster.

Reverting to a Previous HBase Version

Snapshots do not affect HBase backward compatibility if they are not used.

If you use snapshots, backward compatibility is affected as follows:

  • If you only take snapshots, you can still revert to a previous HBase version.
  • If you use restore or clone, you cannot revert to a previous version unless the cloned or restored tables have no links. Links cannot be detected automatically; you would need to inspect the file system manually.

Storage Considerations

Because hfiles are immutable, a snapshot consists of a reference to the files in the table at the moment the snapshot is taken. No copies of the data are made during the snapshot operation, but copies may be made when a compaction or deletion is triggered. In this case, if a snapshot has a reference to the files to be removed, the files are moved to an archive folder, instead of being deleted. This allows the snapshot to be restored in full.

Because no copies are performed, multiple snapshots share the same hfiles, but for tables with many updates and compactions, each snapshot could have a different set of hfiles.

Configuring and Enabling Snapshots

Snapshots are on by default; to disable them, set the hbase.snapshot.enabled property in hbase-site.xml to false:

<property>
   <name>hbase.snapshot.enabled</name>
   <value>false</value>
</property>

To enable snapshots after you have disabled them, set hbase.snapshot.enabled to true.

Snapshots do not affect HBase performance if they are not used.

Shell Commands

You can manage snapshots by using the HBase shell or the HBaseAdmin Java API.

The following table shows actions you can take from the shell.

Action

Shell command

Comments

Take a snapshot of tableX called snapshotX

snapshot 'tableX', 'snapshotX'

Snapshots can be taken while a table is disabled, or while a table is online and serving traffic.

  • If a table is disabled (using disable <table>), an offline snapshot is taken. This snapshot is managed by the master and fully consistent with the state when the table was disabled. This is the simplest and safest method, but it involves a service interruption because the table must be disabled to take the snapshot.
  • In an online snapshot, the table remains available while the snapshot is taken, and incurs minimal performance degradation of normal read/write loads. This snapshot is managed by the master and run on the RegionServers. The current implementation—simple-flush snapshots—provides no causal consistency guarantees. Despite this shortcoming, it offers the same degree of consistency as CopyTable and is a significant improvement.

Restore snapshot snapshotX (replaces the source table content)

restore_snapshot 'snapshotX'

For emergency use only; see Restrictions.

Restoring a snapshot replaces the current version of a table with a different version. To run this command, you must disable the target table. The restore command takes a snapshot of the table (appending a timestamp code), and then clones the snapshot data into the original table and removes data that is not in the snapshot. If the operation succeeds, the target table is enabled.

List all available snapshots

list_snapshots

 

List all available snapshots starting with 'my_snapshot_' (regular expression)

list_snapshots 'my_snapshot_.*'

 

Remove a snapshot called snapshotX

delete_snapshot 'snapshotX'
 

Create a new table tableY from a snapshot snapshotX

clone_snapshot 'snapshotX', 'tableY'

Cloning a snapshot creates a new read/write table that serves the data kept at the time of the snapshot. The original table and the cloned table can be modified independently; new data written to one table does not show up on the other.
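Because restore_snapshot requires the target table to be disabled first, the restore sequence from the table above can be sketched as a short noninteractive run. The function names are hypothetical, and the sketch assumes the hbase command is on the PATH. The final is_enabled statement only reports the table state after the restore so that you can confirm it.

```shell
# Emit the HBase shell statements for restoring a table from a snapshot.
restore_table_stmts() {
  if [ "$#" -ne 2 ]; then
    echo "Usage: restore_table_stmts <table> <snapshot>" >&2
    return 2
  fi
  printf "disable '%s'\n" "$1"            # the target table must be disabled
  printf "restore_snapshot '%s'\n" "$2"   # replaces the table content
  printf "is_enabled '%s'\n" "$1"         # report the table state afterwards
}

# Run the statements in HBase Shell noninteractive mode.
restore_table() {
  stmts=$(restore_table_stmts "$@") || return
  printf '%s\n' "$stmts" | hbase shell -n
}
```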

Taking a Snapshot Using a Shell Script

You can take a snapshot using an operating system shell script, such as a Bash script, in HBase Shell noninteractive mode, which is described in Accessing HBase by using the HBase Shell. This example Bash script shows how to take a snapshot in this way. This script is provided as an illustration only; do not use in production.

#!/bin/bash
# Take a snapshot of the table passed as an argument
# Usage: snapshot_script.sh table_name
# Names the snapshot in the format snapshot-YYYYMMDD

# Parse the arguments; quote $1 so an empty or missing argument is handled
if [ -z "$1" ] || [ "$1" == '-h' ]; then
  echo "Usage: $0 <table>"
  echo "       $0 -h"
  exit 1
fi

# Modify to suit your environment
export HBASE_PATH=/home/user/hbase
export DATE=$(date +"%Y%m%d")
echo "snapshot '$1', 'snapshot-$DATE'" | "$HBASE_PATH/bin/hbase" shell -n
# Capture the exit code of the HBase shell
status=$?
if [ "$status" -ne 0 ]; then
  echo "Snapshot may have failed: $status"
fi
exit $status

HBase Shell returns an exit code of 0 on success. A non-zero exit code indicates the possibility of failure, not a definite failure. In the event of a reported failure, your script should check whether the snapshot was created before taking the snapshot again.
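One way to perform that check is to look for the snapshot name in the output of the SnapshotInfo tool's -list-snapshots option (described under Information and Debugging). The following is a sketch only: the function name is hypothetical, and the parsing assumes the pipe-separated output layout shown in the examples on this page, which may differ between releases.

```shell
# Return 0 if a snapshot with the given name is listed, 1 if it is not.
snapshot_exists() {
  if [ -z "$1" ]; then
    echo "Usage: snapshot_exists <snapshot>" >&2
    return 2
  fi
  hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -list-snapshots 2>/dev/null |
    awk -F'|' -v want="$1" '
      NF >= 3 {
        name = $1
        gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the name column
        if (name == want) found = 1
      }
      END { exit (found ? 0 : 1) }
    '
}
```

A wrapper script could call snapshot_exists "snapshot-$DATE" after a non-zero exit code and retry the snapshot only when the function reports that the snapshot is missing.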

Exporting a Snapshot to Another Cluster

You can export any snapshot from one cluster to another. Exporting the snapshot copies the table's hfiles, logs, and the snapshot metadata from the source cluster to the destination cluster. Specify the -copy-from option to copy from a remote cluster to the local cluster or another remote cluster. If you do not specify the -copy-from option, the hbase.rootdir in the HBase configuration is used, which means that you are exporting from the current cluster. You must specify the -copy-to option to identify the destination cluster.

The ExportSnapshot tool executes a MapReduce job similar to distcp to copy files to the other cluster. It works at the file-system level, so the HBase cluster can be offline.

Run ExportSnapshot as the hbase user or the user that owns the files. If the user, group, or permissions need to be different on the destination cluster than the source cluster, use the -chuser, -chgroup, or -chmod options as in the second example below, or be sure the destination directory has the correct permissions. In the following examples, replace the HDFS server path and port with the appropriate ones for your cluster.

To copy a snapshot called MySnapshot to an HBase cluster srv2 (hdfs://srv2:8020/hbase) using 16 mappers:

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:<hdfs_port>/hbase -mappers 16

To export the snapshot and change the ownership of the files during the copy:

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:<hdfs_port>/hbase -chuser MyUser -chgroup MyGroup -chmod 700 -mappers 16

You can also use the Java -D option in many tools to specify MapReduce or other configuration properties. For example, the following command copies MY_SNAPSHOT to hdfs://cluster2/hbase using groups of 10 hfiles per mapper:

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -Dsnapshot.export.default.map.group=10 -snapshot MY_SNAPSHOT -copy-to hdfs://cluster2/hbase

(The number of mappers is calculated as TotalNumberOfHFiles/10.)

To export from one remote cluster to another remote cluster, specify both -copy-from and -copy-to parameters.

You can then reverse the direction to restore the snapshot back to the first remote cluster.

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot-test -copy-from hdfs://machine1/hbase -copy-to hdfs://machine2/my-backup

To specify a different name for the snapshot on the target cluster, use the -target option.

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot-test -copy-from hdfs://machine1/hbase -copy-to hdfs://machine2/my-backup -target new-snapshot
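When exports are scripted, the optional parameters shown in the examples above can be assembled conditionally. The following wrapper is a minimal sketch: the function name and its positional argument order are hypothetical, it covers only the -copy-from, -target, and -mappers options, and it assumes the hbase command is on the PATH.

```shell
# Usage: export_snapshot <snapshot> <copy-to> [copy-from] [target-name] [mappers]
# Pass an empty string ("") to skip an optional argument.
export_snapshot() {
  snapshot="$1"; copy_to="$2"; copy_from="$3"; target="$4"; mappers="$5"
  if [ -z "$snapshot" ] || [ -z "$copy_to" ]; then
    echo "Usage: export_snapshot <snapshot> <copy-to> [copy-from] [target-name] [mappers]" >&2
    return 2
  fi
  # Rebuild the positional parameters as the ExportSnapshot argument list;
  # optional flags are appended only when a value was supplied.
  set -- -snapshot "$snapshot" -copy-to "$copy_to"
  if [ -n "$copy_from" ]; then set -- "$@" -copy-from "$copy_from"; fi
  if [ -n "$target" ]; then set -- "$@" -target "$target"; fi
  if [ -n "$mappers" ]; then set -- "$@" -mappers "$mappers"; fi
  hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot "$@"
}
```

For example, export_snapshot MySnapshot hdfs://srv2:8020/hbase "" "" 16 corresponds to the first export example in this section.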

Restrictions

  • All the masters and RegionServers must be running CDH 5.
  • If you have enabled the AccessController Coprocessor for HBase, only a global administrator can take, clone, or restore a snapshot, and these actions do not capture the ACL rights. This means that restoring a table preserves the ACL rights of the existing table, and cloning a table creates a new table that has no ACL rights until the administrator adds them.
  • Do not take, clone, or restore a snapshot during a rolling restart. Snapshots require RegionServers to be up; otherwise, the snapshot fails.

If you are using HBase Replication, the replicas will be out of sync when you restore a snapshot. If you need to restore a snapshot, proceed as follows:

  1. Disable the table that is the restore target, and stop the replication.
  2. Remove the table from both the master and worker clusters.
  3. Restore the snapshot on the master cluster.
  4. Create the table on the worker cluster and use CopyTable to initialize it.

Snapshot Failures

Region moves, splits, and other metadata actions that happen while a snapshot is in progress can cause the snapshot to fail. The software detects and rejects corrupted snapshot attempts.

Information and Debugging

You can use the SnapshotInfo tool to get information about a snapshot, including status, files, disk usage, and debugging information.

Examples:

Use the -h option to print usage instructions for the SnapshotInfo utility.

$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -h
Usage: bin/hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo [options]
 where [options] are:
  -h|-help                Show this help and exit.
  -remote-dir             Root directory that contains the snapshots.
  -list-snapshots         List all the available snapshots and exit.
  -snapshot NAME          Snapshot to examine.
  -files                  Files and logs list.
  -stats                  Files and logs stats.
  -schema                 Describe the snapshotted table.

Use the -list-snapshots option to list all snapshots and exit.

$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -list-snapshots
SNAPSHOT             | CREATION TIME        | TABLE NAME
snapshot-test        |  2014-06-24T19:02:54 | test

Use the -remote-dir option with the -list-snapshots option to list snapshots located on a remote system.

$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -remote-dir s3a://mybucket/mysnapshot-dir -list-snapshots
SNAPSHOT |   CREATION TIME  | TABLE NAME
snapshot-test        2014-05-01 10:30    myTable

Use the -snapshot option to print information about a specific snapshot.

$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -snapshot test-snapshot
Snapshot Info
----------------------------------------
   Name: test-snapshot
   Type: DISABLED
  Table: test-table
Version: 0
Created: 2012-12-30T11:21:21

Use the -stats option with the -snapshot option to display additional statistics about a snapshot.

$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -stats -snapshot snapshot-test
Snapshot Info
----------------------------------------
   Name: snapshot-test
   Type: FLUSH
  Table: test
 Format: 0
Created: 2014-06-24T19:02:54


1 HFiles (0 in archive), total size 1.0k (100.00% 1.0k shared with the source table)

Use the -schema option with the -snapshot option to display the schema of a snapshot.

$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -schema -snapshot snapshot-test
Snapshot Info
----------------------------------------
   Name: snapshot-test
   Type: FLUSH
  Table: test
 Format: 0
Created: 2014-06-24T19:02:54

Table Descriptor
----------------------------------------
'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', 
COMPRESSION => 'GZ', VERSIONS => '1', TTL => 'FOREVER', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', 
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}

Use the -files option with the -snapshot option to list information about files contained in a snapshot.

$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -snapshot test-snapshot -files
Snapshot Info
----------------------------------------
   Name: test-snapshot
   Type: DISABLED
  Table: test-table
Version: 0
Created: 2012-12-30T11:21:21

Snapshot Files
----------------------------------------
   52.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/bdf29c39da2a4f2b81889eb4f7b18107 (archive)
   52.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/1e06029d0a2a4a709051b417aec88291 (archive)
   86.8k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/506f601e14dc4c74a058be5843b99577 (archive)
   52.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/5c7f6916ab724eacbcea218a713941c4 (archive)
  293.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/aec5e33a6564441d9bd423e31fc93abb (archive)
   52.4k test-table/02ba3a0f8964669520cf96bb4e314c60/cf/97782b2fbf0743edaacd8fef06ba51e4 (archive)

6 HFiles (6 in archive), total size 589.7k (0.00% 0.0 shared with the source table)
0 Logs, total size 0.0
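For monitoring how many of a snapshot's hfiles are held only in the archive, the summary line in the output above can be parsed in a script. The following filter is a sketch that assumes the "N HFiles (M in archive)" wording shown in the examples on this page; the function name is hypothetical, and the output format may differ between releases.

```shell
# Read SnapshotInfo output on stdin and print "<total> <archived>" hfile counts.
hfile_counts() {
  sed -n 's/^\([0-9][0-9]*\) HFiles (\([0-9][0-9]*\) in archive).*/\1 \2/p'
}
```

It can be used at the end of a pipeline, for example: hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -stats -snapshot snapshot-test | hfile_counts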