Managing HDFS Snapshots

This page demonstrates how to manage HDFS Snapshots using either Cloudera Manager or the command line.

Managing HDFS Snapshots Using Cloudera Manager

For HDFS services (CDH 5 only), a File Browser tab is available where you can view the HDFS directories associated with a service on your cluster. From there you can view the snapshots currently saved for your directories and files, and delete or restore them as appropriate. From the HDFS File Browser tab you can:

  • Designate HDFS directories to be "snapshottable" so snapshots can be created for those directories.
  • Initiate immediate (unscheduled) snapshots of a directory.
  • View the list of saved snapshots currently being maintained. These may include one-off immediate snapshots, as well as scheduled policy-based snapshots.
  • Delete a saved snapshot.
  • Restore an HDFS directory or file from a saved snapshot.
  • Restore an HDFS directory or file from a saved snapshot to a new directory or file (Restore As).

Browsing HDFS Directories

To browse the HDFS directories to view snapshot activity:

  1. From the Clusters tab, select your CDH 5 HDFS service.
  2. Go to the File Browser tab.
As you browse your HDFS directory structure, basic information about the selected directory (owner, group, and so on) is displayed on the right.
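
The same basic details are visible from a shell if you prefer the command line; a quick sketch, where /user/analytics is only an example path:

    hdfs dfs -ls -d /user/analytics    # permissions, owner, group, and modification time of the directory itself
    hdfs dfs -ls /user/analytics       # the directory's contents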

Enabling and Disabling HDFS Snapshots

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

An HDFS directory must be enabled for snapshots before snapshots of it can be created. You cannot specify a directory as part of a snapshot policy unless snapshots have been enabled for it.

To enable an HDFS directory for snapshots:
  1. From the Clusters tab, select your CDH 5 HDFS service.
  2. Go to the File Browser tab.
  3. Verify the Snapshottable Path and click Enable Snapshots.
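
The same step can be performed from the command line; a minimal sketch, assuming an example directory /user/analytics and a user with HDFS superuser privileges:

    hdfs dfsadmin -allowSnapshot /user/analytics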

To disable snapshots for a directory that has snapshots enabled, click the drop-down menu button at the upper right and select Disable Snapshots. If there are existing snapshots of the directory, they must be deleted before snapshots can be disabled.
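
The command-line equivalent, as a sketch with the same example directory (the command fails if any snapshots of the directory still exist):

    hdfs dfsadmin -disallowSnapshot /user/analytics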

Taking, Deleting, and Restoring HDFS Snapshots

Minimum Required Role: BDR Administrator (also provided by Full Administrator)

If a directory has been enabled for snapshots:
  • The Take Snapshot button is present, enabling you to take an immediate snapshot of the directory.
  • Any snapshots that have been taken are listed by the time at which they were taken, along with their names and a menu button.
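
The snapshot list can also be inspected from the command line; a quick sketch, assuming an example snapshottable directory /user/analytics:

    hdfs lsSnapshottableDir                  # snapshottable directories visible to the current user
    hdfs dfs -ls /user/analytics/.snapshot   # saved snapshots of that directory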

To take a snapshot, click Take Snapshot, specify the name of the snapshot, and click Take Snapshot. The snapshot is added to the snapshot list.
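
From the command line, the equivalent is a single command; a sketch, where the directory and snapshot name are examples:

    hdfs dfs -createSnapshot /user/analytics snap-2016-01-15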

To delete a snapshot, click the menu button next to the snapshot and select Delete.
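
The command-line equivalent, again with an example directory and snapshot name:

    hdfs dfs -deleteSnapshot /user/analytics snap-2016-01-15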

To restore a snapshot, click the menu button next to the snapshot and select Restore.
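
From the command line, a file or directory is restored by copying it back out of the read-only .snapshot directory; a sketch with example paths and snapshot name:

    hdfs dfs -cp /user/analytics/.snapshot/snap-2016-01-15/report.csv /user/analytics/report.csv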

When restoring HDFS data, if a MapReduce or YARN service is present in the cluster, DistCp (distributed copy) is used to restore directories, increasing the speed of restoration. The restore popup for HDFS (under More Options) allows you to select either MapReduce or YARN as the MapReduce service. For files, or if a MapReduce or YARN service is not present, a normal copy is performed. Use of DistCp allows you to configure the following options for the restoration, similar to those available when configuring a replication:

  • MapReduce Service - The MapReduce or YARN service to use.
  • Scheduler Pool - The scheduler pool to use.
  • Run as - The user that should run the job. By default this is hdfs. If you want to run the job as a different user, you can enter that here. If you are using Kerberos, you must provide a user name here, and it must be one with an ID greater than 1000. Verify that the user running the job has a home directory, /user/<username>, owned by username:supergroup in HDFS.
  • Log path - An alternative path for the logs.
  • Maximum map slots and Maximum bandwidth - Limits for the number of map slots and for bandwidth per mapper. The defaults are unlimited.
  • Abort on error - Whether to abort the job on an error (the default is not to abort). If the job aborts, files copied up to that point remain on the destination, but no additional files are copied.
  • Skip Checksum Checks - Whether to skip checksum checks (the default is to perform them). If checked, checksum validation will not be performed.
  • Delete policy - Whether files that were removed on the source should also be deleted from the target directory. This policy also determines the handling of files that exist in the target location but are unrelated to the source. There are three options:
    • Keep deleted files - Retains the destination files even when they no longer exist at the source (this is the default).
    • Delete to trash - If the HDFS trash is enabled, files will be moved to the trash folder.
    • Delete permanently - Uses the least amount of space, but should be used with caution.
  • Preserve - Whether to preserve the block size, replication count, permissions (including ACLs), and extended attributes (XAttrs) as they exist on the source file system, or to use the settings as configured on the target file system. The default is to preserve these settings as on the source. When Permission is checked, and both the source and target clusters support ACLs, replication preserves ACLs. Otherwise, ACLs are not replicated. When Extended attributes is checked, and both the source and target clusters support extended attributes, replication preserves them.
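
A DistCp job launched by hand illustrates how these options map to command-line flags; a rough sketch only, with example paths and limits (Cloudera Manager builds and submits its own job when you restore through the UI):

    # -p preserves status, -m caps the number of map tasks, -bandwidth caps MB/s per map,
    # -delete (with -update) removes target files that no longer exist in the snapshot
    hadoop distcp -update -delete -p -m 20 -bandwidth 100 \
        /user/analytics/.snapshot/snap-2016-01-15 /user/analytics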

Managing HDFS Snapshots Using the Command Line

For information about managing snapshots using the command line, see HDFS Snapshots.