The snapshots feature of the Apache Hadoop Distributed Filesystem (HDFS) enables you to capture point-in-time copies of the file system and protect your important data against corruption and user or application errors. This feature is available in all versions of Cloudera Data Platform (CDP), Cloudera Distribution for Hadoop (CDH) and Hortonworks Data Platform (HDP). Whether you have been using snapshots for a while or are contemplating their use, this blog gives you the insights and techniques to use them effectively.
Using snapshots to protect data is efficient for a few reasons. First of all, snapshot creation is instantaneous regardless of the size and depth of the directory subtree. Furthermore, snapshots capture the block list and file size for a specified subtree without creating extra copies of blocks on the file system. The HDFS snapshot feature is specifically designed to be very efficient for the snapshot creation operation, as well as for accessing or modifying the current files and directories in the file system. Creating a snapshot only adds a snapshot record to the snapshottable directory. Accessing a current file or directory does not require processing any snapshot records, so there is no additional overhead. Modifying a current file or directory, when it is also captured in a snapshot, requires adding a modification record for each input path. The trade-off is that some other operations, such as computing snapshot diffs, can be very expensive. In the next couple of sections of this blog, we'll first look at the complexity of various operations, and then we highlight the best practices that will help mitigate the overhead of these operations.
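As a quick illustration, here is how a directory is made snapshottable and a snapshot is created. The path and snapshot name below are hypothetical examples:

```shell
# Allow snapshots on the directory (requires HDFS administrator privileges)
hdfs dfsadmin -allowSnapshot /data/warehouse

# Create a snapshot named s0; this completes instantly regardless of subtree size
hdfs dfs -createSnapshot /data/warehouse s0

# The snapshot is exposed under the hidden, read-only .snapshot directory
hdfs dfs -ls /data/warehouse/.snapshot/s0
```

Note that these commands must be run against a live HDFS cluster.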
Let’s look at the time complexity, i.e. the overhead, of different operations on snapshotted files or directories. For simplicity, we assume the number of modifications (m) for each file/directory is the same across a snapshottable directory subtree, where the modifications for each file/directory are the records generated by the changes (e.g. set permission, create a file/directory, rename, etc.) on that file/directory.
1- Taking a snapshot always takes the same amount of effort: it only creates a record of the snapshottable directory and its state at that time. The overhead is independent of the directory structure, so we denote the time overhead as O(1).
2- Accessing a file or a directory in the current state is the same as without taking any snapshots. The snapshots add zero overhead compared to the non-snapshot access.
3- Modifying a file or a directory in the current state adds no overhead compared to non-snapshot access. It only adds a modification record in the filesystem tree for the modified path.
4- Accessing a file or a directory in a particular snapshot is also efficient – it has to traverse the snapshot records from the snapshottable directory down to the desired file/directory and reconstruct the snapshot state from the modification records. The access imposes an overhead of O(d*m), where
d – the depth from the snapshotted directory to the desired file/directory
m – the number of modifications captured from the current state to the given snapshot.
5- Deleting a snapshot requires traversing the entire subtree and, for each file or directory, performing a binary search for the to-be-deleted snapshot among its modification records. It also collects the blocks to be deleted as a result of the operation. This results in an overhead of O(b + n log(m)), where
b – the number of blocks to be collected,
n – the number of files/directories under the snapshot diff path
m – the number of modifications captured from the current state to the to-be-deleted snapshot.
Note that deleting a snapshot only performs log(m) operations for binary searching the to-be-deleted snapshot but not for reconstructing it.
6- Computing the snapshot diff between a newer and an older snapshot has to reconstruct the newer snapshot state for each file and directory under the snapshot diff path. Then the process has to compute the diff between the newer and the older snapshot. This imposes an overhead of O(n*(m+s)), where
n – the number of files and directories under the snapshot diff path,
m – the number of modifications captured from the current state to the newer snapshot
s – the number of snapshots between the newer and the older snapshots.
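The two expensive operations above correspond to the following CLI commands. The path and snapshot names are hypothetical, assuming the directory was made snapshottable earlier:

```shell
# Compute the diff between snapshots s0 and s1 of a snapshottable directory;
# cost grows as O(n*(m+s)) with the subtree size and modification count
hdfs snapshotDiff /data/warehouse s0 s1

# Delete a snapshot that is no longer needed; cost is O(b + n log(m))
hdfs dfs -deleteSnapshot /data/warehouse s0
```

Both commands require a live HDFS cluster and can be expensive on large, heavily modified subtrees, so they are best scheduled during off-peak hours.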
We summarize the operations in the table below:
| Operation | Overhead | Remarks |
| --- | --- | --- |
| Taking a snapshot | O(1) | Adds a snapshot record |
| Accessing a file/directory in the current state | No additional overhead from snapshots | NA |
| Modifying a file/directory in the current state | Adds a modification record for each input path | NA |
| Accessing a file/directory in a particular snapshot | O(d*m) | d: depth from the snapshotted directory; m: number of modifications |
| Deleting a snapshot | O(b + n log(m)) | b: blocks to be collected; n: files/directories under the snapshot diff path; m: number of modifications |
| Computing snapshot diff | O(n*(m+s)) | n: files/directories under the snapshot diff path; m: number of modifications; s: number of snapshots in between |
We provide best practice guidelines in the next section.
Now that you are aware of the operational impact that these operations have on snapshotted files and directories, here are some key tips and tricks to help you get the most benefit from your HDFS Snapshot usage.
Example: Suppose snapshots s0 and s1 are taken on /, and between the two snapshots the file /foo/bar/file is renamed to /sub/file.

When running the diff at /, it will show the rename operation:

Difference between snapshot s0 and snapshot s1 under directory /:
M ./foo/bar
R ./foo/bar/file -> ./sub/file
M ./sub

When running the diff at the subtrees /foo and /sub, it will show the rename operation as a delete-and-create:

Difference between snapshot s0 and snapshot s1 under directory /sub:
M .
+ ./file

Difference between snapshot s0 and snapshot s1 under directory /foo:
M ./bar
- ./bar/file
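The example above can be reproduced with the following commands, assuming / has been made snapshottable and /foo/bar/file and /sub already exist:

```shell
# Take the first snapshot, rename the file, then take the second snapshot
hdfs dfs -createSnapshot / s0
hdfs dfs -mv /foo/bar/file /sub/file
hdfs dfs -createSnapshot / s1

# Diff at the root reports the change as a rename (R) entry
hdfs snapshotDiff / s0 s1

# Diffs scoped to the subtrees report a delete (-) and a create (+) instead
hdfs snapshotDiff /sub s0 s1
hdfs snapshotDiff /foo s0 s1
```

This matters in practice because tools that replicate data based on subtree diffs may re-copy a renamed file rather than simply moving it.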
In this blog, we have explored the HDFS Snapshot feature, how it works, and the overhead that various file operations on snapshotted directories impose. To help you get started, we also highlighted several best practices and recommendations for working with snapshots to draw out their benefits with minimal overhead.
For more information about using HDFS Snapshots, please read the Cloudera Documentation
on the subject. Our Professional Services, Support and Engineering teams are available to share their knowledge and expertise with you to implement Snapshots effectively. Please reach out to your Cloudera account team or get in touch with us here.