Configuring Heterogeneous Storage in HDFS

CDH supports a variety of storage types in the Hadoop Distributed File System (HDFS). Earlier releases of CDH used a single (or homogeneous) storage model. Now you can choose which storage type to assign to each DataNode Data Directory. Matching the storage type to how often data is accessed lets you improve performance for frequently used data and lower the cost of storing rarely used data. This topic describes these storage types and how to configure CDH to use them.

Overview

Each DataNode in a cluster is configured with a set of data directories. You can configure each data directory with a storage type. A storage policy, which is set on a file or directory in HDFS, dictates which storage types are used to store its replicas.

Some reasons to consider using different types of storage are:

  • You have datasets with temporal locality (for example, time-series data). The latest data can be loaded initially into SSD for improved performance, then migrated out to disk as it ages.
  • You need to move cold data to denser archival storage because the data will rarely be accessed and archival storage is much cheaper. This could be done with simple age-out policies: for example, moving data older than six months to archival storage.

Storage Types

The storage type identifies the underlying storage media. HDFS supports the following storage types:

  • ARCHIVE - Archival storage is for very dense storage and is useful for rarely accessed data. This storage type is typically cheaper per TB than normal hard disks.
  • DISK - Hard disk drives are relatively inexpensive and provide good sequential I/O performance. This is the default storage type.
  • SSD - Solid state drives are useful for storing hot data and I/O-intensive applications.
  • RAM_DISK - This special in-memory storage type is used to accelerate low-durability, single-replica writes.

When you add the DataNode Data Directory, you can specify which type of storage it uses, by prefixing the path with the storage type, in brackets. If you do not specify a storage type, it is assumed to be DISK. See Adding Storage Directories.
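
The prefix convention can be illustrated with a short shell sketch. The helper names storage_type_of and path_of are hypothetical and are not part of HDFS or Cloudera Manager; they only show how an entry splits into a storage type and a path, and how an untagged entry falls back to DISK.

```shell
#!/bin/sh
# Hypothetical helpers: split a data directory entry such as "[SSD]/dfs/dn1"
# into its storage type and its path.

storage_type_of() {
  case "$1" in
    \[*\]*)
      tag=${1%%]*}          # "[SSD]/dfs/dn1" -> "[SSD"
      echo "${tag#\[}"      # "[SSD" -> "SSD"
      ;;
    *)
      echo "DISK"           # no prefix: DISK is assumed
      ;;
  esac
}

path_of() {
  case "$1" in
    \[*\]*) echo "${1#*]}" ;;   # drop everything up to the first "]"
    *)      echo "$1" ;;
  esac
}

for entry in "[SSD]/dfs/dn1" "[ARCHIVE]/dfs/dn3" "/dfs/dn2"; do
  echo "$(storage_type_of "$entry") $(path_of "$entry")"
done
```

The last entry has no prefix, so the sketch reports it as DISK.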

Storage Policies

A storage policy contains information that describes the type of storage to use. This policy also defines the fallback storage type if the primary type is out of space or out of quota. If a target storage type is not available, HDFS attempts to place replicas on the default storage type.

Each storage policy consists of a policy ID, a policy name, a list of storage types, a list of fallback storage types for file creation, and a list of fallback storage types for replication.

HDFS has six preconfigured storage policies.

  • Hot - All replicas are stored on DISK.
  • Cold - All replicas are stored on ARCHIVE.
  • Warm - One replica is stored on DISK and the others are stored on ARCHIVE.
  • All_SSD - All replicas are stored on SSD.
  • One_SSD - One replica is stored on SSD and the others are stored on DISK.
  • Lazy_Persist - The replica is written to RAM_DISK and then lazily persisted to DISK.
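
As a rough illustration of the list above, the following shell sketch prints the storage type each replica of a block would target under a given policy. The function replica_types is hypothetical and ignores fallback storage types; it only restates the placement rules listed above. Placing additional Lazy_Persist replicas on DISK is an assumption of this sketch, since that policy is intended for single-replica writes.

```shell
#!/bin/sh
# Hypothetical helper: list the target storage type for each replica.
# Usage: replica_types <policy> <replication_factor>
replica_types() {
  policy=$1
  replicas=$2
  out=""
  i=1
  while [ "$i" -le "$replicas" ]; do
    case "$policy" in
      Hot)     stype=DISK ;;
      Cold)    stype=ARCHIVE ;;
      All_SSD) stype=SSD ;;
      Warm)    if [ "$i" -eq 1 ]; then stype=DISK; else stype=ARCHIVE; fi ;;
      One_SSD) if [ "$i" -eq 1 ]; then stype=SSD; else stype=DISK; fi ;;
      # Assumption: replicas beyond the first go to DISK.
      Lazy_Persist) if [ "$i" -eq 1 ]; then stype=RAM_DISK; else stype=DISK; fi ;;
      *) echo "unknown policy: $policy" >&2; return 1 ;;
    esac
    out="${out:+$out }$stype"   # space-separated list
    i=$((i + 1))
  done
  echo "$out"
}

replica_types Warm 3      # prints: DISK ARCHIVE ARCHIVE
replica_types One_SSD 3   # prints: SSD DISK DISK
```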

Setting Up SSD Storage Using Cloudera Manager

  1. Set up your cluster normally, but customize your DataNodes by adding the [SSD] prefix to the data directories that reside on SSDs. You can also add the [SSD] prefix after initial setup, which requires an extra HDFS restart.
  2. Stop HBase.
  3. Using the HDFS client, move /hbase to /hbase_backup.
  4. Re-create /hbase using the Cloudera Manager command in the HBase service (this ensures that proper permissions are used).
  5. Using the HDFS client, set the storage policy for /hbase to All_SSD so that all replicas are stored on SSD.
  6. Use DistCp to copy /hbase_backup to /hbase.

    hadoop distcp /hbase_backup /hbase

  7. Start HBase.
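
The HDFS client steps in this procedure (3, 5, and 6) might be scripted roughly as follows. This is a sketch, not a supported tool: the run, backup_hbase, and restore_hbase_on_ssd names are hypothetical, and the script prints the commands instead of executing them unless DRY_RUN is set to 0. Steps 2, 4, and 7 are still performed in Cloudera Manager.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 3: move the existing HBase data out of the way.
backup_hbase() {
  run hdfs dfs -mv /hbase /hbase_backup
}

# Steps 5 and 6: run only after /hbase has been re-created in step 4.
restore_hbase_on_ssd() {
  run hdfs storagepolicies -setStoragePolicy -path /hbase -policy All_SSD &&
    run hadoop distcp /hbase_backup /hbase
}

# Preview both phases; the subshell forces dry-run mode for the preview.
( DRY_RUN=1; backup_hbase; restore_hbase_on_ssd )
```

For a real run, call backup_hbase with DRY_RUN=0, re-create /hbase from Cloudera Manager, and then call restore_hbase_on_ssd.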

Setting a Storage Policy for HDFS

Setting a Storage Policy for HDFS Using Cloudera Manager

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

To configure storage types for DataNode Data Directories and set a storage policy on HDFS paths using Cloudera Manager, perform the following tasks:
  1. Check the HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml to be sure that dfs.storage.policy.enabled has not been changed from its default value of true.
  2. Specify the storage types for each DataNode Data Directory that is not a standard disk, by adding the storage type in brackets at the beginning of the directory path. For example:
    [SSD]/dfs/dn1
    [DISK]/dfs/dn2
    [ARCHIVE]/dfs/dn3
  3. Open a terminal session on any HDFS host. Run the following hdfs command for each path on which you want to set a storage policy:
    $ hdfs storagepolicies -setStoragePolicy -path <path_to_file_or_directory> -policy <policy_name>
  4. To move existing data to the appropriate storage based on the current storage policy, run the mover utility from any HDFS host. Run hdfs mover -h for a list of available options. To migrate all data at once, set the path to /; this can take a long time.
    $ hdfs mover -p <path>

Setting a Storage Policy for HDFS Using the Command Line

To set a storage policy on a file or directory, perform the following tasks:
  1. Make sure the dfs.storage.policy.enabled property (in the conf/hdfs-site.xml file) is set to true. This is the default setting.
  2. Make sure the storage locations are tagged with their storage types. A directory without a storage type tag is treated as DISK. Add the storage type to each entry in the dfs.datanode.data.dir property in conf/hdfs-site.xml.
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>
       [DISK]file:///grid/dn/disk0,[SSD]file:///grid/dn/ssd0,[RAM_DISK]file:///grid/dn/ram0,
       [ARCHIVE]file:///grid/dn/archive0
      </value>
    </property>
  3. Set the storage policy for the HDFS path. Enter the following command on any HDFS host:
    $ hdfs storagepolicies -setStoragePolicy -path <path_to_file_or_directory> -policy <policy_name>
  4. To move existing data to the appropriate storage based on the current storage policy, run the mover utility from any HDFS host. Run hdfs mover -h for a list of available options. To migrate all data at once, set the path to /; this can take a long time.
    $ hdfs mover -p <path>
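
Steps 3 and 4 can be combined in a small wrapper, sketched below. The function names and the /data/archive path are hypothetical. The wrapper rejects policy names other than the six preconfigured ones listed in this topic, then sets the policy, reads it back, and runs the mover. It prints the commands rather than executing them unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Accept only the preconfigured policy names described in this topic.
is_known_policy() {
  case "$1" in
    Hot|Cold|Warm|All_SSD|One_SSD|Lazy_Persist) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: apply_policy <hdfs_path> <policy_name>
apply_policy() {
  path=$1
  policy=$2
  if ! is_known_policy "$policy"; then
    echo "unknown storage policy: $policy" >&2
    return 1
  fi
  run hdfs storagepolicies -setStoragePolicy -path "$path" -policy "$policy" &&
    run hdfs storagepolicies -getStoragePolicy -path "$path" &&
    run hdfs mover -p "$path"
}

# Hypothetical example path.
apply_policy /data/archive Cold
```

Checking the policy name before calling HDFS keeps a typing mistake from reaching the cluster; the read-back step confirms the policy before any blocks are moved.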

Managing Storage Policies

  • To get the storage policy for a specific file or directory, run the following command from the command line on any HDFS host:
    $ hdfs storagepolicies -getStoragePolicy -path <path_to_file_or_directory>
  • To list all storage policies available in the cluster, enter the following command:
    $ hdfs storagepolicies -listPolicies
  • To change the storage policy on a file or directory, set the new policy by following the steps in Setting a Storage Policy for HDFS.

Migrating Existing Data

To move existing data to the appropriate storage based on the current storage policy, run the mover utility from any HDFS host. Run hdfs mover -h for a list of available options. To migrate all data at once, set the path to /; this can take a long time.
$ hdfs mover -p <path>
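
Because migrating / in one pass can take a long time, one option is to run the mover one directory at a time. The sketch below assumes hypothetical paths and a hypothetical migrate_paths helper, and prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run the mover on each path in turn and stop at the first failure.
migrate_paths() {
  for p in "$@"; do
    run hdfs mover -p "$p" || return 1
  done
}

# Hypothetical directories, migrated one after the other.
migrate_paths /data/2023 /data/2024
```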