This is the documentation for Cloudera Enterprise 6.0.x. Beta software is solely for evaluation and testing purposes, and should not be used in production under any circumstances. For production environments, use Cloudera Enterprise 5. To view the latest Cloudera Enterprise 5 documentation, go to Cloudera Product Documentation.

Deploy Erasure Coding

Before You Begin

Before you can deploy Erasure Coding (EC), perform the following tasks:

  • Verify that the clusters run CDH 6.0 or higher.
  • Determine which EC policy you want to use.
  • Determine if you want to use EC for existing data or new data

Understanding Erasure Coding Policies

The EC policy determines how data is encoded and decoded. An EC policy is comprised of the following parts: codec-number of data blocks-number of parity blocks-cell size.

  • Codec: The erasure codec that the policy uses. It can be XOR or Reed-Solomon (RS)
  • Number of Data Blocks: The number of data blocks per stripe. The higher this number, the more nodes that need to be read when accessing data.
  • Number of Parity Blocks: The number of parity blocks per stripe.
  • Cell Size: The size of one basic unit of striped data.

For example, a RS-10-4-1024k policy has the following attributes:

  • Codec: Reed-Solomon
  • Number of Data Blocks: 10
  • Number of Parity Blocks: 4
  • Cell Size: 1024k

The sum of the number of data blocks and parity blocks is the data stripe width. Therefore, the number of racks should at least equal the stripe width in order for the data to be resistant to rack failures. Ideally, the number of racks exceeds the data stripe width to account for downtime and outages. If there are fewer racks than the data stripe width, HDFS spreads data across multiple nodes to maintain fault tolerance at the node level. To achieve node-level fault tolerance, the number of nodes needs to equal the data stripe width.

For example, in order for a RS-10-4-1024k policy to be rack-failure tolerant, you must have at least 14 racks, 10 racks for the data blocks and 4 racks for the parity blocks. This comes with a tradeoff though. Data must be read from 10 blocks, increasing the read time. Therefore, the larger the cluster and colder the data, the more appropriate it is to use EC policies with large data stripe widths.

Enabling Erasure Coding

Enable EC using the Cloudera Manager Admin Console:

  1. Select Clusters and choose the HDFS cluster you want to enable EC for.
  2. Navigate to the Configuration tab and select the Erasure Coding category.
  3. Configure the EC properties:
    • DataNode Striped Read Timeout: DataNode striped read timeout in milliseconds.
    • DataNode Striped Read Threads: Number of threads used by the DataNode to read striped blocks during background reconstruction work.
    • Erasure Coding Reconstruction Weight: Relative weight of resources used by EC background recovery tasks, which require reading multiple blocks, 6 in the case of RS-6-3-1024k, compared to replicated block recovery, which only requires reading a single replica. Higher values result in fewer reconstruction tasks being able to run concurrently. Blocks required to be read to complete recovery are multiplied by this weight to determine the total weight of the recovery task. These units of weight count against the limit set by dfs.namenode.replication.max-streams.
    • Default Policy when Setting Erasure Coding: The erasure coding policy that will be used when setting an erasure coding policy without specifying which one.
    • Erasure Coding Enabled: Allows erasure coding policies to be enabled and set for directories. Erasure coding is currently not supported and is experimental only.
  4. Run the following command:
    hdfs ec -setPolicy -path directory [-policy policyName]
    • path. Required. Specify the HDFS directory you want to apply the EC policy to.
    • policy. Optional. The EC policy you want to use for the directory you specified. If you do not provide this parameter, the EC policy you specified in step 3 for the Default Policy when Setting Erasure Coding is used.
    This command applies the EC policy to data written after the command is run. It does not apply EC policies to existing data. See Using Erasure Coding for Existing Data.

Using Erasure Coding for Existing Data

In order to use EC with existing data, that data must be copied into a directory that has EC enabled. To copy the data, you can use distcp or Cloudera Manager's BDR.

Using Erasure Coding for New Data

In order to use EC with new data, set the destination for the data to a directory with EC enabled. No action beyond that is required. When data is written to the directory, it will be erasure coded based on the policy you set.