Cloudera Backup and Disaster Recovery
About this Guide
This guide describes Cloudera Backup and Disaster Recovery (BDR), a separately licensed Cloudera product that provides an integrated, easy-to-use management solution for enabling data protection in the Hadoop platform.
The information in this guide is also available in the online Help included with Cloudera Manager.
The following sections are covered in this Guide:
Backup and Disaster Recovery Overview
Cloudera Backup and Disaster Recovery (BDR) provides an integrated, easy-to-use management solution for enabling data protection in the Hadoop platform. Cloudera BDR 1.0 replicates data that is stored in HDFS and accessed through Hive from one datacenter to another for disaster recovery scenarios. When critical data is stored on HDFS, Cloudera BDR provides the capabilities needed to keep that data available at all times, even if an entire datacenter shuts down.
Cloudera BDR provides key capabilities that are fully integrated into the Cloudera Manager User Interface:
- Select: Choose the key datasets that are critical for your business operations.
- Schedule: Create an appropriate schedule for data replication – trigger replication as quickly as is appropriate for your business needs.
- Monitor: Track progress of your replication jobs through a central console and easily identify issues or files that failed to be transferred.
- Alert: Issue alerts when a replication job fails or is aborted so that the problem can be diagnosed expeditiously.
These capabilities work seamlessly across Hive and HDFS: replication can be set up on files or directories in the case of HDFS and on tables in the case of Hive, without any manual translation of Hive datasets into HDFS datasets or vice versa. Hive Metastore information is also replicated, so applications that depend on the table definitions stored in Hive work correctly on both the replica side and the source side as table definitions are updated.
Replication is built on top of a hardened version of “distcp”. It uses the scalability and availability of MapReduce itself to parallelize the copying of files: a specialized MapReduce job diffs the source against the replica, and each Mapper transfers only the changed files to the replica side.
Cloudera BDR also provides the ability to do a “Dry Run”, which verifies the configuration and shows the cost of the overall operation before the entire dataset is actually copied.
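The changed-file selection and Dry Run described above can be illustrated with a small local sketch. This is not Cloudera BDR code and does not use distcp or MapReduce: it only mimics the idea on ordinary directories. The function names `changed_files` and `replicate`, and the rule "copy if missing, different in size, or newer on the source", are assumptions made for illustration.

```python
import os
import shutil


def changed_files(source_dir, replica_dir):
    """Yield paths (relative to source_dir) that need to be copied.

    Assumed rule for this sketch: a file needs copying if it is missing
    on the replica, differs in size, or is newer on the source.
    """
    for root, _dirs, files in os.walk(source_dir):
        for name in sorted(files):
            src = os.path.join(root, name)
            rel = os.path.relpath(src, source_dir)
            dst = os.path.join(replica_dir, rel)
            if not os.path.exists(dst):
                yield rel
                continue
            s, d = os.stat(src), os.stat(dst)
            if s.st_size != d.st_size or s.st_mtime > d.st_mtime:
                yield rel


def replicate(source_dir, replica_dir, dry_run=False):
    """Copy only changed files from source_dir to replica_dir.

    With dry_run=True nothing is copied; the function only reports
    which files would be transferred and how many bytes that is.
    Returns (list_of_relative_paths, total_bytes).
    """
    to_copy = list(changed_files(source_dir, replica_dir))
    total_bytes = sum(
        os.path.getsize(os.path.join(source_dir, rel)) for rel in to_copy
    )
    if not dry_run:
        for rel in to_copy:
            dst = os.path.join(replica_dir, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            # copy2 preserves the modification time, so an unchanged file
            # is not selected again on the next run.
            shutil.copy2(os.path.join(source_dir, rel), dst)
    return to_copy, total_bytes
```

Calling `replicate(src, dst, dry_run=True)` first gives a list of files and a byte count to review; calling it again without `dry_run` performs the copy, and a subsequent run finds nothing left to transfer.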
Since the functionality is implemented as an add-on to Cloudera Manager, all Cloudera BDR functionality is available directly through the Cloudera Manager Admin Console.
Minimum Supported Version: CDH 4.0
License: BDR requires a separate license for each node on the destination side of your replication cluster.
For replication jobs to work, the following ports must be open to allow communication between the source and destination Cloudera Manager servers and the HDFS, MapReduce, and Hive hosts:
- Cloudera Manager Admin Console port: Default is 7180.
- HDFS NameNode port: Default is 8020.
- HDFS DataNode port: Default is 50010.
See Configuring Ports for Cloudera Manager for more information, including how to verify the current values for these ports.
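Before scheduling replication, it can help to confirm that the ports listed above are reachable from the other cluster. The following is a minimal sketch using only the Python standard library. It is not a Cloudera tool, and the helper names `port_is_open` and `check_hosts` are hypothetical. The port numbers are the defaults listed above; if your deployment uses other values, pass them in instead.

```python
import socket

# Default ports from the list above; the keys are labels used only by this sketch.
DEFAULT_PORTS = {
    "Cloudera Manager Admin Console": 7180,
    "HDFS NameNode": 8020,
    "HDFS DataNode": 50010,
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts_by_role, ports=DEFAULT_PORTS, timeout=3.0):
    """Check every host under each role label against that role's port.

    hosts_by_role maps a role label (a key of `ports`) to a list of hostnames.
    Returns a dict mapping (role, host, port) to True or False.
    """
    results = {}
    for role, hosts in hosts_by_role.items():
        port = ports[role]
        for host in hosts:
            results[(role, host, port)] = port_is_open(host, port, timeout)
    return results
```

For example, `check_hosts({"HDFS NameNode": ["nn1.example.com"]})` (the hostname is a placeholder) returns a dictionary whose single entry shows whether port 8020 on that host accepted a connection. A `False` result shows only that the connection attempt failed from the machine running the check; it does not identify whether a firewall, a stopped service, or a non-default port is the cause.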