Many Cloudera customers are moving from fully on-prem deployments to the cloud, either by backing up their data in the cloud or by running multi-function analytics on CDP Public Cloud in AWS or Azure.
The Replication Manager service facilitates both disaster recovery and data migration across different environments. Using easy-to-define policies, Replication Manager removes one of the biggest barriers customers face in their cloud adoption journey: it lets them move both tables (structured data) and files (unstructured data) to the CDP cloud of their choice.
Policies can be set up to replicate periodically, which keeps the migration of workloads to the cloud under the control of the end user.
Replication Manager can be used to migrate Apache Hive, Apache Impala, and HDFS objects from CDH clusters to CDP Public Cloud clusters.
The Replication Manager support matrix is documented in our public docs.
This blog post gives detailed, step-by-step instructions for performing Hive replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake.
For context, the setup used is as follows.
In addition to meeting the support matrix requirements, both the CDH cluster and the CDP Data Lake cluster need to meet the following requirements.
This step involves checking the following configs on source:
In order to copy or migrate data from CDH cluster to CDP Data Lake cluster, the on-prem CDH cluster should be able to access the CDP cloud storage.
To enable this, an External Account is configured on the CDH Cluster. In our example we use the following External Account.
We have used an access key / secret key pair to ensure that the External Account configured on CDH can successfully access the S3 buckets used by Data Lake.
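One way to confirm that the key pair really can reach the Data Lake bucket is to list it from a CDH host through the Hadoop S3A connector. The sketch below only assembles that command. It assumes the `hadoop` CLI is on the PATH and uses the standard S3A properties `fs.s3a.access.key` and `fs.s3a.secret.key`; the bucket name, path, and environment variable names are placeholders, not values from this setup.

```python
import os
import shlex


def build_s3a_list_command(bucket, path, access_key, secret_key):
    """Return the argv for a `hadoop fs -ls` against an S3A location."""
    location = "s3a://{}/{}".format(bucket, path.lstrip("/"))
    return [
        "hadoop", "fs",
        "-D", "fs.s3a.access.key={}".format(access_key),
        "-D", "fs.s3a.secret.key={}".format(secret_key),
        "-ls", location,
    ]


def redact(argv):
    """Return a printable command line with the secret key masked."""
    masked = []
    for arg in argv:
        if arg.startswith("fs.s3a.secret.key="):
            arg = "fs.s3a.secret.key=****"
        masked.append(arg)
    return " ".join(shlex.quote(a) for a in masked)


if __name__ == "__main__":
    # Hypothetical variable names and bucket; substitute your own.
    cmd = build_s3a_list_command(
        bucket="example-datalake-bucket",
        path="/warehouse",
        access_key=os.environ.get("EXAMPLE_ACCESS_KEY", "<access-key>"),
        secret_key=os.environ.get("EXAMPLE_SECRET_KEY", "<secret-key>"),
    )
    # Print only the masked form; run the real argv with subprocess.run(cmd)
    # on a CDH host if you want to execute the check.
    print(redact(cmd))
```

Keys passed with `-D` are visible in the process list while the command runs, so this is suited to a one-off check rather than routine use.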
The Sentry service serves authorization metadata from the database backed storage; it does not handle actual privilege validation. The Hive, Impala, and Solr services are clients of this service and will enforce Sentry privileges when they are configured to use Sentry.
More details about using and managing Sentry.
Replication Manager allows administrators to migrate the existing Sentry permissions from the source CDH cluster to the Ranger policies in CDP Public Cloud.
The Apache Ranger access policy model consists of two major components:
The hdfs user must have access to all Hive datasets, including all operations. Otherwise, the Hive import fails during the replication process. To provide access, follow these steps:
Hadoop SQL Policies overview
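For administrators who script this change instead of using the Ranger UI, the grant can be pictured as an edit to a policy's JSON. The sketch below is a minimal illustration, assuming the policy has already been exported as a dictionary that follows Ranger's policy model (`policyItems`, `users`, `accesses`); the helper name is hypothetical, and fetching and saving the policy are left out.

```python
import copy


def grant_all_to_user(policy, user="hdfs"):
    """Return a copy of a Ranger policy dict that allows `all` for `user`.

    If an existing policy item already grants the `all` access type, the
    user is added to that item; otherwise a new allow item is appended.
    The input dictionary is never modified.
    """
    updated = copy.deepcopy(policy)
    items = updated.setdefault("policyItems", [])
    for item in items:
        grants_all = any(
            access.get("type") == "all" and access.get("isAllowed", True)
            for access in item.get("accesses", [])
        )
        if grants_all:
            users = item.setdefault("users", [])
            if user not in users:
                users.append(user)
            return updated
    items.append({
        "users": [user],
        "accesses": [{"type": "all", "isAllowed": True}],
    })
    return updated


if __name__ == "__main__":
    # Illustrative policy content, not copied from a real Data Lake.
    existing = {
        "name": "all - database, table, column",
        "policyItems": [
            {"users": ["hive"], "accesses": [{"type": "all", "isAllowed": True}]}
        ],
    }
    print(grant_all_to_user(existing)["policyItems"])
```

Working on a copy makes it easy to compare the before and after versions of the policy before saving anything back to Ranger.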
Before Replication Manager can use the CDH cluster as a source, the CDH cluster must be registered as a Classic Cluster in the CDP Public Cloud control plane.
The following article covers how to register a CDH Cluster as a Classic Cluster in the CDP control plane.
Replication Manager will be able to list the classic clusters that have HDFS, Hive, YARN and other necessary services that are currently accessible/running.
Thus, if a source Cloudera Manager is managing Cluster 1 and Cluster 2, which are running the necessary services and are currently registered/accessible, either of the clusters can be used as a source.
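The eligibility rule described above can be sketched as a simple filter. This is only an illustration of the logic, not how Replication Manager is implemented: the input shape (`name`, `registered`, `running_services`) is hypothetical, and the required set lists only the three services named in the text.

```python
# Services named in the text; "other necessary services" are not enumerated here.
REQUIRED_SERVICES = {"HDFS", "HIVE", "YARN"}


def eligible_sources(clusters):
    """Return names of clusters that are registered and run every required service."""
    return [
        cluster["name"]
        for cluster in clusters
        if cluster.get("registered")
        and REQUIRED_SERVICES <= set(cluster.get("running_services", []))
    ]


if __name__ == "__main__":
    clusters = [
        {"name": "Cluster 1", "registered": True,
         "running_services": ["HDFS", "HIVE", "YARN", "IMPALA"]},
        {"name": "Cluster 2", "registered": True,
         "running_services": ["HDFS", "YARN"]},  # Hive is not running
    ]
    print(eligible_sources(clusters))
```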
During Hive replication, the following data sets can be replicated from the CDH cluster for the specified databases (and tables).
In this blog post, the Hive table, “default:customers” is being replicated from the CDH cluster to the CDP Data Lake cluster.
The various details of this table are as follows:
The Replication Manager wizard walks through several steps to create a Hive replication policy.
These steps are outlined below.
Step 1: Provide the policy name and description here. Select Hive and click Next.
Step 2: Select the registered CDH classic cluster from the drop down list if you have multiple classic clusters and click Next.
Step 3: Select the Destination Data Lake Cluster and provide the name of the Cloud Credential (present as External Account on the CDH cluster or create a new Cloud Credential on the selected source cluster). Click Next.
Step 4: Determine the schedule frequency for the policy to execute, and then click Next.
Step 5: Provide details about the Export task which is executed on the source cluster and click Create.
The replication policy is now created and the admin can now check its status.
Visit this community article for step-by-step details.
Once the Hive replication policy has successfully executed, the admin can perform the following validations to ensure that replication was indeed successful:
Replication Manager has a comprehensive troubleshooting guide, and this blog post is not a substitute for it. However, it does list the most common errors that administrators face while using Replication Manager.
Once the admin has pre-created all the required groups in Ranger, the Sentry to Ranger import completes successfully.
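A quick way to find which groups still need to be created is to compare the groups referenced by Sentry with those already present in Ranger. The sketch below assumes both lists have been collected by hand or by script; the function name and the sample group names are hypothetical.

```python
def missing_ranger_groups(sentry_groups, ranger_groups):
    """Return the groups used by Sentry that do not yet exist in Ranger, sorted."""
    return sorted(set(sentry_groups) - set(ranger_groups))


if __name__ == "__main__":
    sentry_groups = ["analysts", "etl", "admins"]
    ranger_groups = ["admins", "public"]
    print(missing_ranger_groups(sentry_groups, ranger_groups))
    # prints ['analysts', 'etl']
```

An empty result means every group referenced on the Sentry side already exists in Ranger.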
The purpose of this blog post is to walk through, step by step, the workflow for performing Hive replication between a CDH cluster and a CDP Data Lake cluster.
Each cluster configuration differs in subtle ways, so these steps should help administrators account for those differences and successfully create a Hive replication policy using Replication Manager.
If you seek further clarification while running a Hive replication policy, please provide your feedback in the comments or on Cloudera Community and we will make sure to address it in the next revision.