Hive Replication

Hive Replication enables you to copy (back up) and keep in sync the Hive Metastore and data from a cluster managed by a remote peer or by your local Cloudera Manager server, keeping the copy on a cluster managed by your local Cloudera Manager server (the server whose Admin console you are currently logged into). You can add peers through the Administration > Peers tab (see Designating a Replication Source).

You can use the Add Peer link on the Replication page to go to the Peers page and add a new peer Cloudera Manager server.

Once you have a peer relationship set up with a Cloudera Manager server, you can configure replication of your Hive Metastore data.

  Note:
  • Hive replication between CDH 4.2 or later and CDH 4.0 does not work if the Hive schema contains views.
  • Hive replication with data replication enabled does not work between a source cluster that has encryption enabled and a target cluster running CDH 4.0. This is because the CDH 4.0 client used for replication does not support encryption.
  • Hive replication (even without data replication) does not work between a source cluster running CDH 4.0 and a target cluster that has encryption enabled.
  1. From the Services tab, go to the CDH4 Hive service where you want to host the replicated data.
  2. Click the Replication tab at the top of the page.
  3. Select the Hive service to be the source of the replicated data. If the peer Cloudera Manager Server has multiple CDH4 Hive services (for example, if it is managing multiple CDH4 clusters) you will be able to select the service you want to use as the source. If the peer whose Hive service you want is not listed, click the Add Peer link to go to the Peers page to add a Cloudera Manager peer. When you select a replication source, the Create Replication pop-up opens.
  4. Leave Replicate All checked to replicate all the Hive Metastore databases from the source. To replicate only selected databases, uncheck this option and enter the names of the databases and tables you want to replicate.
    • You can specify multiple databases and tables using the plus symbol to add more rows to the specification.
    • You can specify multiple databases on a single line by separating their names with the "pipe" character. For example:

      mydbname1|mydbname2|mydbname3

    • Regular expressions can be used in either the Database or Table fields (a sketch for checking patterns against sample names follows these steps). For example:

      [\w_]+
        matches any database/table name

      (?!\b(myname)\b).*
        matches any database/table except the one named "myname"

      Database: db1|db2      Table: [\w_]+
        all tables of the db1 and db2 databases

      Database: db1          Table: [\w_]+
      then click the "+" button and enter
      Database: db2          Table: [\w_]+
        an alternate way to get all tables of the db1 and db2 databases
  5. Select the target destination. If only one Hive service managed by Cloudera Manager is available as a target, it is specified as the target automatically. If more than one Hive service is managed by this Cloudera Manager, you can select among them.
  6. Select a schedule: the replication can run immediately, once at a scheduled time in the future, or at regularly scheduled intervals. If you select Once or Recurring, fields appear that let you set the date and time and (if appropriate) the interval between runs.
  7. Uncheck the Replicate HDFS Files checkbox to skip replicating the associated data files; if you uncheck this, only the Hive metadata is replicated. The metadata export and the data files are each replicated to a default location; to specify different locations, change the Export Path and Destination settings under the More Options section, described below.
  8. Use the More Options section to specify an export location, modify the parameters of the MapReduce job that will perform the replication, and set other options. Here you can select a MapReduce service (if there is more than one in your cluster) and change the following parameters:
    • By default, Cloudera Manager exports the Hive metadata to a default HDFS location (/user/${user.name}/.cm/hive) and then imports it from that HDFS location into the target Hive Metastore. To override the default HDFS location for this export file, specify a path in the Export Path field.
    • The Force Overwrite option, if checked, forces overwriting data in the target metastore if there are incompatible changes detected. For example, if the target metastore was modified and a new partition was added to a table, this option would force deletion of that partition, overwriting the table with the version found on the source.
        Important: If the Force Overwrite option is not set and the Hive replication process detects incompatible changes on the source cluster, Hive replication will fail.
    • By default, Cloudera Manager replicates Hive's HDFS data files to a default location (/). To override the default, enter a path in the Destination field.
    • Select the MapReduce service to use for this replication (if there is more than one in your cluster).
    • To specify the user that should run the MapReduce job, use the Run As option. By default, MapReduce jobs run as hdfs; to run the job as a different user, enter that user name here. If you are using Kerberos, you must provide a user name here, and it must be one with an ID greater than 1000 (a sketch for checking this follows these steps).
    • An alternative path for the logs.
    • Limits for the number of map slots and for bandwidth per mapper. The defaults are unlimited.
    • Whether to abort the job on an error (the default is not to abort). Check the checkbox to enable this; if the job aborts, files copied up to that point remain on the destination, but no additional files are copied.
    • Whether to skip checksum checks (default is to perform them).
    • Whether to remove files from the target directory if they have been deleted on the source.
    • Whether to preserve the block size, replication count, and permissions as they exist on the source file system, or to use the settings as configured on the target file system. The default is to preserve these settings as on the source.
        Note: If you leave the setting to preserve permissions, then you must be running as a superuser. You can use the Run As option to ensure that is the case.
    • Whether to generate alerts for various state changes in the replication workflow. You can alert on failure, on start, on success, or when the replication workflow is aborted.
  9. Click Save Schedule to save the replication specs.
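
The regular-expression fields in step 4 can be sanity-checked before saving the schedule. The following is a minimal Python sketch, assuming Cloudera Manager matches each pattern against the full database or table name (full-match semantics); the sample names are hypothetical:

    import re

    # Patterns from step 4, assumed to be matched against the whole name.
    patterns = {
        r"[\w_]+": "any database/table name",
        r"(?!\b(myname)\b).*": 'any name except "myname"',
        r"db1|db2": "exactly db1 or db2",
    }

    # Hypothetical database names to test against.
    names = ["db1", "db2", "myname", "sales_2014"]

    for pattern, meaning in patterns.items():
        matched = [n for n in names if re.fullmatch(pattern, n)]
        print(pattern, "->", meaning, ":", matched)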
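
Likewise, for the Run As option in step 8, a Kerberized cluster requires a user whose ID is greater than 1000. Below is a minimal sketch for verifying a candidate user on a cluster host; the name etl_user is a hypothetical example:

    import pwd

    user = "etl_user"  # hypothetical candidate for the Run As option

    try:
        uid = pwd.getpwnam(user).pw_uid
    except KeyError:
        raise SystemExit("user %r does not exist on this host" % user)

    # Kerberized replication jobs require a user ID greater than 1000.
    if uid > 1000:
        print("%s (uid %d) satisfies the Run As requirement" % (user, uid))
    else:
        print("%s (uid %d) is too low; choose a user with uid > 1000" % (user, uid))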

When saved, the replication job appears in the Replication list, with relevant information about the source and target locations, and the timestamp of the last run and the next scheduled run (if there is a recurring schedule). A scheduled job will show a calendar icon to the left of the job specification. If it is scheduled to run once, the calendar icon will disappear after the job has run.

To specify additional replication tasks, click the Create Replication button that appears once you have added the first replication task.

If the replication failed, the timestamp will appear in red text.

Note that only one replication can occur at a time; if another replication job starts before the previous one has finished, the second one is cancelled.

You can test the replication task without actually transferring data using the "Dry Run" feature:

  • From the Actions menu for the replication task you want to test, click Dry Run.

From the Actions menu for a replication task, in addition to Dry Run you can also:

  • Edit the task configuration
  • Run the task (immediately)
  • Delete the task
  • Disable/Enable the job (available only if the job is on a recurring schedule). When a task is disabled, a Stopped icon replaces the calendar icon, and the job entry appears in grey.

Viewing Replication Job Status

While a run is in progress, the calendar icon turns into a spinner, and each stage of the replication task is indicated in the message after the replication specification.

  • If the replication is successful, the number of files copied is indicated. If a file has not changed at the source since the previous replication, it is not copied again. As a result, after the initial replication run, only a subset of the files may actually be copied, and this is indicated in the success message (a sketch of this skip decision follows this list).
  • The replication task can be aborted. While replication is running (the spinner is visible), click Commands; the page shows the Hive Replication command with another spinner and, next to it, an Abort button. Click Abort to terminate the task. If the remote export task is still running, aborting terminates the remote task as well.
  • If the replication fails, that is indicated and the timestamp will appear in red text.
  • To view more information about completed replication runs, click anywhere in the replication job entry row in the replication list. This displays sub-entries for each past replication run.
  • To view detailed information about a particular past run, click the entry for that replication run. This opens another sub-entry that shows:
    • A result message
    • The start and end time of the replication job.
    • A link to the command details for that replication run.
    • Details about the data that was replicated.
  • When viewing a sub-entry, you can dismiss it by clicking anywhere in its parent entry, or by clicking the return arrow icon at the top left of the sub-entry area.
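
As noted above, a file that has not changed at the source since the previous replication is not copied again, and step 8 lets you skip checksum checks. The following is a minimal Python sketch of that kind of skip decision, assuming the comparison is by file size and checksum; the actual criteria used by Cloudera Manager's copy job may differ:

    import hashlib
    import os

    def checksum(path, chunk=1 << 20):
        # MD5 of a file, read in 1 MiB chunks to bound memory use.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                digest.update(block)
        return digest.hexdigest()

    def should_copy(src, dst, skip_checksum=False):
        # A file is recopied only if it is missing on the target, differs
        # in size, or (unless checksum checks are skipped) differs in
        # content.
        if not os.path.exists(dst):
            return True
        if os.path.getsize(src) != os.path.getsize(dst):
            return True
        if skip_checksum:
            return False  # sizes match and checksum checks are off
        return checksum(src) != checksum(dst)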