Backup and Disaster Recovery for Cloudera Data Science Workbench

All application data for Cloudera Data Science Workbench, including project files and database state, is stored on the master node at /var/lib/cdsw. Given typical access patterns, it is strongly recommended that /var/lib/cdsw be stored on a dedicated SSD block device or SSD RAID configuration. Because application data is not replicated to HDFS or backed up by default, site administrators must implement their own backup strategy to meet their disaster recovery requirements.
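
As a quick sanity check on the storage recommendation above, the following shell sketch tests whether a directory is its own mount point, which is a reasonable proxy for "stored on a dedicated block device". The function name is_dedicated_mount is hypothetical and is not part of Cloudera Data Science Workbench. The check assumes a POSIX df and does not verify that the underlying device is an SSD.

```shell
#!/bin/sh
# Hypothetical helper (not a CDSW command): succeeds when the given
# directory is itself a mount point, i.e. it has its own filesystem
# instead of sharing one with its parent directory.
is_dedicated_mount() {
    dir=$1
    # df -P prints a header line, then one line for the filesystem that
    # holds $dir; the sixth field of that line is the mount point.
    mount_point=$(df -P "$dir" | awk 'NR == 2 { print $6 }')
    [ "$mount_point" = "$dir" ]
}
```

For example, is_dedicated_mount /var/lib/cdsw || echo "warning: /var/lib/cdsw shares a filesystem with its parent" prints a warning when the directory is not mounted separately.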

Cloudera strongly recommends taking backups both on a regular schedule and before every upgrade. Cloudera is not responsible for any data loss.

Creating a Backup

  1. Stop Cloudera Data Science Workbench, following the instructions for your version.

    Cloudera Data Science Workbench 1.4.0 or lower

    Do not stop or restart Cloudera Data Science Workbench without using the cdsw_protect_stop_restart.sh script. This is to help avoid the data loss issue detailed in TSB-346.

    Run the script on the master node, and stop Cloudera Data Science Workbench (using the instructions below) only when the script instructs you to do so. Then proceed to step 2 of this process.

    Cloudera Data Science Workbench 1.4.2 or higher

    Depending on your deployment, use one of the following sets of instructions to stop the application.

    To stop Cloudera Data Science Workbench:
    • CSD - Log in to Cloudera Manager. On the Home > Status tab, click the dropdown arrow to the right of the CDSW service and select Stop from the dropdown menu. Wait for the action to complete.

      OR

    • RPM - Run the following command on the master node.
      cdsw stop
  2. Once the application has stopped, create the backup by running the following command on the master node.
    tar cvzf cdsw.tar.gz /var/lib/cdsw/*
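  3. (Optional) If needed, use the following command to unpack the tar bundle. Because tar strips the leading / from file names when it creates the archive, the files are stored under var/lib/cdsw/ and must be unpacked relative to the root directory to land back in /var/lib/cdsw.
    tar xvzf cdsw.tar.gz -C /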
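
For regular backups, the archive step above can be wrapped in a small script that timestamps each archive and checks that it is readable. The following is a minimal sketch rather than a Cloudera-provided tool: the function name backup_cdsw, its arguments, and the archive naming scheme are all assumptions made for illustration. It must only be run after the application has been stopped as described in step 1.

```shell
#!/bin/sh
# Hypothetical helper (not a CDSW command): archive a data directory
# into a timestamped tarball and verify that the result is readable.
# Usage: backup_cdsw [source_dir] [dest_dir]
backup_cdsw() {
    src=${1:-/var/lib/cdsw}
    dest=${2:-.}
    archive="$dest/cdsw-$(date +%Y%m%d-%H%M%S).tar.gz"

    # Archiving the directory itself (rather than a /* glob) also picks
    # up hidden top-level entries. tar drops the leading / from member
    # names and reports that on stderr; this is expected.
    tar czf "$archive" "$src" || return 1

    # Listing the archive forces tar to read it end to end, so a
    # truncated or corrupt file is caught here.
    tar tzf "$archive" > /dev/null || return 1

    # Print the archive path so callers can copy it off the master node.
    printf '%s\n' "$archive"
}
```

For example, backup_cdsw /var/lib/cdsw /mnt/backups writes an archive named like cdsw-20240101-120000.tar.gz into /mnt/backups, where /mnt/backups stands in for whatever backup location your site uses. Keeping the destination on a different device from /var/lib/cdsw means a single disk failure cannot take out both the data and its backup.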