Installing and Upgrading Cloudera Data Science Workbench 1.3.x

This topic walks you through the installation and upgrade paths available for Cloudera Data Science Workbench 1.3.x. It also describes the steps needed to configure your cluster gateway hosts and block devices before you begin installing the Cloudera Data Science Workbench parcel or package.

Installing Cloudera Data Science Workbench 1.3.x

You can install Cloudera Data Science Workbench 1.3.x in one of the following ways:
  • Using a Custom Service Descriptor (CSD) and Parcel - Starting with version 1.2.x, Cloudera Data Science Workbench is available as an add-on service for Cloudera Manager 5.13.x. Two files are required for this type of installation: a CSD JAR file that contains all the configuration needed to describe and manage the new Cloudera Data Science Workbench service, and the Cloudera Data Science Workbench parcel. To install this service, first download and copy the CSD file to the Cloudera Manager Server host. Then use Cloudera Manager to distribute the Cloudera Data Science Workbench parcel to the relevant gateway nodes.

    or

  • Using a Package - Alternatively, you can install the Cloudera Data Science Workbench package directly on the CDH cluster's gateway nodes. In this case, the Cloudera Data Science Workbench service will not be available in Cloudera Manager.

To begin the installation process, continue reading Pre-Installation.

Upgrading to the Latest Version of Cloudera Data Science Workbench 1.3.x

Depending on your deployment, choose one of the following upgrade paths:

Airgapped Installations

Organizations sometimes isolate parts of their network from the Internet for security reasons. Isolating network segments helps ensure that valuable data is not compromised by individuals acting maliciously or for personal gain. However, isolated hosts cannot access Cloudera repositories for new installations or upgrades. Starting with version 1.1.1, Cloudera Data Science Workbench supports installation on CDH clusters that are not connected to the Internet.

For CSD-based installs in an airgapped environment, put the Cloudera Data Science Workbench parcel into a new hosted or local parcel repository, and then configure the Cloudera Manager Server to target this newly-created repository.

Rolling Back Cloudera Data Science Workbench

All stateful data for Cloudera Data Science Workbench is stored in the /var/lib/cdsw directory on the master node. The contents of this directory are forward compatible, which is what makes upgrades possible. However, they are not backward compatible. Therefore, to roll back Cloudera Data Science Workbench to a previous version, you must have a backup of the /var/lib/cdsw directory taken before the last upgrade.

In general, the steps required to restore a previous version of Cloudera Data Science Workbench are:
  1. Depending on your deployment, either uninstall the RPM or deactivate the current CDSW parcel in Cloudera Manager.
  2. On the master node, restore your backup copy of /var/lib/cdsw. Any changes made after the backup was taken will be lost.
  3. Install a version of Cloudera Data Science Workbench that is equal to or greater than the version of the /var/lib/cdsw backup.
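
The version constraint in step 3 can be sketched as a small check. This is only an illustration, not part of the product: the function names are hypothetical, and it assumes that versions are plain dotted numbers such as 1.2.2 or 1.3.0.

```python
def parse_version(version: str, width: int = 3) -> tuple:
    """Turn a dotted version string such as '1.3.0' into a tuple like (1, 3, 0).

    Assumption: versions contain only dot-separated integers. Short
    versions are padded with zeros so '1.3' compares equal to '1.3.0'.
    """
    parts = [int(part) for part in version.strip().split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def can_restore(backup_version: str, install_version: str) -> bool:
    """Return True if a /var/lib/cdsw backup can be used with install_version.

    The data directory is forward compatible only, so the installed
    version must be equal to or greater than the backup's version.
    """
    return parse_version(install_version) >= parse_version(backup_version)


if __name__ == "__main__":
    print(can_restore("1.2.2", "1.3.0"))  # backup older than install
    print(can_restore("1.3.0", "1.2.2"))  # backup newer than install
```

Components are compared as integers rather than strings so that a version such as 1.10.0 sorts after 1.9.0.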

Pre-Installation

The rest of this topic describes the steps you should take to review your platforms and configure your hosts before you begin to install Cloudera Data Science Workbench.

  1. Review Requirements and Supported Platforms
  2. Set Up a Wildcard DNS Subdomain
  3. Disable Untrusted SSH Access
  4. Configure Block Devices
  5. Install Cloudera Data Science Workbench

Review Requirements and Supported Platforms

Review the complete list of Cloudera Data Science Workbench 1.3.x Requirements and Supported Platforms before you proceed with the installation.

Set Up a Wildcard DNS Subdomain

Cloudera Data Science Workbench uses subdomains to provide isolation for user-generated HTML and JavaScript, and to route requests between services. To access Cloudera Data Science Workbench, you must configure the wildcard DNS name *.cdsw.<your_domain>.com for the master host as an A record, along with a root entry for cdsw.<your_domain>.com.

For example, if your master IP address is 172.46.47.48, configure two A records as follows:

cdsw.<your_domain>.com.   IN A 172.46.47.48
*.cdsw.<your_domain>.com.   IN A 172.46.47.48

You can also use a wildcard CNAME record if it is supported by your DNS provider.
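
Once the records are in place, you may want to confirm that both the root name and an arbitrary name under the wildcard resolve to the master IP address. The following is a minimal sketch using Python's standard socket module. The function name check_cdsw_dns, the probe label dns-probe, and the domain cdsw.example.com used in the usage note are placeholders chosen for illustration, not names used by Cloudera Data Science Workbench.

```python
import socket
from typing import Callable, Dict


def check_cdsw_dns(
    domain: str,
    master_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
    probe_label: str = "dns-probe",
) -> Dict[str, bool]:
    """Check that the root name and one wildcard-covered name resolve to master_ip.

    domain is the root entry (for example, your cdsw.<your_domain>.com name).
    probe_label is an arbitrary, hypothetical subdomain label; any label
    should match the wildcard record if it is configured correctly.
    resolve can be replaced with a stub to exercise the logic offline.
    """
    results = {}
    for name in (domain, f"{probe_label}.{domain}"):
        try:
            results[name] = resolve(name) == master_ip
        except OSError:
            # socket.gaierror is a subclass of OSError; a failed lookup
            # is reported as a mismatch rather than raised.
            results[name] = False
    return results
```

For example, calling check_cdsw_dns("cdsw.example.com", "172.46.47.48") returns a dictionary that maps each queried name to True or False. Note that socket.gethostbyname returns a single IPv4 address, so this sketch does not cover IPv6 or names that resolve to several addresses.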

Disable Untrusted SSH Access

Cloudera Data Science Workbench assumes that users only access the gateway hosts through the web application. Untrusted users with SSH access to a Cloudera Data Science Workbench host can gain full access to the cluster, including access to other users' workloads. Therefore, untrusted (non-sudo) SSH access to Cloudera Data Science Workbench hosts must be disabled to ensure a secure deployment.

For more information on the security capabilities of Cloudera Data Science Workbench, see the Cloudera Data Science Workbench Security Guide.

Configure Block Devices

Docker Block Device

The Cloudera Data Science Workbench installer will format and mount the Docker block devices on each gateway host. Make sure no important data is stored on these devices, and do not mount them prior to installation.

Every Cloudera Data Science Workbench gateway host must have one or more block devices with at least 500 GB dedicated to storage of Docker images. The Docker block devices store the Cloudera Data Science Workbench Docker images, including the Python, R, and Scala engines. Each engine image can take up 15 GB.
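
The two Docker block device conditions (not mounted, at least 500 GB) can be checked before installation with a sketch like the one below. It makes several assumptions that are not stated in this topic: the 500 GB minimum is treated as the combined size of all listed devices, 1 GB is taken as 10^9 bytes, and the mount table is supplied as text in the format of /proc/mounts on Linux. The function names are hypothetical.

```python
GB = 1000 ** 3  # assumption: decimal gigabytes
MIN_DOCKER_BYTES = 500 * GB


def mounted_devices(mounts_text: str) -> set:
    """Return the device names from the first column of /proc/mounts-style text."""
    return {line.split()[0] for line in mounts_text.splitlines() if line.strip()}


def check_docker_devices(devices: dict, mounts_text: str) -> list:
    """Return a list of problems found; an empty list means the checks passed.

    devices maps a device path (for example '/dev/sdb') to its size in bytes.
    Only exact device names are matched, so a mounted partition of a listed
    disk is not detected by this sketch.
    """
    problems = []
    mounted = mounted_devices(mounts_text)
    for device in devices:
        if device in mounted:
            problems.append(f"{device} is already mounted")
    total = sum(devices.values())
    if total < MIN_DOCKER_BYTES:
        problems.append(
            f"combined size {total / GB:.0f} GB is below the 500 GB minimum"
        )
    return problems
```

On a Linux host, the mount table text could be obtained by reading /proc/mounts, and the device sizes gathered with whatever inventory tool you already use.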

Application Block Device or Mount Point

The master host on Cloudera Data Science Workbench requires at least 500 GB for database and project storage. The capacity you actually need depends on the expected number of users and projects on the cluster. While large data files should be stored on HDFS, it is not uncommon to find gigabytes of data or libraries in individual projects. Running out of storage will cause the application to fail. Cloudera recommends allocating at least 5 GB per project and at least 1 TB of storage in total. Continue to monitor disk space usage and I/O carefully using Cloudera Manager.
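
As a rough sizing aid, the two recommendations (5 GB per project, 1 TB in total) can be combined by taking whichever is larger. This interpretation, the decimal units, and the function name are assumptions made for illustration only.

```python
GB = 1000 ** 3  # assumption: decimal units
TB = 1000 ** 4


def recommended_app_storage_bytes(
    projects: int,
    per_project_bytes: int = 5 * GB,
    floor_bytes: int = 1 * TB,
) -> int:
    """Estimate application storage for a given number of projects.

    Uses the larger of the per-project estimate and the overall floor.
    """
    if projects < 0:
        raise ValueError("projects must not be negative")
    return max(floor_bytes, projects * per_project_bytes)
```

With these defaults, 100 projects stay at the 1 TB floor (100 × 5 GB = 500 GB), while 500 projects call for 2.5 TB.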

Cloudera Data Science Workbench will store all application data at /var/lib/cdsw. In a CSD-based deployment, this location is not configurable. Cloudera Data Science Workbench will assume the system administrator has formatted and mounted one or more block devices to /var/lib/cdsw.

Regardless of the application data storage configuration you choose, /var/lib/cdsw must be stored on a separate block device. Given typical database and user access patterns, an SSD is strongly recommended.
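
One way to confirm that /var/lib/cdsw is not simply a directory on the root filesystem is to compare device IDs. The sketch below uses os.stat from the Python standard library. Comparing against "/" is an assumption about what "separate" means here, and the function name is hypothetical.

```python
import os


def on_separate_device(path: str, reference: str = "/") -> bool:
    """Return True if path is on a different device than reference.

    Compares the st_dev field reported by os.stat. This only shows that
    the two paths are on different filesystems; it does not tell you
    whether the underlying device is an SSD or how it is provisioned.
    """
    return os.stat(path).st_dev != os.stat(reference).st_dev
```

On the master host, on_separate_device("/var/lib/cdsw") would be expected to return True after the block device has been formatted and mounted.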

By default, data in /var/lib/cdsw is not backed up or replicated to HDFS or other nodes. A reliable storage and backup strategy is critical for production installations. See Backup and Disaster Recovery for Cloudera Data Science Workbench for more information.

Install Cloudera Data Science Workbench

To use the Cloudera Manager CSD and parcel to install Cloudera Data Science Workbench, follow the steps at Installation and Upgrade Using Cloudera Manager.

OR

To install the Cloudera Data Science Workbench package on the cluster gateway hosts, follow the steps at Installation and Upgrade Using Packages.