Installing and Upgrading Cloudera Data Science Workbench 1.1.x

This topic describes how to install the Cloudera Data Science Workbench package on a CDH cluster managed by Cloudera Manager. Currently, we do not support a Custom Service Descriptor (CSD) or parcel-based installs.

Airgapped Installations - Sometimes organizations choose to restrict parts of their network from the Internet for security reasons. Isolating segments of a network can provide assurance that valuable data is not being compromised by individuals out of maliciousness or for personal gain. However, in such cases isolated hosts are unable to access Cloudera repositories for new installations or upgrades. Starting with version 1.1.1, Cloudera Data Science Workbench supports installation on CDH clusters that are not connected to the Internet.

The rest of this topic describes how to install and upgrade Cloudera Data Science Workbench for both kinds of deployments: clusters with access to the Internet, and airgapped clusters. While there are no major differences in the installation experience for the two usecases, any instructions specific to airgapped deployments have been noted inline.

Prerequisites

Review the complete list of prerequisites at Cloudera Data Science Workbench 1.1.x Requirements and Supported Platforms before you proceed with the installation.

Installing Cloudera Data Science Workbench 1.1.x from Packages

Set Up a Wildcard DNS Subdomain

Cloudera Data Science Workbench uses subdomains to provide isolation for user-generated HTML and JavaScript, and routing requests between services.. To access Cloudera Data Science Workbench, you must configure the wildcard DNS name *.cdsw.<company>.com for the master host as an A record, along with a root entry for cdsw.<company>.com.

For example, if your master IP address is 172.46.47.48, configure two A records as follows:

cdsw.<company>.com.   IN A 172.46.47.48
*.cdsw.<company>.com.   IN A 172.46.47.48

You can also use a wildcard CNAME record if it is supported by your DNS provider.

Disable Untrusted SSH Access

Cloudera Data Science Workbench assumes that users only access the gateway hosts through the web application. Untrusted users with SSH access to a Cloudera Data Science Workbench host can gain full access to the cluster, including access to other users' workloads. Therefore, untrusted (non-sudo) SSH access to Cloudera Data Science Workbench hosts must be disabled to ensure a secure deployment.

For more information on the security capabilities of Cloudera Data Science Workbench, see the Cloudera Data Science Workbench Security Guide.

Configure Gateway Hosts Using Cloudera Manager

Cloudera Data Science Workbench hosts must be added to your CDH cluster as gateway hosts, with gateway roles properly configured. To configure gateway hosts:
  1. If you have not already done so and plan to use PySpark, install either the Anaconda parcel or Python (versions 2.7.11 and 3.6.1) on your CDH cluster. For more information see, Python Supported Versions.

  2. To support workloads running on Cloudera's Distribution of Apache Spark 2, you must configure the Spark 2 parcel and the Spark 2 CSD. For instructions, see Installing Cloudera Distribution of Apache Spark 2.

    To be able to use Spark 2, each user must have their own /home directory in HDFS. If you sign in to Hue first, these directories will automatically be created for you. Alternatively, you can have cluster administrators create these directories.
    hdfs dfs -mkdir /user/<username>
    hdfs dfs -chown <username>:<username> /user/<username>
  3. Use Cloudera Manager to create add gateway hosts to your CDH cluster.
    1. Create a new host template that includes gateway roles for HDFS, YARN, and Spark 2.
    2. Use the instructions at Adding a Host to the Cluster to add gateway hosts to the cluster. Apply the template created in the previous step to these gateway hosts. If your cluster is kerberized, confirm that the krb5.conf file on your gateway hosts is correct.
  4. Test Spark 2 integration on the gateway hosts.
    1. SSH to a gateway host.
    2. If your cluster is kerberized, run kinit to authenticate to the CDH cluster’s Kerberos Key Distribution Center. The Kerberos ticket you create is not visible to Cloudera Data Science Workbench users.
    3. Submit a test job to Spark 2 by executing the following command:
      spark2-submit --class org.apache.spark.examples.SparkPi 
      --master yarn --deploy-mode client 
      /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-example*.jar 100

Configure Block Devices

Docker Block Device

Cloudera Data Science Workbench installer will format and mount Docker on each gateway host. Do not mount these block devices prior to installation.

Every Cloudera Data Science Workbench gateway host must have one or more block devices with at least 500 GB dedicated to storage of Docker images. The Docker block devices store the Cloudera Data Science Workbench Docker images including the Python, R, and Scala engines. Each engine image can weigh 15GB.

Application Block Device or Mount Point

The master host on Cloudera Data Science Workbench requires at least 500 GB for database and project storage. This recommended capacity is contingent on the expected number of users and projects on the cluster. While large data files should be stored on HDFS, it is not uncommon to find gigabytes of data or libraries in individual projects. Running out of storage will cause the application to fail. Cloudera recommends allocating at least 5 GB per project and at least 1 TB of storage in total. Make sure you continue to carefully monitor disk space usage and I/O using Cloudera Manager.

All application data will be located at /var/lib/cdsw on the master node. If an application block device is specified during initialization, Cloudera Data Science Workbench will format it as ext4 and mount it to /var/lib/cdsw. If no device is explicitly specified during initialization, Cloudera Data Science Workbench will store all data at /var/lib/cdsw and assume the system administrator has formatted and mounted one or more block devices to this location. The second option is recommended for production installations.

Regardless of the application data storage configuration you choose, /var/lib/cdsw must be stored on a separate block device. Given typical database and user access patterns, an SSD is strongly recommended.

By default, data in /var/lib/cdsw is not backed up or replicated to HDFS or other nodes. Reliable storage and backup strategy is critical for production installations. See Backup and Disaster Recovery for Cloudera Data Science Workbench for more information.

Install Cloudera Data Science Workbench on the Master Node

CDSW 1.1.x is no longer available for download. Refer to the CDSW documentation for information on suppported versions.

(Optional) Install Cloudera Data Science Workbench on Worker Nodes

CDSW 1.1.x is no longer available for installation. Refer to the CDSW documentation for information on suppported versions.

Create the Administrator Account

Installation typically takes 30 minutes, although it might take an additional 60 minutes for the R, Python, and Scala engine to be available on all hosts.

After your installation is complete, set up the initial administrator account. Go to the Cloudera Data Science Workbench web application at http://cdsw.<company>.com.

The first account that you create becomes the site administrator. You may now use this account to create a new project and start using the workbench to run data science workloads. For a brief example, see Getting Started with the Cloudera Data Science Workbench.

Next Steps

As a site administrator, you can invite new users, monitor resource utilization, secure the deployment, and upload a license key for the product. For more details on these tasks, see the Administration and Security guides.

You can also start using the product by configuring your personal account and creating a new project. For a quickstart that walks you through creating a simple template project, see Getting Started with Cloudera Data Science Workbench. For more details on collaborating with teams, working on projects, and sharing results, see the Cloudera Data Science Workbench User Guide.

Upgrading to the Latest Version of Cloudera Data Science Workbench 1.1.x

  1. (Strongly Recommended) Safely stop Cloudera Data Science Workbench. To avoid running into the data loss issue described in TSB-346, run the cdsw_protect_stop_restart.sh script on the master node and follow the sequence of steps as instructed by the script.

    The script will first back up your project files to the specified target folder. It will then temporarily move your project files aside to protect against the data loss condition. At that point, it is safe to stop Cloudera Data Science Workbench. To stop Cloudera Data Science Workbench, run the following command on all Cloudera Data science Workbench nodes (master and workers):
    cdsw reset

    After Cloudera Data Science Workbench has stopped, press enter to continue running the script as instructed. It will then move your project files back into place.

  2. (Strongly Recommended) On the master node, backup the contents of the /var/lib/cdsw directory. This is the directory that stores all your application data.
  3. Uninstall the previous release of Cloudera Data Science Workbench. Perform this step on the master node, as well as all the worker nodes.
    yum remove cloudera-data-science-workbench 
  4. Install the latest version of Cloudera Data Science Workbench on the master node and on all the worker nodes. During the installation process, you will need to resolve certain incompatibilities in cdsw.conf. Even though you will be installing the latest RPM, your previous configuration settings in cdsw.conf will remain unchanged. Depending on the release you are upgrading from, you will need to modify cdsw.conf to ensure it passes the validation checks run by the 1.1.x release.
    Key Changes to Note
    • JAVA_HOME is now a required parameter. Make sure you add JAVA_HOME to cdsw.conf before you start Cloudera Data Science Workbench.
    • Previous versions allowed MASTER_IP to be set to a DNS hostname. If you are still using a DNS hostname, switch to an IP address.

    To install the latest version of Cloudera Data Science Workbench, follow the same process to install the package as you would for a fresh installation.

    1. Install Cloudera Data Science Workbench on the Master Node
    2. (Optional) Install Cloudera Data Science Workbench on Worker Nodes.
  5. If the release you have just upgraded to includes a new version of the base engine image (see release notes), you need to manually configure existing projects to use the new engine. Cloudera recommends you do so to take advantage of any new features and bug fixes included in the newly released engine.

    To upgrade a project to the new engine, go to the project's Settings > Engine page and select the new engine from the dropdown. If any of your projects are using custom extended engines, you will need to modify them to use the new base engine image.