Installing and Upgrading Cloudera Data Science Workbench 1.4.x Using Packages

This topic describes how to install and upgrade the Cloudera Data Science Workbench package on a CDH cluster managed by Cloudera Manager.

Installing Cloudera Data Science Workbench 1.4.x from Packages

Prerequisites

Before you begin installing Cloudera Data Science Workbench, make sure you have completed the steps to configure your hosts and block devices.

Configure Gateway Hosts Using Cloudera Manager

Cloudera Data Science Workbench hosts must be added to your CDH cluster as gateway hosts, with gateway roles properly configured. To configure gateway hosts:
  1. If you have not already done so and plan to use PySpark, install either the Anaconda parcel or Python (versions 2.7.11 and 3.6.1) on your CDH cluster. For more information, see Python Supported Versions.

  2. To support workloads running on CDS 2.x Powered by Apache Spark, you must configure the Spark 2 parcel and the Spark 2 CSD. For instructions, see Installing Cloudera Distribution of Apache Spark 2.

    To use Spark 2, each user must have their own home directory in HDFS (/user/<username>). If you sign in to Hue first, these directories are created for you automatically. Alternatively, cluster administrators can create them with the following commands:
    hdfs dfs -mkdir /user/<username>
    hdfs dfs -chown <username>:<username> /user/<username>
  3. Use Cloudera Manager to add gateway hosts to your CDH cluster.
    1. Create a new host template that includes gateway roles for HDFS, YARN, and Spark 2.
    2. Use the instructions at Adding a Host to the Cluster to add gateway hosts to the cluster. Apply the template created in the previous step to these gateway hosts. If your cluster is kerberized, confirm that the krb5.conf file on your gateway hosts is correct.
  4. Test Spark 2 integration on the gateway hosts.
    1. SSH to a gateway host.
    2. If your cluster is kerberized, run kinit to authenticate to the CDH cluster’s Kerberos Key Distribution Center. The Kerberos ticket you create is not visible to Cloudera Data Science Workbench users.
    3. Submit a test job to Spark 2 by executing the following command:
      spark2-submit --class org.apache.spark.examples.SparkPi \
      --master yarn --deploy-mode client \
      /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-example*.jar 100
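
When an administrator creates the HDFS home directories from step 2 for several users, the two commands can be wrapped in a small loop. The following is a minimal sketch, not part of the product: the function names and the DRY_RUN variable are hypothetical, and the usernames are placeholders. It assumes the hdfs client is on the PATH and that you run it as a user allowed to change ownership in HDFS (typically the HDFS superuser). By default it only prints the commands so that you can review them first.

```shell
#!/usr/bin/env bash
# Sketch: create HDFS home directories for a list of users.
# run, create_hdfs_homes, and DRY_RUN are illustrative names, not CDSW or CDH tooling.

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create /user/<username> for each username passed as an argument.
create_hdfs_homes() {
  for user in "$@"; do
    # -p avoids an error if the directory already exists.
    run hdfs dfs -mkdir -p "/user/${user}"
    run hdfs dfs -chown "${user}:${user}" "/user/${user}"
  done
}

# Placeholder usernames; replace them with real accounts.
create_hdfs_homes alice bob
```

Run the script as is to review the generated commands, then run it again with DRY_RUN=0 to execute them.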

Install Cloudera Data Science Workbench on the Master Node

CDSW 1.4.x is no longer available for installation. Refer to the CDSW documentation for information on supported versions.

(Optional) Install Cloudera Data Science Workbench on Worker Nodes

CDSW 1.4.x is no longer available for installation. Refer to the CDSW documentation for information on supported versions.

Create the Administrator Account

Installation typically takes 30 minutes, although it might take an additional 60 minutes for the R, Python, and Scala engines to be available on all hosts.

After your installation is complete, set up the initial administrator account. Go to the Cloudera Data Science Workbench web application at http://cdsw.<company>.com.

The first account that you create becomes the site administrator. You can now use this account to create a new project and start using the workbench to run data science workloads. For a brief example, see Getting Started with Cloudera Data Science Workbench.

Next Steps

As a site administrator, you can invite new users, monitor resource utilization, secure the deployment, and upload a license key for the product. For more details on these tasks, see the Administration and Security guides.

You can also start using the product by configuring your personal account and creating a new project. For a quickstart that walks you through creating a simple template project, see Getting Started with Cloudera Data Science Workbench. For more details on collaborating with teams, working on projects, and sharing results, see Managing Cloudera Data Science Workbench Users.

Upgrading to the Latest Version of Cloudera Data Science Workbench 1.4.x Using Packages

Before you start upgrading Cloudera Data Science Workbench, read the Cloudera Data Science Workbench Release Notes relevant to the version you are upgrading to.

  1. (Strongly Recommended) Safely stop Cloudera Data Science Workbench. To avoid running into the data loss issue described in TSB-346, run the cdsw_protect_stop_restart.sh script on the master node and follow the sequence of steps as instructed by the script.

    The script first backs up your project files to the specified target folder. It then temporarily moves your project files aside to protect against the data loss condition. At that point, it is safe to stop Cloudera Data Science Workbench. To stop Cloudera Data Science Workbench, run the following command on all Cloudera Data Science Workbench nodes (master and workers):
    cdsw reset

    After Cloudera Data Science Workbench has stopped, press Enter to continue running the script as instructed. It then moves your project files back into place.

  2. (Strongly Recommended) On the master node, back up all your application data that is stored in the /var/lib/cdsw directory, and the configuration file at /etc/cdsw/config/cdsw.conf.
    To create the backup, run the following command on the master host:
    tar cvzf cdsw.tar.gz /var/lib/cdsw/* /etc/cdsw/config/cdsw.conf
  3. (Required for Upgrades from CDSW 1.4.0 - RedHat only) Cloudera Data Science Workbench 1.4.2 (and higher) includes a fix for a slab leak issue found in RedHat kernels. For this fix to take effect, RedHat users must reboot all Cloudera Data Science Workbench hosts before proceeding with the upgrade.

    As a precaution, consult your cluster/IT administrator before you start rebooting hosts.

  4. Uninstall the previous release of Cloudera Data Science Workbench. Perform this step on the master node, as well as all the worker nodes.
    yum remove cloudera-data-science-workbench 
  5. Install the latest version of Cloudera Data Science Workbench on the master node and on all the worker nodes. During the installation process, you might need to resolve certain incompatibilities in cdsw.conf. Even though you will be installing the latest RPM, your previous configuration settings in cdsw.conf will remain unchanged. Depending on the release you are upgrading from, you will need to modify cdsw.conf to ensure it passes the validation checks run by the 1.4.x release.

    To install the latest version of Cloudera Data Science Workbench, follow the same process to install the package as you would for a fresh installation.

    1. Install Cloudera Data Science Workbench on the Master Node
    2. (Optional) Install Cloudera Data Science Workbench on Worker Nodes
  6. Post-Upgrade Tasks for Cloudera Data Science Workbench 1.4.x

    Check for New Base Engine

    If the release you have just upgraded to includes a new version of the base engine image (see release notes), you will need to manually configure existing projects to use the new engine. Cloudera recommends you do so to take advantage of any new features and bug fixes included in the newly released engine.

    To upgrade a project to the new engine, go to the project's Settings > Engine page and select the new engine from the dropdown. If any of your projects are using custom extended engines, you will need to modify them to use the new base engine image.
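
For the backup in step 2 of the upgrade procedure, it can help to write the archive under a timestamped name so that repeated backups do not overwrite each other. The following is a minimal sketch under stated assumptions: backup_paths is a hypothetical helper name, the destination directory in the example is a placeholder, and the destination must have enough free space for the archive. Passing the /var/lib/cdsw directory itself, rather than /var/lib/cdsw/*, also picks up any hidden files at the top level of that directory.

```shell
#!/usr/bin/env bash
# Sketch: archive application data and configuration into one timestamped tarball.
# backup_paths is an illustrative helper, not part of the cdsw command-line tool.

backup_paths() {
  # $1: directory to write the archive to; remaining arguments: paths to archive.
  dest_dir="$1"
  shift
  archive="${dest_dir}/cdsw-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
  # Print the archive path only if tar succeeded.
  tar czf "$archive" "$@" && echo "$archive"
}

# Example (run on the master host; /root is a placeholder destination):
# backup_paths /root /var/lib/cdsw /etc/cdsw/config/cdsw.conf
```

To confirm what was captured before you continue with the upgrade, list the contents of the resulting archive with tar tzf.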