Cloudera Data Science Workbench 1.4.x Requirements and Supported Platforms

This topic lists the software and hardware configuration required to successfully install and run Cloudera Data Science Workbench. Cloudera Data Science Workbench does not support hosts or clusters that do not conform to the requirements listed on this page.

Cloudera Manager and CDH Requirements

Cloudera Data Science Workbench 1.4.x is supported on the following versions of CDH and Cloudera Manager:
  • CDH 5.7 or higher 5.x versions.

  • Cloudera Manager:
    • CSD-based deployments: Cloudera Manager 5.13 or higher 5.x versions.
    • Package-based deployments: Cloudera Manager 5.11 or higher 5.x versions.

    All cluster hosts must be managed by Cloudera Manager. Note that all Cloudera Data Science Workbench administrative tasks require root access to the cluster's gateway hosts where Cloudera Data Science Workbench is installed. Therefore, Cloudera Data Science Workbench does not support single-user mode installations.

  • Cloudera Distribution of Apache Spark 2.1 and higher.
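
The version rules above can be expressed as a small compatibility check. The following is a minimal sketch, not a Cloudera tool: the function names are hypothetical, and it assumes plain dotted numeric versions such as 5.13.1 (vendor suffixes are not handled).

```python
def parse_version(text):
    """Turn a dotted version such as '5.13.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum versions taken from the list above.
MIN_CDH = (5, 7)
MIN_CM = {"csd": (5, 13), "package": (5, 11)}
MIN_SPARK2 = (2, 1)


def is_supported(cdh, cm, spark2, deployment):
    """Return True if the versions meet the 1.4.x requirements.

    deployment is 'csd' or 'package' (hypothetical labels for this sketch).
    """
    cdh_v = parse_version(cdh)
    cm_v = parse_version(cm)
    spark_v = parse_version(spark2)
    # "or higher 5.x versions" means the major version must stay at 5.
    cdh_ok = cdh_v[0] == 5 and cdh_v >= MIN_CDH
    cm_ok = cm_v[0] == 5 and cm_v >= MIN_CM[deployment]
    spark_ok = spark_v >= MIN_SPARK2
    return cdh_ok and cm_ok and spark_ok
```

For example, is_supported("5.13.1", "5.12.0", "2.2.0", "csd") is False because CSD-based deployments need Cloudera Manager 5.13 or higher, while the same versions pass for a package-based deployment.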

Operating System Requirements

Cloudera Data Science Workbench 1.4.x is supported on the following operating systems:
Operating System | Versions | Notes
RHEL / CentOS / Oracle Linux RHCK | 7.2, 7.3, 7.4 | When IPv6 is disabled, CDSW installations on version 7.3 fail due to an issue in kernel versions 3.10.0-514 through 3.10.0-693. For details, see https://access.redhat.com/solutions/3039771.
Oracle Linux (UEK - default) | 7.3 | -
SUSE Linux Enterprise Server (SLES) | 12 SP2, 12 SP3 | -

A gateway node that is dedicated to running Cloudera Data Science Workbench must run one of the supported operating system versions listed above, even if the remaining CDH hosts in your cluster run other operating systems supported by Cloudera Enterprise.

Cloudera Data Science Workbench publishes placeholder parcels for other operating systems as well. These parcels do not work; they are included only to support mixed-OS clusters.
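
To see whether a host falls into the kernel range named in the RHEL / CentOS note, you can compare its kernel release string against that range. This is a minimal sketch under two assumptions: the range is treated as inclusive at both ends, and the release string has the usual form returned by platform.release(), for example 3.10.0-514.el7.x86_64. The function name is hypothetical.

```python
import re

# Kernel range from the note above: 3.10.0-514 through 3.10.0-693
# (treated as inclusive here, which is an assumption).
AFFECTED_BASE = "3.10.0"
AFFECTED_BUILDS = range(514, 694)


def kernel_affected(release):
    """Return True if a kernel release string falls in the affected range."""
    match = re.match(r"(\d+\.\d+\.\d+)-(\d+)", release)
    if match is None:
        return False
    base, build = match.group(1), int(match.group(2))
    return base == AFFECTED_BASE and build in AFFECTED_BUILDS
```

On a live host, call kernel_affected(platform.release()) after importing platform. A True result matters only when IPv6 is disabled, as described in the note.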

JDK Requirements

The entire CDH cluster, including Cloudera Data Science Workbench gateway nodes, must use Oracle JDK. OpenJDK is not supported by CDH, Cloudera Manager, or Cloudera Data Science Workbench.

For Red Hat/CentOS deployments in particular, Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction must be enabled on the Cloudera Data Science Workbench gateway nodes.

For more specifics on the versions of Oracle JDK recommended for CDH and Cloudera Manager clusters, and instructions on how to install the Java Cryptography Extension, see the Cloudera Product Compatibility Matrix - Supported JDK Versions.

JDK 8 Requirement for Spark 2.2

CSD-based deployments:

On CSD-based deployments, Cloudera Manager automatically detects the path and version of Java installed on Cloudera Data Science Workbench gateway hosts. You do not need to explicitly set the value for JAVA_HOME unless you want to use a custom location, use JRE, or in the case of Spark 2, force Cloudera Manager to use JDK 1.8 as explained below.

To upgrade your entire CDH cluster to JDK 1.8, see Upgrading to Oracle JDK 1.8.

Package-based deployments:

Set JAVA_HOME to the JDK 8 path in cdsw.conf during the installation process. If you need to modify JAVA_HOME after installation, restart the master and worker nodes for the change to take effect.
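
For package-based deployments, a quick sanity check of the configured JAVA_HOME might look like the following sketch. It assumes cdsw.conf uses shell-style KEY="value" lines, and that the JDK reports its version in the familiar java version "1.8.0_NNN" form. Both function names are hypothetical.

```python
import re


def read_java_home(conf_text):
    """Return the JAVA_HOME value from cdsw.conf-style text, or None."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "JAVA_HOME":
            # Drop surrounding single or double quotes, if any.
            return value.strip().strip("\"'")
    return None


def is_jdk8(version_output):
    """Return True if `java -version` output reports a 1.8 release."""
    return re.search(r'version "1\.8\.', version_output) is not None
```

Note that java -version writes to standard error, so capture that stream if you run the command from a script.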

Networking and Security Requirements

  • A wildcard subdomain such as *.cdsw.company.com. Wildcard subdomains are used to provide isolation for user-generated content.
  • Disable all pre-existing iptables rules. While Kubernetes makes extensive use of iptables, it’s difficult to predict how pre-existing iptables rules will interact with the rules inserted by Kubernetes. Therefore, Cloudera recommends you use the following commands to disable all pre-existing rules before you proceed with the installation.
    sudo iptables -P INPUT ACCEPT
    sudo iptables -P FORWARD ACCEPT
    sudo iptables -P OUTPUT ACCEPT
    sudo iptables -t nat -F
    sudo iptables -t mangle -F
    sudo iptables -F
    sudo iptables -X
  • Cloudera Data Science Workbench sets the following sysctl options in /etc/sysctl.d/k8s.conf:
    • net.bridge.bridge-nf-call-iptables=1
    • net.bridge.bridge-nf-call-ip6tables=1
    • net.ipv4.ip_forward=1
    Underlying components of Cloudera Data Science Workbench (Docker, Kubernetes, and NFS) require these options to work correctly. Make sure they are not overridden by high-priority configuration such as /etc/sysctl.conf.
  • SELinux must either be disabled or run in permissive mode.
  • Multi-homed networks are supported only with Cloudera Data Science Workbench 1.2.2 (and higher).
  • No firewall restrictions across Cloudera Data Science Workbench or CDH hosts.
  • Non-root SSH access is not allowed on Cloudera Data Science Workbench hosts.
  • localhost must resolve to 127.0.0.1.
  • Cloudera Data Science Workbench does not support DNS servers running on 127.0.0.1:53. This IP address resolves to the container localhost within Cloudera Data Science Workbench containers. As a workaround, use either a non-loopback address or a remote DNS server.

Cloudera Data Science Workbench does not support hosts or clusters that do not conform to these restrictions.
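
Two of the requirements above lend themselves to a scripted pre-installation check: the sysctl options must not be overridden, and the DNS server must not listen on a loopback address. The sketch below works on file contents passed in as text, so it makes no assumptions about where the files live. The function names are hypothetical.

```python
import ipaddress

# Options that Cloudera Data Science Workbench sets in /etc/sysctl.d/k8s.conf.
REQUIRED_SYSCTL = {
    "net.bridge.bridge-nf-call-iptables": "1",
    "net.bridge.bridge-nf-call-ip6tables": "1",
    "net.ipv4.ip_forward": "1",
}


def parse_sysctl(text):
    """Parse 'key = value' lines, ignoring blank lines and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def overridden_options(sysctl_conf_text):
    """Return required options that this file sets to a different value."""
    found = parse_sysctl(sysctl_conf_text)
    return sorted(
        key for key, value in REQUIRED_SYSCTL.items()
        if key in found and found[key] != value
    )


def loopback_nameservers(resolv_conf_text):
    """Return nameserver entries that point at a loopback address."""
    bad = []
    for raw in resolv_conf_text.splitlines():
        fields = raw.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            try:
                if ipaddress.ip_address(fields[1]).is_loopback:
                    bad.append(fields[1])
            except ValueError:
                continue  # not a plain IP address; skip it
    return bad
```

Typical usage is to pass the contents of /etc/sysctl.conf to overridden_options and the contents of /etc/resolv.conf to loopback_nameservers; both should return an empty list. The localhost requirement can be checked separately with socket.gethostbyname("localhost"), which should return 127.0.0.1.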

Recommended Hardware Configuration

Cloudera Data Science Workbench hosts are added to your CDH cluster as gateway hosts. The recommended minimum hardware configuration for the master host is:

  • CPU: 16+ CPU (vCPU) cores

  • RAM: 32+ GB RAM

  • Disk
    • Root Volume: 100+ GB.

      The Cloudera Data Science Workbench installer temporarily decompresses the engine image file located in /etc/cdsw/images to the /var/lib/docker/tmp/ directory. If you are going to partition the root volume, make sure you allocate at least 20 GB to /var/lib/docker/tmp so that the installer can proceed without running out of space.

    • Application Block Device or Mount Point (Master Host Only): 1 TB
    • Docker Image Block Device: 1 TB
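
The master host minimums above can be captured as data and compared against an inventory record. This is an illustrative sketch only: the dictionary keys are hypothetical names, and 1 TB is taken as 1000 GB.

```python
# Recommended minimums for the master host, from the list above.
# Key names are hypothetical; 1 TB is assumed to mean 1000 GB.
MASTER_MINIMUMS = {
    "cpu_cores": 16,
    "ram_gb": 32,
    "root_volume_gb": 100,
    "application_device_gb": 1000,
    "docker_device_gb": 1000,
}


def shortfalls(host):
    """Return {resource: (actual, minimum)} for every value below the minimum."""
    return {
        name: (host.get(name, 0), minimum)
        for name, minimum in MASTER_MINIMUMS.items()
        if host.get(name, 0) < minimum
    }
```

An empty result means the host meets every recommended minimum. A missing key is treated as 0 and therefore reported.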

Scaling Guidelines

Nodes can be added to and removed from a Cloudera Data Science Workbench deployment without interrupting jobs already scheduled on existing hosts, so increasing capacity based on observed usage is straightforward. Cloudera recommends allocating at least 1 CPU core and 2 GB of RAM per concurrent session or job. CPU can burst above a 1 CPU core share when spare resources are available, so a 1 CPU core allocation is often adequate for light workloads. Allocating less than 2 GB of RAM can lead to out-of-memory errors for many applications.

As a general guideline, Cloudera recommends nodes with between 60 GB and 256 GB of RAM and between 16 and 48 cores. This provides a useful range of options for end users. SSDs are strongly recommended for application data storage; standard HDDs can result in poor application performance.

For some data science and machine learning applications, users can collect a significant amount of data in memory within a single R or Python process, or use a significant amount of CPU resources that cannot be easily distributed into the CDH cluster. If individual users frequently run larger workloads or run workloads in parallel over long durations, increase the total resources accordingly. Understanding your users' concurrent workload requirements or observing actual usage is the best approach to scaling Cloudera Data Science Workbench.
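
As a rough worked example of the per-session guideline (1 CPU core and 2 GB of RAM per concurrent session or job), the following sketch estimates how many minimum-size sessions a set of nodes can host. It deliberately ignores resources reserved for the operating system and platform services, which the source does not quantify, so treat the result as an upper bound. The function names are hypothetical.

```python
# Per-session minimums from the guideline above.
CORES_PER_SESSION = 1
RAM_GB_PER_SESSION = 2


def max_concurrent_sessions(cores, ram_gb):
    """Upper bound on minimum-size sessions for one node."""
    return min(cores // CORES_PER_SESSION, ram_gb // RAM_GB_PER_SESSION)


def cluster_capacity(nodes):
    """Sum the per-node bound over (cores, ram_gb) pairs."""
    return sum(max_concurrent_sessions(cores, ram_gb) for cores, ram_gb in nodes)
```

A 16-core node with 64 GB of RAM is CPU-bound at 16 sessions, while a 48-core node with 60 GB of RAM is memory-bound at 30. Sessions that request more than the minimum reduce these numbers proportionally.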

Python Supported Versions

The default Cloudera Data Science Workbench engine (Base Image Version 1) includes Python 2.7.11 and Python 3.6.1. To use PySpark with lambda functions that run within the CDH cluster, the Spark executors must have access to a matching version of Python. For many common operating systems, the default system Python will not match the minor release of Python included in Data Science Workbench.

To ensure that the Python versions match, Python can either be installed on every CDH node or made available per job run using Spark’s ability to distribute dependencies. Given the size of a typical isolated Python environment and the desire to avoid repeated uploads from gateway hosts, Cloudera recommends installing Python 2.7 and 3.6 on the cluster if you are using PySpark with lambda functions.

You can install Python 2.7 and 3.6 on the cluster using any method and set the corresponding PYSPARK_PYTHON environment variable in your project. Cloudera Data Science Workbench 1.3 (and higher) includes a separate environment variable for Python 3 sessions called PYSPARK3_PYTHON. Python 2 sessions continue to use the default PYSPARK_PYTHON variable. This allows you to run Python 2 and Python 3 sessions in parallel without either variable overriding the other.
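
The variable selection and the version-matching rule can be summarized in a few lines. This is a sketch with hypothetical helper names; it assumes "matching version" means the same major.minor release, as the reference to the minor release above suggests.

```python
def pyspark_variable(session_major):
    """Return the environment variable used for a session's Python major version."""
    if session_major == 3:
        return "PYSPARK3_PYTHON"
    if session_major == 2:
        return "PYSPARK_PYTHON"
    raise ValueError("unsupported Python major version: %r" % (session_major,))


def same_minor_release(session_version, executor_version):
    """Compare the major.minor parts of two versions such as '3.6.1'."""
    return session_version.split(".")[:2] == executor_version.split(".")[:2]
```

Inside a session, sys.version_info.major gives the value to pass to pyspark_variable. The executor version has to come from the interpreter installed on the cluster hosts, for example from the output of its --version flag.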

Anaconda - Continuum Analytics and Cloudera have partnered to create an Anaconda parcel for CDH to enable simple distribution, installation, and management of popular Python packages and their dependencies. The public Anaconda parcel ships Python 2.7.11. Note that the Anaconda parcel is not directly supported by Cloudera, and no publicly available parcel exists for Python 3.6. For an example of distributing Python dependencies dynamically, see Example: Distributing Dependencies on a PySpark Cluster.

Docker and Kubernetes Support

Cloudera Data Science Workbench supports only the versions of Docker and Kubernetes that are shipped with each release. Upgrading Docker or Kubernetes, or running on third-party Kubernetes clusters, is not supported.

Supported Browsers

  • Chrome (latest stable version)
  • Firefox (latest released version and latest ESR version)
  • Safari 9+
  • Internet Explorer (IE) 11+

Cloudera Director Support (AWS and Azure Only)

Starting with Cloudera Data Science Workbench 1.4.x, you can use Cloudera Director to deploy clusters with Cloudera Data Science Workbench.

Cloudera Director support is available for the following platforms:
  • Amazon Web Services (AWS) - Cloudera Director 2.6.0 (and higher)
  • Microsoft Azure - Cloudera Director 2.7 (and higher)

  • Cloudera Manager 5.13.1 (and higher)
  • CSD-based Cloudera Data Science Workbench 1.2.x (and higher)

Deploying Cloudera Data Science Workbench with Cloudera Director

Points to note when using Cloudera Director to install Cloudera Data Science Workbench:
  • (Required for Director 2.6) Before you run the command to bootstrap a new cluster, set the lp.normalization.mountAllUnmountedDisksRequired property to false in the Cloudera Director server's application.properties file, and then restart Cloudera Director.

    Higher versions of Cloudera Director do not require this step. Cloudera Director 2.7 (and higher) includes an instance-level setting called mountAllUnmountedDisks that must be set to false as demonstrated in the following sample configuration files.

  • Depending on your cloud platform, you can use one of the following sample configuration files to deploy a Cloudera Manager cluster with Cloudera Data Science Workbench.

    Note that these sample files are tailored to Director 2.7 (and higher) and they install a very limited CDH cluster with just the following services: HDFS, YARN, and Spark 2. You can extend them as needed to match your use case.

Recommended Configuration on Amazon Web Services (AWS)

On AWS, Cloudera Data Science Workbench must be used with persistent/long-running Apache Hadoop clusters only.

CDH and Cloudera Manager Hosts
Cloudera Data Science Workbench Hosts
  • Operations
    • Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
  • Networking
    • No security group or network restrictions between hosts.
    • HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
  • Recommended Instance Types
    • m4.4xlarge–m4.16xlarge

      In this case, bigger is better. That is, one m4.16xlarge is better than four m4.4xlarge hosts. AWS pricing scales linearly, and larger instances have more EBS bandwidth.

  • Storage
    • 100 GB root volume block device (gp2) on all hosts
    • 500 GB Docker block devices (gp2) on all hosts
    • 1 TB Application block device (io1) on master host

Recommended Configuration on Microsoft Azure

CDH and Cloudera Manager Hosts
Cloudera Data Science Workbench Hosts
  • Operations
    • Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
  • Networking
    • No security group or network restrictions between hosts.
    • HTTP connectivity to the corporate network for browser access. Do not use proxies or manual SSH tunnels.
  • Recommended Instance Types
    • DS13-DS14 v2 instances on all hosts.
  • Storage
    • P30 premium storage for the Application and Docker block devices.

      Cloudera Data Science Workbench requires premium disks for its block devices on Azure. Standard disks can lead to unacceptable performance even on small clusters.