Cloudera Data Science Workbench 1.0.x Requirements and Supported Platforms
This topic lists the software and hardware configuration required to successfully install and run Cloudera Data Science Workbench. Cloudera Data Science Workbench does not support hosts or clusters that do not conform to the requirements listed on this page.
Cloudera Manager and CDH Requirements
- CDH 5.7 or higher.
- Cloudera Manager 5.11 or higher. All cluster hosts must be managed by Cloudera Manager.
- Cloudera Data Science Workbench requires Cloudera's Distribution of Apache Spark 2.1.
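The version minimums above can be checked programmatically. The following is a minimal sketch, not a Cloudera tool: the helper names `parse_version` and `meets_minimum` are hypothetical, and you would supply the version strings yourself (for example, copied from the Cloudera Manager UI). It compares versions numerically, because a plain string comparison would wrongly rank "5.7" above "5.11".

```python
# Minimum versions taken from the requirements list above.
MIN_CDH = (5, 7)
MIN_CM = (5, 11)


def parse_version(text):
    """Turn a dotted version string such as '5.11.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(found, minimum):
    """Return True if the version string is at or above the minimum tuple.

    Only as many components as the minimum specifies are compared,
    so '5.11.2' satisfies a minimum of (5, 11).
    """
    return parse_version(found)[:len(minimum)] >= minimum
```

For example, `meets_minimum("5.9.3", MIN_CM)` is false even though the string "5.9.3" sorts after "5.11" alphabetically.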
Operating System Requirements
Cloudera Data Science Workbench is currently supported only on RHEL/CentOS 7.2.
A gateway node that is dedicated to running Cloudera Data Science Workbench must use RHEL/CentOS 7.2 even if the remaining hosts in your cluster are running any of the other supported operating systems. At this time, other distributions are not supported with Cloudera Data Science Workbench.
JDK Requirements
The entire CDH cluster, including Cloudera Data Science Workbench gateway nodes, must use Oracle JDK. OpenJDK is not supported by CDH, Cloudera Manager, or Cloudera Data Science Workbench.
For more specifics on the versions of Oracle JDK recommended for CDH and Cloudera Manager clusters, see the Cloudera Product Compatibility Matrix - Supported JDK Versions.
Networking and Security Requirements
- A wildcard subdomain such as *.cdsw.company.com. Wildcard subdomains are used to provide isolation for user-generated content.
- No pre-existing iptables rules.
Kubernetes makes extensive use of iptables. However, it is difficult to predict how pre-existing iptables rules will interact with the rules inserted by Kubernetes. Therefore, Cloudera recommends that you disable all pre-existing rules before you proceed with the installation.
- SELinux must be disabled.
- No firewall restrictions across Cloudera Data Science Workbench or CDH hosts.
- No multi-homed networks.
- Non-root SSH access is not allowed on Cloudera Data Science Workbench hosts.
Cloudera Data Science Workbench does not support hosts or clusters that do not conform to these restrictions.
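Two of the requirements above, disabled SELinux and no pre-existing iptables rules, can be verified on a host before installation. The sketch below is illustrative only and is not part of Cloudera Data Science Workbench. The function names are hypothetical. It assumes the standard `getenforce` and `iptables -S` commands are available, and it reads "must be disabled" strictly, so a Permissive SELinux mode does not pass. Note that `iptables -S` lists only the filter table by default.

```python
import subprocess


def selinux_disabled(getenforce_output):
    """True only when `getenforce` reports Disabled (not Permissive or Enforcing)."""
    return getenforce_output.strip().lower() == "disabled"


def extra_iptables_rules(iptables_output):
    """Return the rule lines from `iptables -S` output, ignoring chain policies.

    `iptables -S` prints default policies as '-P CHAIN TARGET'. Every other
    non-empty line (for example '-A ...' or '-N ...') is a rule or a
    user-defined chain that already exists on the host.
    """
    lines = [line.strip() for line in iptables_output.splitlines()]
    return [line for line in lines if line and not line.startswith("-P ")]


def run(cmd):
    """Run a command and return its stdout. iptables usually requires root."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def preflight():
    """Print a short report for the local host (call this manually as root)."""
    print("SELinux disabled:", selinux_disabled(run(["getenforce"])))
    rules = extra_iptables_rules(run(["iptables", "-S"]))
    print("Pre-existing iptables rules found:", len(rules))
```

The parsing functions take command output as strings so they can be exercised without root access; only `preflight()` touches the system.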
Recommended Hardware Configuration
Cloudera Data Science Workbench hosts are added to your CDH cluster as gateway hosts. The recommended minimum hardware configuration for the master host is:
- CPU: 16+ CPU (vCPU) cores
- RAM: 32+ GB RAM
- Root Volume: 100+ GB
- Application Block Device or Mount Point (Master Host Only): 500+ GB
- Docker Image Block Device: 500+ GB
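As a rough illustration, the CPU, RAM, and root volume minimums above can be compared against a host's actual resources. This is a sketch under stated assumptions: the function names are hypothetical, `read_local_specs` works only on Linux, and the block device sizes are not checked. Because the operating system reports slightly less memory than is physically installed, a host sold as "32 GB" may measure just under the threshold.

```python
import os
import shutil

# Recommended minimums from the list above.
MIN_CPU_CORES = 16
MIN_RAM_GB = 32
MIN_ROOT_GB = 100


def hardware_shortfalls(cpu_cores, ram_gb, root_gb):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if cpu_cores < MIN_CPU_CORES:
        problems.append("CPU: %d cores, need %d+" % (cpu_cores, MIN_CPU_CORES))
    if ram_gb < MIN_RAM_GB:
        problems.append("RAM: %.1f GB, need %d+" % (ram_gb, MIN_RAM_GB))
    if root_gb < MIN_ROOT_GB:
        problems.append("Root volume: %.1f GB, need %d+" % (root_gb, MIN_ROOT_GB))
    return problems


def read_local_specs():
    """Read (cpu_cores, ram_gb, root_gb) for the local Linux host."""
    gib = 1024 ** 3
    cpu = os.cpu_count() or 0
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    root_bytes = shutil.disk_usage("/").total
    return cpu, ram_bytes / gib, root_bytes / gib
```

On a candidate host, `hardware_shortfalls(*read_local_specs())` returns the list of unmet minimums.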
Python Supported Versions
The default Cloudera Data Science Workbench engine (Base Image Version 1) includes Python 2.7.11 and Python 3.6.1. To use PySpark with lambda functions that run within the CDH cluster, the Spark executors must have access to a matching version of Python. For many common operating systems, the default system Python will not match the minor release of Python included in Data Science Workbench.
To ensure that the Python versions match, Python can either be installed on every CDH node or made available per job run using Spark’s ability to distribute dependencies. Given the size of a typical isolated Python environment and the desire to avoid repeated uploads from gateway hosts, Cloudera recommends installing Python 2.7 and 3.6 on the cluster if you are using PySpark with lambda functions. You can install Python 2.7 and 3.6 on the cluster using any method and set the corresponding PYSPARK_PYTHON variable in your project.
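The following sketch shows one way to set PYSPARK_PYTHON from within a project and to check that two Python versions share the same minor release. The interpreter path and the helper name `same_minor_release` are assumptions for illustration; substitute the location where Python is actually installed on your cluster hosts.

```python
import os
import sys

# Hypothetical install location; use the path where the matching Python
# release is installed on every cluster host.
EXECUTOR_PYTHON = "/usr/local/bin/python3.6"


def same_minor_release(driver_version, executor_version):
    """Compare the major.minor parts of two version strings such as '3.6.1'."""
    return driver_version.split(".")[:2] == executor_version.split(".")[:2]


# PYSPARK_PYTHON must be set before the Spark session or context is created,
# because Spark reads it when launching the Python workers.
os.environ["PYSPARK_PYTHON"] = EXECUTOR_PYTHON

# Version of the Python interpreter running this session (the driver side).
driver_version = "{}.{}.{}".format(*sys.version_info[:3])
```

To compare against the cluster, obtain the executor-side version separately (for example, by running the interpreter at `EXECUTOR_PYTHON` with `--version` on a cluster host) and pass both strings to `same_minor_release`.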
Anaconda - Continuum Analytics and Cloudera have partnered to create an Anaconda parcel for CDH to enable simple distribution, installation, and management of popular Python packages and their dependencies. The public Anaconda parcel ships Python 2.7.11. Note that the Anaconda parcel is not directly supported by Cloudera, and no publicly available parcel exists for Python 3.6. For an example of distributing Python dependencies dynamically, see Example: Distributing Dependencies on a PySpark Cluster.
Docker and Kubernetes Support
Supported Browsers
- Chrome (latest stable version)
- Firefox (latest released version and latest ESR version)
- Safari 9+
- Internet Explorer (IE) 11+
Recommended Configuration on Amazon Web Services (AWS)
On AWS, Cloudera Data Science Workbench must be used with persistent/long-running Apache Hadoop clusters only.
- For instructions on deploying CDH and Cloudera Manager on AWS, refer to the Cloudera AWS Reference Architecture document.
- Use Cloudera Director to orchestrate operations. Use Cloudera Manager to monitor the cluster.
- No security group or network restrictions between hosts.
- HTTP connectivity to the corporate network for browser access. Proxies and manual SSH tunnels cannot be used.
- Recommended Instance Types
For instance size, bigger is better. That is, one m4.16xlarge is better than four m4.4xlarge hosts. AWS pricing scales linearly, and larger instances have more EBS bandwidth.
- 100 GB root volume block device (gp2) on all hosts
- 500 GB Docker block devices (gp2) on all hosts
- 1 TB Application block device (io1) on master host