Cloudera Data Science Workbench Engines
In the context of Cloudera Data Science Workbench, engines are responsible for running data science workloads and intermediating access to the underlying CDH cluster. This topic gives you an overview of engines in Cloudera Data Science Workbench and walks you through some of the ways you can customize engine environments to meet the requirements of your users and projects.
Basic Concepts and Terminology
- Base Engine Image
The base engine image is a Docker image that contains all the building blocks needed to launch a Cloudera Data Science Workbench session and run a workload. It consists of kernels for Python, R, and Scala, along with some additional libraries that can be used to run common data analytics operations. When you launch a session to run a project, an engine is started as a container from this image. The base image itself is built and shipped along with Cloudera Data Science Workbench.
New versions of the base engine image are released sporadically. However, existing projects are not automatically upgraded to use the new engine images. Older images are retained to ensure you are able to test code compatibility with the new engine before upgrading to it manually.
For more details on the libraries shipped within the base engine image, see Cloudera Data Science Workbench Engine Versions and Packaging.
- Engine
The term engine refers to a virtual machine-style environment that is created when you run a project (via session or job) in Cloudera Data Science Workbench. You can use an engine to run R, Python, and Scala workloads on data stored in the underlying CDH cluster.
Cloudera Data Science Workbench allows you to run code in one of two ways, using either a session or a job. A session is a way to interactively launch an engine and execute code, whereas a job lets you batch process those actions and can be scheduled to run on a recurring basis. Each session and job launches its own engine that lives as long as the workload is running (or until it times out).
A running engine will include the following components:
- Kernel
Each engine runs a kernel with an R, Python, or Scala process that can be used to execute code within the engine. The kernel launched depends on the option (Python 2/3, PySpark, R, or Scala) you select when you launch the session or configure a job.
The Python kernel is based on the Jupyter IPython kernel, the R kernel has been custom-made for CDSW, and the Scala kernel is based on the Apache Toree kernel.
- Project Filesystem Mount
Cloudera Data Science Workbench uses a persistent filesystem to store project files such as user code, installed libraries, or even small data files. Project files are stored on the master node at /var/lib/cdsw/current/projects.
Every time you launch a new session or run a job for a project, a new engine is created and the project filesystem is mounted into the engine's environment (at /home/cdsw). Once the session/job ends, the only project artifacts that remain are a log of the workload you've run, and any files that were generated or modified, including any libraries you might have installed. All of the installed dependencies will persist through the lifetime of the project. The next time you launch a session/job for the same project, those dependencies will be mounted into the engine environment along with the rest of the project filesystem.
- CDH and Host Mounts
To ensure that each engine is able to access the CDH cluster, a number of folders are mounted from the CDSW gateway host into the engine's environment. For example, on a CSD deployment, this includes the path to the parcel repository (/opt/cloudera), client configurations for HDFS, Spark, and YARN, and the host's JAVA_HOME.
Cloudera Data Science Workbench works out-of-the-box for CDH clusters that use the default file system layouts configured by Cloudera Manager. If you have customized your CDH cluster's filesystem layout (for example, modified the CDH parcel directory) or if there are any other files on the hosts that should be mounted into the engines, use the Site Administration panel to include them.
For detailed instructions, see CDH Parcel Directory and Host Mounts.
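As a quick sanity check from inside a running session, the following minimal sketch reports which expected mount points are visible to the engine. The helper name check_mounts is hypothetical, and the paths checked are only the ones mentioned above (/opt/cloudera and JAVA_HOME); a customized deployment would likely need a different list.

```python
import os


def check_mounts(paths):
    """Return a dict mapping each path to True if it exists as a directory."""
    return {path: os.path.isdir(path) for path in paths}


if __name__ == "__main__":
    # /opt/cloudera is the default parcel directory mentioned above.
    candidates = ["/opt/cloudera"]

    # JAVA_HOME is read from the environment; it may be unset outside an engine.
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        candidates.append(java_home)

    for path, present in check_mounts(candidates).items():
        print(f"{path}: {'mounted' if present else 'missing'}")
```

If a path you expect is reported as missing, that is a hint that it may need to be added as a host mount by a site administrator.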
This section describes how you can configure engine environments to meet the requirements of a project. You can do this by using environment variables, by installing any required dependencies directly to the project, or by creating a custom engine that is built with all of the project's required dependencies.
Environment variables help you customize engine environments, both globally and for individual projects/jobs. For example, if you need to configure a particular timezone for a project, or increase the length of the session/job timeout windows, you can use environment variables to do so. Environment variables can also be used to assign variable names to secrets such as passwords or authentication tokens, so that you do not have to include these directly in the code.
For a list of the environment variables you can configure and instructions on how to configure them, see Engine Environment Variables.
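For instance, project code can read a secret from the engine's environment rather than embedding it in a script. The sketch below uses only the Python standard library; the helper get_secret and the variable name DB_PASSWORD are hypothetical and stand in for whatever you configure for your project.

```python
import os


def get_secret(name, default=None):
    """Read a value from the engine's environment instead of hard-coding it.

    Raises KeyError if the variable is unset and no default is given.
    """
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"Environment variable {name!r} is not set")
    return value


if __name__ == "__main__":
    # DB_PASSWORD is a hypothetical name; it would be defined in the
    # project or site-wide environment variable settings, not in code.
    try:
        password = get_secret("DB_PASSWORD")
        print("DB_PASSWORD is set")
    except KeyError as err:
        print(err)
```

Failing loudly when the variable is missing makes a misconfigured project easier to diagnose than silently falling back to an empty password.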
Mounting Dependencies into Engine Environments
This section describes some of the options available to you for mounting a project's dependencies into its engine environment:
Mounting Additional Dependencies from the Host
As described in the previous sections, all Cloudera Data Science Workbench projects run within an engine. By default, Cloudera Data Science Workbench will automatically mount the CDH parcel directory, and client configuration for required services such as HDFS, Spark, and YARN into the engine. However, if users want to reference any additional files/folders on the host, site administrators will need to explicitly load them into engine containers at runtime.
For instructions, see Configuring Host Mounts. Note that the directories specified here will be available to all projects across the deployment.
Directly Installing Packages within Projects
Cloudera Data Science Workbench engines are preloaded with a few common packages and libraries for R, Python, and Scala. In addition to these, Cloudera Data Science Workbench also allows you to install any other packages or libraries required by your projects, just as you would on your local computer. Each project's environment is completely isolated from others, which means you can install different versions of libraries pinned to different projects.
Libraries can be installed from the workbench, using either the built-in interactive command prompt or the terminal. Any dependencies installed this way are mounted to the project environment at /home/cdsw. Alternatively, you could choose to use a package manager such as Conda to install and maintain packages and their dependencies.
For detailed instructions, see Installing Additional Packages.
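As an illustration, a Python session could install a pinned package by invoking pip through the running interpreter. This is a minimal sketch, not the documented procedure: the helper names pip_install_command and install are hypothetical, the package name and version are placeholders, and where pip places the files depends on how pip is configured in your engine.

```python
import subprocess
import sys


def pip_install_command(package, version=None):
    """Build the pip command line for installing a (optionally pinned) package."""
    spec = f"{package}=={version}" if version else package
    # Using sys.executable ensures pip targets the interpreter running this kernel.
    return [sys.executable, "-m", "pip", "install", spec]


def install(package, version=None):
    """Run pip and raise CalledProcessError if the installation fails."""
    subprocess.check_call(pip_install_command(package, version))


if __name__ == "__main__":
    # Placeholder package and version; this only prints the command.
    print(" ".join(pip_install_command("example-package", "1.2.3")))
```

Pinning an exact version per project is what lets two projects use different versions of the same library without conflict.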
Creating a Custom Engine with the Required Package(s)
Directly installing a package to a project as described above might not always be feasible. For example, packages that require root access to be installed, or that must be installed to a path outside /home/cdsw (outside the project mount), cannot be installed directly from the workbench. For such circumstances, Cloudera recommends you extend the base Cloudera Data Science Workbench engine image to build a custom image with all the required packages installed to it.
This approach can also be used to accelerate project setup across the deployment. For example, if you want multiple projects on your deployment to have access to some common dependencies (packages, software, or drivers) out of the box, or even if a package just has a complicated setup, it might be easier to simply provide users with an engine environment that has already been customized for their project(s).
For detailed instructions with an example, see Customized Engine Images.
Managing Dependencies for Spark 2 Projects
With Spark projects, you can add external packages to Spark executors on startup. To add external dependencies to Spark jobs, specify the libraries you want added by using the appropriate configuration parameters in a spark-defaults.conf file.
For a list of the relevant properties and examples, see Managing Dependencies for Spark 2 Jobs.
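To make the file format concrete, the sketch below renders a few dependency-related properties into spark-defaults.conf syntax (one whitespace-separated key and value per line). The properties spark.jars.packages and spark.files are standard Spark settings; the helper name render_spark_defaults, the Maven coordinate, and the file path are hypothetical placeholders.

```python
def render_spark_defaults(properties):
    """Render a dict of Spark properties as spark-defaults.conf text.

    Each line has the form "<key> <value>"; keys are sorted for stable output.
    """
    lines = [f"{key} {value}" for key, value in sorted(properties.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {
        # Hypothetical Maven coordinate (groupId:artifactId:version).
        "spark.jars.packages": "org.example:example-lib_2.11:1.0.0",
        # Hypothetical file inside the project mount.
        "spark.files": "/home/cdsw/lookup.csv",
    }
    print(render_spark_defaults(example), end="")
```

The rendered text can then be saved as spark-defaults.conf in the project so that the listed libraries and files are made available to the Spark executors.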
Configuring Engine Environments for Experiments and Models
To allow for versioning of experiments and models, Cloudera Data Science Workbench executes each experiment and model in a completely isolated engine. Every time a model or experiment is kicked off, Cloudera Data Science Workbench creates a new isolated Docker image where the model or experiment is executed. These engines are built by extending the project's designated default engine image to include the code to be executed and any dependencies as specified.
For details on how this process works and how to configure these environments, see Engines for Experiments and Models.