Using NVIDIA GPUs for Cloudera Data Science Workbench Projects

Minimum Required Roles: Cloudera Manager Cluster Administrator, CDSW Site Administrator

A GPU is a specialized processor that can be used to accelerate highly parallelized computationally-intensive workloads. Because of their computational power, GPUs have been found to be particularly well-suited to deep learning workloads. Ideally, CPUs and GPUs should be used in tandem for data engineering and data science workloads. A typical machine learning workflow involves data preparation, model training, model scoring, and model fitting. You can use existing general-purpose CPUs for each stage of the workflow, and optionally accelerate the math-intensive steps with the selective application of special-purpose GPUs. For example, GPUs allow you to accelerate model fitting using frameworks such as Tensorflow, PyTorch, Keras, MXNet, and Microsoft Cognitive Toolkit (CNTK).

By enabling GPU support, data scientists can share GPU resources available on Cloudera Data Science Workbench nodes. Users can request a specific number of GPU instances, up to the total number available on a node, which are then allocated to the running session or job for the duration of the run. Projects can use isolated versions of libraries, and even different CUDA and cuDNN versions via Cloudera Data Science Workbench's extensible engine feature.

Prerequisite

GPU support for workloads was added in Cloudera Data Science Workbench 1.1.0. The rest of this topic assumes you have already installed or upgraded to the latest version of Cloudera Data Science Workbench.

Key Points to Note

  • Cloudera Data Science Workbench only supports CUDA-enabled NVIDIA GPU cards.

  • Cloudera Data Science Workbench does not support heterogeneous GPU hardware in a single deployment.

  • Cloudera Data Science Workbench does not include an engine image that supports NVIDIA libraries. Create your own custom CUDA-capable engine image using the instructions described in this topic.

  • Cloudera Data Science Workbench does not install or configure the NVIDIA drivers on the Cloudera Data Science Workbench gateway nodes. These depend on your GPU hardware and will have to be installed by your system administrator. The steps provided in this topic are generic guidelines that will help you evaluate your setup.

  • The instructions described in this topic require Internet access. If you have an airgapped deployment, you will be required to manually download and load the resources onto your nodes.

  • For a list of known issues associated with this feature, refer to Known Issues - GPU Support.

Enabling Cloudera Data Science Workbench to use GPUs

To enable GPU usage on Cloudera Data Science Workbench, perform the following steps to provision the Cloudera Data Science Workbench hosts. As noted in the following instructions, certain steps must be repeated on all gateway nodes that have GPU hardware installed on them.

The steps described in this document have been tested and validated on the following setup:

  • CDSW: 1.4.x (engine 5)

  • OS & Kernel: RHEL 7.4, kernel 3.10.0-862.el7.x86_64

  • NVIDIA Driver: 396.26

  • nvidia-docker: nvidia-docker-1.0.1-1

  • CUDA: CUDA 9.2

  • Tensorflow: 1.8.0

Set Up the Operating System and Kernel

Perform this step on all nodes with GPU hardware installed on them.

  1. Install the kernel-devel package.

    sudo yum install -y kernel-devel-`uname -r`

    If the previous command fails to find a matching version of the kernel-devel package, list all the kernel/kernel-devel versions that are available from the RHEL/CentOS package repositories, and pick the desired version to install.

    You can use a bash script as demonstrated here to do this:
    if ! yum install -y kernel-devel-`uname -r`; then
      yum install -y kernel kernel-devel; retValue=$?
      if [ $retValue -eq 0 ]; then echo "Reboot is required since a new version of the kernel was installed"; fi
    fi
  2. If you upgraded to a new kernel version in the previous step, run the following command to reboot.
    sudo reboot
  3. Install the Development tools package.
    sudo yum groupinstall -y "Development tools"
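
The package name used in step 1 is simply the running kernel release appended to kernel-devel-. The following Python sketch shows that construction so it can be reused in provisioning scripts. The helper names kernel_devel_package and yum_install_command are hypothetical and are not part of Cloudera Data Science Workbench.

```python
import platform


def kernel_devel_package(release=None):
    """Return the kernel-devel package name for a kernel release string.

    When no release is given, use the running kernel's release, which is
    the same value that `uname -r` prints.
    """
    if release is None:
        release = platform.release()
    return "kernel-devel-" + release


def yum_install_command(release=None):
    """Build the yum command from step 1 as an argument list (hypothetical helper)."""
    return ["sudo", "yum", "install", "-y", kernel_devel_package(release)]


if __name__ == "__main__":
    # Print the command for this host instead of running it.
    print(" ".join(yum_install_command()))
```

The sketch only prints the command, so you can review it before running it on a GPU node.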

Install the NVIDIA Driver on GPU Nodes

Perform this step on all nodes with GPU hardware installed on them.

Cloudera Data Science Workbench does not ship with any of the NVIDIA drivers needed to enable GPUs for general purpose processing. System administrators are expected to install the driver version that is compatible with the CUDA libraries that will be consumed on each node.

Use the NVIDIA UNIX Driver archive to find out which driver is compatible with your GPU card and operating system. Make sure the driver you select is also compatible with the nvidia-docker plugin we will be installing in the next step. See nvidia-docker installation prerequisites.

To download and install the NVIDIA driver, follow the instructions on the respective driver's download page. It is crucial that you download the correct version.

For example, if you use the .run file method (Linux 64 bit), you would download and install the driver as follows:
wget http://us.download.nvidia.com/.../NVIDIA-Linux-x86_64-<driver_version>.run
export NVIDIA_DRIVER_VERSION=<driver_version>
chmod 755 ./NVIDIA-Linux-x86_64-$NVIDIA_DRIVER_VERSION.run
./NVIDIA-Linux-x86_64-$NVIDIA_DRIVER_VERSION.run -asq
Once the installation is complete, run the following command to verify that the driver was installed correctly:
/usr/bin/nvidia-smi
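
If you want to script the verification, the following Python sketch reads the installed driver version from nvidia-smi and rebuilds the installer file name used above. It assumes that nvidia-smi is on the PATH and supports the --query-gpu=driver_version and --format=csv,noheader options. The function names are hypothetical.

```python
import shutil
import subprocess


def installer_filename(driver_version):
    """Return the .run installer file name for a driver version."""
    return "NVIDIA-Linux-x86_64-{}.run".format(driver_version)


def parse_driver_versions(query_output):
    """Parse one driver version per line and return the distinct values, sorted."""
    return sorted({line.strip() for line in query_output.splitlines() if line.strip()})


def installed_driver_versions():
    """Ask nvidia-smi for the driver version; return [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return []
    return parse_driver_versions(result.stdout)


if __name__ == "__main__":
    versions = installed_driver_versions()
    print(versions if versions else "No NVIDIA driver detected")
```

An empty result means either that the driver is not installed or that nvidia-smi could not talk to it. Run /usr/bin/nvidia-smi directly to see the underlying error.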

Enable Docker NVIDIA Volumes on GPU Nodes

Perform this step on all nodes with GPU hardware installed on them.

To enable Docker containers to use the GPUs, the previously installed NVIDIA driver libraries must be consolidated in a single directory named after the <driver_version>, and mounted into the containers. This is done using the nvidia-docker package, which is a thin wrapper around the Docker CLI and a Docker plugin.

The following sample steps demonstrate how to use nvidia-docker to set up the directory structure for the drivers so that they can be easily consumed by the Docker containers that will leverage the GPU. Perform these steps on all nodes with GPU hardware installed on them.
  1. Download and install nvidia-docker. Use a version that is suitable for your deployment.
    wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
    sudo yum install -y nvidia-docker-1.0.1-1.x86_64.rpm
  2. Start the necessary services and plugins:
    systemctl start nvidia-docker
    systemctl enable nvidia-docker
  3. Run a small container to create the Docker volume structure:
    sudo nvidia-docker run --rm nvidia/cuda:9.2-devel-ubuntu16.04 nvidia-smi
  4. Verify that the /var/lib/nvidia-docker/volumes/nvidia_driver/$NVIDIA_DRIVER_VERSION/ directory was created.
  5. Use the following Docker command to verify that Cloudera Data Science Workbench can access the GPU.
    sudo docker run --net host \
        --device=/dev/nvidiactl \
        --device=/dev/nvidia-uvm \
        --device=/dev/nvidia0 \
        -v /var/lib/nvidia-docker/volumes/nvidia_driver/$NVIDIA_DRIVER_VERSION/:/usr/local/nvidia/ \
        -it nvidia/cuda:9.2-devel-ubuntu16.04 \
        /usr/local/nvidia/bin/nvidia-smi

    On a multi-GPU machine, the output of this command will show exactly one GPU, because this sample Docker container was run with only one device (/dev/nvidia0).
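
The verification command in step 5 can be generated for more than one GPU by adding one --device flag per GPU device node. The following Python sketch builds the driver volume path and the docker argument list. It assumes the GPU device nodes are numbered /dev/nvidia0, /dev/nvidia1, and so on. The helper names are hypothetical.

```python
VOLUME_ROOT = "/var/lib/nvidia-docker/volumes/nvidia_driver"
DEFAULT_IMAGE = "nvidia/cuda:9.2-devel-ubuntu16.04"


def driver_volume_path(driver_version):
    """Return the directory that nvidia-docker creates for a driver version."""
    return "{}/{}/".format(VOLUME_ROOT, driver_version)


def docker_gpu_test_command(driver_version, gpu_count=1, image=DEFAULT_IMAGE):
    """Build the docker run argument list from step 5 for gpu_count devices."""
    command = [
        "sudo", "docker", "run", "--net", "host",
        "--device=/dev/nvidiactl",
        "--device=/dev/nvidia-uvm",
    ]
    # One device node per GPU (assumed numbering: nvidia0, nvidia1, ...).
    for index in range(gpu_count):
        command.append("--device=/dev/nvidia{}".format(index))
    command += [
        "-v", driver_volume_path(driver_version) + ":/usr/local/nvidia/",
        "-it", image,
        "/usr/local/nvidia/bin/nvidia-smi",
    ]
    return command


if __name__ == "__main__":
    # Print the command for two GPUs with the driver version from the tested setup.
    print(" ".join(docker_gpu_test_command("396.26", gpu_count=2)))
```

With gpu_count=1, the generated command matches the one shown in step 5.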

Enable GPU Support in Cloudera Data Science Workbench

Minimum Required Cloudera Manager Role: Cluster Administrator

Depending on your deployment, use one of the following sets of steps to enable Cloudera Data Science Workbench to identify the GPUs installed:

CSD Deployments

  1. Go to the CDSW service in Cloudera Manager. Click Configuration. Set the following parameters as directed in the following table.

    Enable GPU Support

    Use the checkbox to enable GPU support for Cloudera Data Science Workbench workloads. When this property is enabled on a node that is equipped with GPU hardware, the GPU(s) will be available for use by Cloudera Data Science Workbench.

    NVIDIA GPU Driver Library Path

    Complete path to the NVIDIA driver libraries. In this example, the path would be /var/lib/nvidia-docker/volumes/nvidia_driver/$NVIDIA_DRIVER_VERSION/

  2. Restart the CDSW service in Cloudera Manager.
  3. Test whether Cloudera Data Science Workbench is detecting GPUs.

RPM Deployments

  1. Set the following parameters in /etc/cdsw/config/cdsw.conf on all Cloudera Data Science Workbench nodes. You must make sure that cdsw.conf is consistent across all nodes, irrespective of whether they have GPU hardware installed on them.

    NVIDIA_GPU_ENABLE

    Set this property to true to enable GPU support for Cloudera Data Science Workbench workloads. When this property is enabled on a node that is equipped with GPU hardware, the GPU(s) will be available for use by Cloudera Data Science Workbench.

    NVIDIA_LIBRARY_PATH

    Complete path to the NVIDIA driver libraries. In this example, the path would be "/var/lib/nvidia-docker/volumes/nvidia_driver/$NVIDIA_DRIVER_VERSION/"

  2. On the master node, run the following command to restart Cloudera Data Science Workbench.
    cdsw restart
    If you modified cdsw.conf on a worker node, run the following commands to make sure the changes go into effect:
    cdsw reset
    cdsw join
  3. Use the following section to test whether Cloudera Data Science Workbench can now detect GPUs.
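
Because cdsw.conf must be consistent across all nodes, it can help to check the two GPU properties programmatically before restarting. The following Python sketch assumes cdsw.conf consists of simple KEY="value" lines with # comments. The functions parse_conf and gpu_config_problems are hypothetical helpers, not Cloudera Data Science Workbench tooling.

```python
def parse_conf(text):
    """Parse KEY=value lines into a dict, ignoring blanks and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def gpu_config_problems(settings):
    """Return a list of human-readable problems with the GPU properties."""
    problems = []
    if settings.get("NVIDIA_GPU_ENABLE", "").lower() != "true":
        problems.append("NVIDIA_GPU_ENABLE is not set to true")
    if not settings.get("NVIDIA_LIBRARY_PATH"):
        problems.append("NVIDIA_LIBRARY_PATH is empty or missing")
    return problems


if __name__ == "__main__":
    sample = (
        "# GPU settings\n"
        "NVIDIA_GPU_ENABLE=true\n"
        'NVIDIA_LIBRARY_PATH="/var/lib/nvidia-docker/volumes/nvidia_driver/396.26/"\n'
    )
    print(gpu_config_problems(parse_conf(sample)))
```

To compare nodes, parse the file from each node and check that the resulting dictionaries are equal.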

Test whether Cloudera Data Science Workbench can detect GPUs

Once Cloudera Data Science Workbench has successfully restarted, if NVIDIA drivers have been installed on the Cloudera Data Science Workbench hosts, Cloudera Data Science Workbench will now be able to detect the GPUs available on its hosts.

Additionally, the output of the following command will indicate that there are nodes with GPUs present.
cdsw status
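
As a host-side cross-check, you can also count the GPU device nodes that the NVIDIA driver exposes under /dev. The following Python sketch is not how Cloudera Data Science Workbench detects GPUs and does not replace cdsw status. It assumes one /dev/nvidia<N> node per GPU, as with the /dev/nvidia0 device used earlier, and the helper name is hypothetical.

```python
import os
import re

# Matches per-GPU nodes such as nvidia0, but not nvidiactl or nvidia-uvm.
GPU_NODE_PATTERN = re.compile(r"^nvidia(\d+)$")


def gpu_device_nodes(dev_entries):
    """Return the /dev paths of per-GPU device nodes, ordered by index."""
    matches = (GPU_NODE_PATTERN.match(name) for name in dev_entries)
    indexes = sorted(int(match.group(1)) for match in matches if match)
    return ["/dev/nvidia{}".format(index) for index in indexes]


if __name__ == "__main__":
    entries = os.listdir("/dev") if os.path.isdir("/dev") else []
    nodes = gpu_device_nodes(entries)
    print("{} GPU device node(s): {}".format(len(nodes), nodes))
```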

Create a Custom CUDA-capable Engine Image

The base engine image (docker.repository.cloudera.com/cdsw/engine:<version>) that ships with Cloudera Data Science Workbench will need to be extended with CUDA libraries to make it possible to use GPUs in jobs and sessions.

The following sample Dockerfile illustrates an engine on top of which machine learning frameworks such as Tensorflow and PyTorch can be used. This Dockerfile uses a deep learning library from NVIDIA called NVIDIA CUDA Deep Neural Network (cuDNN). For detailed information about compatibility between NVIDIA driver versions and CUDA, refer to the cuDNN installation guide (prerequisites).

Make sure you also check which version of cuDNN is required by the machine learning framework that you intend to use. As an example, Tensorflow's NVIDIA hardware and software requirements for GPU support are listed in the Tensorflow documentation here. Additionally, the Tensorflow version compatibility matrix for CUDA and cuDNN is documented here.

The following sample Dockerfile uses NVIDIA's official Dockerfiles for CUDA and cuDNN images.

cuda.Dockerfile

FROM docker.repository.cloudera.com/cdsw/engine:5

RUN NVIDIA_GPGKEY_SUM=d1be581509378368edeec8c1eb2958702feedf3bc3d17011adbf24efacce4ab5 && \
    NVIDIA_GPGKEY_FPR=ae09fe4bbd223a84b2ccfce3f60f4b3d7fa2af80 && \
    apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub && \
    apt-key adv --export --no-emit-version -a $NVIDIA_GPGKEY_FPR | tail -n +5 > cudasign.pub && \
    echo "$NVIDIA_GPGKEY_SUM  cudasign.pub" | sha256sum -c --strict - && rm cudasign.pub && \
    echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda.list

ENV CUDA_VERSION 9.2.148
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"

ENV CUDA_PKG_VERSION 9-2=$CUDA_VERSION-1
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-$CUDA_PKG_VERSION && \
    ln -s cuda-9.2 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*

RUN echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/cuda.conf && \
    ldconfig

RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

RUN echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list

ENV CUDNN_VERSION 7.2.1.38
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"

RUN apt-get update && apt-get install -y --no-install-recommends \
            libcudnn7=$CUDNN_VERSION-1+cuda9.2 && \
    apt-mark hold libcudnn7 && \
    rm -rf /var/lib/apt/lists/*
You can now build a custom engine image out of cuda.Dockerfile using the following sample command:
docker build --network host -t <company-registry>/cdsw-cuda:5 . -f cuda.Dockerfile

Push this new engine image to a public Docker registry so that it can be made available for Cloudera Data Science Workbench workloads. For example:

docker push <company-registry>/cdsw-cuda:5
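
When you adapt cuda.Dockerfile to other versions, the apt package pins are derived from the CUDA_VERSION and CUDNN_VERSION values. The following Python sketch reproduces how the sample Dockerfile combines them. It assumes other releases follow the same naming pattern, which you should confirm against NVIDIA's repositories. The function names are hypothetical.

```python
def cuda_pkg_version(cuda_version):
    """Mirror CUDA_PKG_VERSION: '9.2.148' becomes '9-2=9.2.148-1'."""
    major, minor = cuda_version.split(".")[:2]
    return "{}-{}={}-1".format(major, minor, cuda_version)


def cudart_package(cuda_version):
    """Return the pinned cuda-cudart package installed by the Dockerfile."""
    return "cuda-cudart-" + cuda_pkg_version(cuda_version)


def cudnn_package(cudnn_version, cuda_version):
    """Return the pinned libcudnn package for a cuDNN and CUDA version pair."""
    cudnn_major = cudnn_version.split(".")[0]
    cuda_major, cuda_minor = cuda_version.split(".")[:2]
    return "libcudnn{}={}-1+cuda{}.{}".format(
        cudnn_major, cudnn_version, cuda_major, cuda_minor
    )


if __name__ == "__main__":
    # Versions taken from the sample Dockerfile.
    print(cudart_package("9.2.148"))
    print(cudnn_package("7.2.1.38", "9.2.148"))
```

For the versions in the sample Dockerfile, the helpers return cuda-cudart-9-2=9.2.148-1 and libcudnn7=7.2.1.38-1+cuda9.2, which match the packages installed there.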

Allocate GPUs for Sessions and Jobs

Required CDSW Role: Site Administrator

Once Cloudera Data Science Workbench has been enabled to use GPUs, a site administrator must whitelist the CUDA-capable engine image created in the previous step. Site administrators can also set a limit on the maximum number of GPUs that can be allocated per session or job.
  1. Sign in to Cloudera Data Science Workbench.
  2. Click Admin.
  3. Go to the Engines tab.
  4. From the Maximum GPUs per Session/Job dropdown, select the maximum number of GPUs that can be used by an engine.
  5. Under Engine Images, add the custom CUDA-capable engine image created in the previous step. This whitelists the image and allows project administrators to use the engine in their jobs and sessions.
  6. Click Update.
Project administrators can now whitelist the CUDA engine image to make it available for sessions and jobs within a particular project.
  1. Navigate to the project's Overview page.
  2. Click Settings.
  3. Go to the Engines tab.
  4. Under Engine Image, select the CUDA-capable engine image from the dropdown.
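
The two limits described in this topic, the site-wide Maximum GPUs per Session/Job and the number of GPUs available on a node, can be illustrated with a short Python sketch. This is only an illustration of the stated rules under assumed inputs. It is not Cloudera Data Science Workbench's actual scheduling logic, and check_gpu_request is a hypothetical function.

```python
def check_gpu_request(requested, max_per_session, available_on_node):
    """Return (ok, reason) for a GPU request checked against both limits."""
    if requested < 0:
        return False, "requested GPU count cannot be negative"
    if requested > max_per_session:
        return False, "exceeds the site-wide maximum of {} GPU(s) per session or job".format(
            max_per_session
        )
    if requested > available_on_node:
        return False, "exceeds the {} GPU(s) available on the node".format(
            available_on_node
        )
    return True, "ok"


if __name__ == "__main__":
    # Assumed example values: site limit of 2, node with 4 GPUs.
    for requested in (1, 2, 3):
        print(requested, check_gpu_request(requested, max_per_session=2, available_on_node=4))
```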

Example: Tensorflow

This example walks you through a simple Tensorflow workload that uses GPUs.

  1. Open the workbench console and start a Python session. Make sure you select at least 1 GPU from the Select Engine Profile dropdown before you launch the session.
  2. Install Tensorflow.

    Python 2
    !pip install tensorflow-gpu==1.8.0
    Python 3
    !pip3 install tensorflow-gpu==1.8.0
  3. Restart the session. This requirement of a session restart after installation is a known issue specific to Tensorflow.
  4. Create a new file with the following sample code. The code first performs a multiplication operation and prints the session output, which should mention the GPU that was used for the computation. The latter half of the example prints a list of all available GPUs for this engine.
    import tensorflow as tf
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    
    # Creates a session with log_device_placement set to True.
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    
    # Runs the operation.
    print(sess.run(c))
    
    # Prints a list of GPUs available 
    from tensorflow.python.client import device_lib
    def get_available_gpus():
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if x.device_type == 'GPU']
    
    print(get_available_gpus())
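
To confirm that the GPU computation returned the right values, you can compare the printed matrix with a plain Python calculation of the same product. The following sketch does not use Tensorflow or a GPU. The matmul helper is written here only for the comparison.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Same values as the constants a (2x3) and b (3x2) in the Tensorflow example.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

expected = matmul(a, b)
print(expected)  # [[22.0, 28.0], [49.0, 64.0]]
```

The output of sess.run(c) in the Tensorflow example should contain the same four values.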