Using NVIDIA GPUs for Cloudera Data Science Workbench Projects

Minimum Required Roles: Cloudera Manager Cluster Administrator, CDSW Site Administrator

A GPU is a specialized processor that can be used to accelerate highly parallelized, computationally intensive workloads. Because of their computational power, GPUs have been found to be particularly well-suited to deep learning workloads. Ideally, CPUs and GPUs should be used in tandem for data engineering and data science workloads. A typical machine learning workflow involves data preparation, model training, model scoring, and model fitting. You can use existing general-purpose CPUs for each stage of the workflow, and optionally accelerate the math-intensive steps with the selective application of special-purpose GPUs. For example, GPUs allow you to accelerate model fitting using frameworks such as Tensorflow, PyTorch, Keras, MXNet, and Microsoft Cognitive Toolkit (CNTK).

By enabling GPU support, data scientists can share the GPU resources available on Cloudera Data Science Workbench hosts. Users can request a specific number of GPU instances, up to the total number available on a host, which are then allocated to the running session or job for the duration of the run. Projects can use isolated versions of libraries, and even different CUDA and cuDNN versions, via Cloudera Data Science Workbench's extensible engine feature.

Prerequisite

This topic assumes you have already installed or upgraded to the latest version of Cloudera Data Science Workbench.

Key Points to Note

  • Cloudera Data Science Workbench only supports CUDA-enabled NVIDIA GPU cards.

  • Cloudera Data Science Workbench does not support heterogeneous GPU hardware in a single deployment.

  • Cloudera Data Science Workbench does not include an engine image that supports NVIDIA libraries. Create your own custom CUDA-capable engine image using the instructions described in this topic.

  • Cloudera Data Science Workbench does not install or configure the NVIDIA drivers on the Cloudera Data Science Workbench gateway hosts. These depend on your GPU hardware and will have to be installed by your system administrator. The steps provided in this topic are generic guidelines that will help you evaluate your setup.

  • The instructions described in this topic require Internet access. If you have an airgapped deployment, you will be required to manually download and load the resources onto your hosts.

  • For a list of known issues associated with this feature, refer to Known Issues - GPU Support.

Enabling Cloudera Data Science Workbench to use GPUs

To enable GPU usage on Cloudera Data Science Workbench, perform the following steps to provision the Cloudera Data Science Workbench hosts. As noted in the following instructions, certain steps must be repeated on all gateway hosts that have GPU hardware installed on them.

The steps described in this document have been tested and validated on the following setup:
  CDSW               OS & Kernel                             NVIDIA Driver   CUDA
  1.6.x (engine 8)   RHEL 7.4 (3.10.0-862.9.1.el7.x86_64)    418.56          10.0
  1.6.x (engine 8)   RHEL 7.6 (3.10.0-957.12.2.el7.x86_64)   418.56          10.0

Set Up the Operating System and Kernel

Perform this step on all hosts with GPU hardware installed on them.

  1. Install the kernel-devel package.

    sudo yum install -y kernel-devel-`uname -r`

    If the previous command fails to find a matching version of the kernel-devel package, list all the kernel/kernel-devel versions that are available from the RHEL/CentOS package repositories, and pick the desired version to install.

    Alternatively, you can use a bash script such as the following, which falls back to installing the latest available kernel and kernel-devel packages:
    if ! yum install -y kernel-devel-`uname -r`; then
      yum install -y kernel kernel-devel
      retValue=$?
      if [ $retValue -eq 0 ]; then
        echo "Reboot is required since a new version of the kernel was installed"
      fi
    fi
  2. If you upgraded to a new kernel version in the previous step, run the following command to reboot.
    sudo reboot
  3. Install the Development tools package.
    sudo yum groupinstall -y "Development tools"
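Before moving on, it can help to confirm that the installed kernel-devel package actually matches the running kernel. A minimal check, assuming a RHEL/CentOS host where rpm is available:

```shell
# Print the running kernel and check for a matching kernel-devel package.
# If no match is reported, repeat step 1 above (and reboot if a newer
# kernel was pulled in).
running_kernel=$(uname -r)
echo "running kernel: ${running_kernel}"
rpm -q "kernel-devel-${running_kernel}" && echo "kernel-devel matches" \
  || echo "kernel-devel for ${running_kernel} is not installed"
```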

Install the NVIDIA Driver on GPU Hosts

Perform this step on all hosts with GPU hardware installed on them.

Cloudera Data Science Workbench does not ship with any of the NVIDIA drivers needed to enable GPUs for general purpose processing. System administrators are expected to install the version of the drivers that are compatible with the CUDA libraries that will be consumed on each host.

Use the NVIDIA UNIX Driver archive to find out which driver is compatible with your GPU card and operating system. To download and install the NVIDIA driver, follow the instructions on the respective driver's download page. It is crucial that you download the correct version.

For example, if you use the .run file method (Linux 64 bit), you would download and install the driver as follows:
wget http://us.download.nvidia.com/.../NVIDIA-Linux-x86_64-<driver_version>.run
export NVIDIA_DRIVER_VERSION=<driver_version>
chmod 755 ./NVIDIA-Linux-x86_64-$NVIDIA_DRIVER_VERSION.run
./NVIDIA-Linux-x86_64-$NVIDIA_DRIVER_VERSION.run -asq
Once the installation is complete, run the following command to verify that the driver was installed correctly:
/usr/bin/nvidia-smi

Enable GPU Support in Cloudera Data Science Workbench

Minimum Required Cloudera Manager Role: Cluster Administrator

Depending on your deployment, use one of the following sets of steps to enable Cloudera Data Science Workbench to identify the GPUs installed:

CSD Deployments

  1. Go to the CDSW service in Cloudera Manager. Click Configuration. Search for the following property and enable it:

    Enable GPU Support

    Use the checkbox to enable GPU support for Cloudera Data Science Workbench workloads. When this property is enabled on a host that is equipped with GPU hardware, the GPU(s) will be available for use by Cloudera Data Science Workbench.

  2. Restart the CDSW service in Cloudera Manager.
  3. Test whether Cloudera Data Science Workbench is detecting GPUs.

RPM Deployments

  1. Set the following parameter in /etc/cdsw/config/cdsw.conf on all Cloudera Data Science Workbench hosts. You must make sure that cdsw.conf is consistent across all hosts, irrespective of whether they have GPU hardware installed on them.

    NVIDIA_GPU_ENABLE

    Set this property to true to enable GPU support for Cloudera Data Science Workbench workloads. When this property is enabled on a host that is equipped with GPU hardware, the GPU(s) will be available for use by Cloudera Data Science Workbench.

  2. On the master host, run the following command to restart Cloudera Data Science Workbench.
    cdsw restart
    If you modified cdsw.conf on a worker host, run the following commands to make sure the changes go into effect:
    cdsw stop
    cdsw join
  3. Use the following section to test whether Cloudera Data Science Workbench can now detect GPUs.
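For illustration, the change from step 1 amounts to a single line in /etc/cdsw/config/cdsw.conf (a minimal sketch; leave the rest of the file unchanged, and keep it identical across all hosts):

```shell
# /etc/cdsw/config/cdsw.conf (fragment)
# Enables GPU support; set on every host so the file stays consistent.
NVIDIA_GPU_ENABLE=true
```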

Test whether Cloudera Data Science Workbench can detect GPUs

Once Cloudera Data Science Workbench has successfully restarted, and provided the NVIDIA drivers have been installed on the Cloudera Data Science Workbench hosts, Cloudera Data Science Workbench will be able to detect the GPUs available on its hosts.

The output of the following command will also indicate that there are hosts with GPUs present.
cdsw status

Create a Custom CUDA-capable Engine Image

The base engine image (docker.repository.cloudera.com/cdsw/engine:<version>) that ships with Cloudera Data Science Workbench will need to be extended with CUDA libraries to make it possible to use GPUs in jobs and sessions.

The following sample Dockerfile illustrates an engine on top of which machine learning frameworks such as Tensorflow and PyTorch can be used. This Dockerfile uses a deep learning library from NVIDIA called NVIDIA CUDA Deep Neural Network (cuDNN). For detailed information about compatibility between NVIDIA driver versions and CUDA, refer to the cuDNN installation guide (prerequisites).

Make sure you also check the documentation for the machine learning framework you intend to use, to determine which version of cuDNN is needed. For example, Tensorflow's NVIDIA hardware and software requirements for GPU support are listed in the Tensorflow documentation here. Additionally, the Tensorflow version compatibility matrix for CUDA and cuDNN is documented here.

The following sample Dockerfile uses NVIDIA's official Dockerfiles for CUDA and cuDNN images.

cuda.Dockerfile

FROM docker.repository.cloudera.com/cdsw/engine:8

RUN NVIDIA_GPGKEY_SUM=d1be581509378368edeec8c1eb2958702feedf3bc3d17011adbf24efacce4ab5 && \
    NVIDIA_GPGKEY_FPR=ae09fe4bbd223a84b2ccfce3f60f4b3d7fa2af80 && \
    apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub && \
    apt-key adv --export --no-emit-version -a $NVIDIA_GPGKEY_FPR | tail -n +5 > cudasign.pub && \
    echo "$NVIDIA_GPGKEY_SUM  cudasign.pub" | sha256sum -c --strict - && rm cudasign.pub && \
    echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda.list

ENV CUDA_VERSION 10.0.130
LABEL com.nvidia.cuda.version="${CUDA_VERSION}"

ENV CUDA_PKG_VERSION 10-0=$CUDA_VERSION-1
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-$CUDA_PKG_VERSION && \
    ln -s cuda-10.0 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*

RUN echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/cuda.conf && \
    ldconfig

RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

RUN echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list

ENV CUDNN_VERSION 7.5.1.10
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"

RUN apt-get update && apt-get install -y --no-install-recommends \
            libcudnn7=$CUDNN_VERSION-1+cuda10.0 && \
    apt-mark hold libcudnn7 && \
    rm -rf /var/lib/apt/lists/*
You can now build a custom engine image out of cuda.Dockerfile using the following sample command:
docker build --network host -t <company-registry>/cdsw-cuda:8 . -f cuda.Dockerfile

Push this new engine image to a public Docker registry so that it can be made available for Cloudera Data Science Workbench workloads. For example:

docker push <company-registry>/cdsw-cuda:8

Site Admins: Add the Custom CUDA Engine to your Cloudera Data Science Workbench Deployment

Required CDSW Role: Site Administrator

After you've created the custom CUDA engine, a site administrator must add this new engine to Cloudera Data Science Workbench.
  1. Sign in to Cloudera Data Science Workbench.
  2. Click Admin.
  3. Go to the Engines tab.
  4. Under Engine Images, add the custom CUDA-capable engine image created in the previous step. This allows project administrators across the deployment to start using this engine in their jobs and sessions.
  5. Site administrators can also set a limit on the maximum number of GPUs that can be allocated per session or job. From the Maximum GPUs per Session/Job dropdown, select the maximum number of GPUs that can be used by an engine.
  6. Click Update.

Project Admins: Enable the Custom CUDA Engine for your Project

Project administrators can use the following steps to make the CUDA engine the default engine used for workloads within a particular project.

  1. Navigate to your project's Overview page.
  2. Click Settings.
  3. Go to the Engines tab.
  4. Under Engine Image, select the CUDA-capable engine image from the dropdown.

Test the Custom CUDA Engine

You can use the following simple examples to test whether the new CUDA engine is able to leverage GPUs as expected.

  1. Go to a project that is using the CUDA engine and click Open Workbench.
  2. Launch a new session with GPUs.
  3. Run the following command in the workbench command prompt to verify that the driver was installed correctly:
    ! /usr/bin/nvidia-smi
  4. Use any of the following code samples to confirm that the new engine works with common deep learning libraries.

    Pytorch

    !pip3 install torch
    from torch import cuda
    assert cuda.is_available()
    assert cuda.device_count() > 0
    print(cuda.get_device_name(cuda.current_device()))

    Tensorflow

    !pip3 install tensorflow-gpu==1.13.1
    from tensorflow.python.client import device_lib
    assert 'GPU' in str(device_lib.list_local_devices())
    device_lib.list_local_devices()

    Keras

    !pip3 install keras
    from keras import backend
    assert len(backend.tensorflow_backend._get_available_gpus()) > 0
    print(backend.tensorflow_backend._get_available_gpus())