Customizing Engine Images

Cloudera Data Science Workbench site administrators and project administrators can add libraries and other dependencies to the Docker image in which their engines run. Currently, Cloudera Data Science Workbench supports only public Docker images hosted in registries that the Cloudera Data Science Workbench nodes can access.

Site administrators can whitelist images for use in projects, and project administrators can select which of these whitelisted images their projects use.

Example: MeCab

The following Dockerfile shows how to add MeCab, a Japanese text tokenizer, to the base Cloudera Data Science Workbench engine.

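# Dockerfile

# Start from the base Cloudera Data Science Workbench engine image.
FROM docker.repository.cloudera.com/cdsw/engine:1

# Install MeCab, its development files, and the UTF-8 IPA dictionary,
# then clear the apt caches to keep the image small.
RUN apt-get update && \
    apt-get install -y -q mecab \
                          libmecab-dev \
                          mecab-ipadic-utf8 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Install the mecab-ipadic-NEologd dictionary under /var/lib/mecab/dic/neologd,
# then delete the cloned repository.
RUN cd /tmp && \
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git && \
    /tmp/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -y -n -p /var/lib/mecab/dic/neologd && \
    rm -rf /tmp/mecab-ipadic-neologd

# Upgrade pip, then install the MeCab Python bindings.
RUN pip install --upgrade pip
RUN pip install mecab-python==0.996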
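
Once an image has been built from the Dockerfile above, a short Python script run inside it can confirm that the MeCab bindings installed by the last RUN line are importable and can tokenize text. The following is only a minimal sketch: the function name mecab_smoke_test and the sample sentence are illustrative choices, not part of Cloudera Data Science Workbench, and the script assumes the default dictionary installed by the Dockerfile is the one MeCab picks up.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def mecab_smoke_test(text):
    """Return MeCab's analysis of `text`, or None if the bindings are missing.

    Hypothetical helper for illustration only.
    """
    try:
        import MeCab  # module provided by the mecab-python package
    except ImportError:
        return None
    # Tagger() raises RuntimeError if MeCab cannot find a dictionary.
    tagger = MeCab.Tagger()
    return tagger.parse(text)


# Sample Japanese sentence; any text can be used here.
result = mecab_smoke_test("すもももももももものうち")
if result is None:
    print("MeCab bindings are not importable in this environment")
else:
    print(result)
```

If the script reports that the bindings are not importable, the image in use is probably not the one built from this Dockerfile.
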

To use this image in your Cloudera Data Science Workbench project, perform the following steps:
  1. Build your image with the Dockerfile.
    docker build -t <company-registry>/user/cdsw-mecab:latest . -f Dockerfile
  2. Push the image to your company's Docker registry.
    docker push <company-registry>/user/cdsw-mecab:latest
  3. Whitelist the image, <company-registry>/user/cdsw-mecab:latest. Only a site administrator can do this.
    1. Log in as a site administrator.
    2. Click Admin.
    3. Go to the Engines tab.
    4. Add <company-registry>/user/cdsw-mecab:latest to the list of whitelisted engine images.
  4. Make the whitelisted image available to your project. Only a project administrator can do this.
    1. Go to the project Settings page.
    2. Click Engines.
    3. Select <company-registry>/user/cdsw-mecab:latest from the dropdown list of available Docker images. Sessions and jobs that you run in your project now have access to this custom image.
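
Steps 1 and 2 above (build and push) can be scripted so that the image name is typed only once. The following is a minimal sketch, not a Cloudera-provided tool: the helper names image_ref, docker_commands, and build_and_push are hypothetical, registry.example.com is a placeholder for your company's registry, and the script assumes the docker command-line client is on the PATH. By default it only prints the commands; pass dry_run=False to run them.

```python
from __future__ import print_function

import subprocess


def image_ref(registry, repository, tag="latest"):
    """Compose <registry>/<repository>:<tag> (hypothetical helper)."""
    return "{0}/{1}:{2}".format(registry.rstrip("/"), repository.strip("/"), tag)


def docker_commands(ref, dockerfile="Dockerfile", context="."):
    """Return the build and push commands from steps 1 and 2 as argument lists."""
    return [
        ["docker", "build", "-t", ref, "-f", dockerfile, context],
        ["docker", "push", ref],
    ]


def build_and_push(ref, dry_run=True):
    """Print each command, and run it only when dry_run is False."""
    for cmd in docker_commands(ref):
        print(" ".join(cmd))
        if not dry_run:
            # check_call raises CalledProcessError if docker exits non-zero.
            subprocess.check_call(cmd)


# registry.example.com is a placeholder, not a real registry.
ref = image_ref("registry.example.com", "user/cdsw-mecab")
build_and_push(ref, dry_run=True)
```

The whitelisting and project selection in steps 3 and 4 are still performed in the web interface as described above.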