Cloudera Data Science Workbench User Guide

As a Cloudera Data Science Workbench user, you can create and run data science workloads, either individually or in teams. Cloudera Data Science Workbench uses the notion of contexts to separate your personal account from any team accounts you belong to. Depending on the context you are in, you will be able to modify settings for either your personal account, or a team account, and see the projects created in each account. Shared personal projects will show up in your personal account context. Context changes in the UI are subtle, so if you're wondering where a project or setting lives, first make sure you are in the right context.

The application header will tell you which context you are currently in. You can switch to a different context by going to the drop-down menu in the upper right-hand corner of the page.



The rest of this topic features instructions for some common tasks a Cloudera Data Science Workbench user can be expected to perform.

Managing your Personal Account

To manage your personal account settings:

  1. Sign in to Cloudera Data Science Workbench.
  2. From the upper right drop-down menu, switch context to your personal account.
  3. Click the Settings tab.
    Profile
    You can modify your name, email, and bio on this page.
    Teams
    This page lists the teams you are a part of and the role assigned to you for each team.
    SSH Keys
    Your public SSH key resides here. SSH keys provide a useful way to access external resources such as databases or remote Git repositories. For instructions, see SSH Keys.
    Hadoop Authentication
    Enter your Hadoop credentials here to authenticate yourself against the cluster KDC. For more information, see Hadoop Authentication with Kerberos for Cloudera Data Science Workbench.

Managing Team Accounts

Users who work together on more than one project and want to facilitate collaboration can create a Team. Teams allow streamlined administration of projects. Team projects are owned by the team, rather than an individual user. Team administrators can add or remove members at any time, assigning each member different permissions.

Creating a Team

To create a team:

  1. Click the plus sign (+) in the title bar, to the right of the Search field.
  2. Select Create Team.
  3. Enter a Team Name.
  4. Click Create Team.
  5. Add or invite team members. Team members can have one of the following privilege levels:
    • Viewer - Cannot create new projects within the team, but can be added to existing ones.
    • Contributor - Can create new projects within the team, and can also be added to existing team projects.
    • Admin - Has complete access to all team projects, and to account and billing information.
  6. Click Done.

Modifying Team Account Settings

Team administrators can modify account information, add or invite new team members, and view/edit privileges of existing members. To make these changes:
  1. From the upper right drop-down menu, switch context to the team account.
  2. Click the Settings tab to open up the Account Settings dashboard.
    Profile
    Modify the team description on this page.
    Members
    You can add new team members on this page, and modify privilege levels for existing members.
    SSH Keys
    The team's public SSH key resides here. Team SSH keys provide a useful way to give an entire team access to external resources such as databases. For instructions, see SSH Keys. Generally, team SSH keys should not be used to authenticate against Git repositories. Use your personal key instead.

Managing Projects

Projects form the heart of Cloudera Data Science Workbench. They hold all the code, configuration, and libraries needed to reproducibly run analyses. Each project is independent, ensuring users can work freely without interfering with one another or breaking existing workloads.

Creating a Project

To create a Cloudera Data Science Workbench project:
  1. Go to the Projects tab.
  2. Click New Project.
  3. If you are a member of a team, from the drop-down menu, select the Account under which you want to create this project. If there is only one account on the deployment, you will not see this option.
  4. Enter a Project Name.
  5. Select Project Visibility from one of the following options.
    • Private - Only project collaborators can view or edit the project.
    • Team - If the project is created under a team account, all members of the team can view the project. Only explicitly-added collaborators can edit the project.
    • Public - All authenticated users of Cloudera Data Science Workbench will be able to view the project. Collaborators will be able to edit the project.
  6. Under Initial Setup, you can either create a blank project, or select one of the following sources for your project files.
    • Template - Template projects contain example code that can help you get started with Cloudera Data Science Workbench. They are available in R, Python, PySpark, and Scala. Using a template project is not required, but it lets you start using Cloudera Data Science Workbench right away.
    • Local - If you have an existing project on your local disk, use this option to upload a compressed file or folder to Cloudera Data Science Workbench.
    • Git - If you already use Git for version control and collaboration, you can continue to do so with the Cloudera Data Science Workbench. Specifying a Git URL will clone the project into Cloudera Data Science Workbench. If you use a Git SSH URL, your personal private SSH key will be used to clone the repository. This is the recommended approach. However, you must add the public SSH key from your personal Cloudera Data Science Workbench account to the remote Git hosting service before you can clone the project.
  7. Click Create Project. After the project is created, you can see your project files and the list of jobs defined in your project.
  8. (Optional) To work with team members on a project, use the instructions in the following section to add them as collaborators to the project.
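
Because the Git option in step 6 behaves differently for SSH URLs (your personal SSH key is used to clone), it can help to check which kind of URL you have before pasting it into the form. The following is a minimal Python sketch and is not part of Cloudera Data Science Workbench; the function name uses_ssh and the example.com repository URLs are hypothetical.

```python
from urllib.parse import urlparse


def uses_ssh(git_url: str) -> bool:
    """Return True if git_url looks like a Git SSH URL.

    Recognizes the two common SSH forms:
      ssh://git@host/path/repo.git
      git@host:path/repo.git   (scp-like syntax)
    """
    if urlparse(git_url).scheme == "ssh":
        return True
    # scp-like syntax has no "scheme://" prefix and puts a colon
    # after the host, before any slash.
    if "://" not in git_url and ":" in git_url:
        host_part = git_url.split(":", 1)[0]
        # Skip single-letter prefixes, which are Windows drive letters.
        return "/" not in host_part and len(host_part) > 1
    return False


# Hypothetical repository URLs, for illustration only.
for url in ("git@example.com:team/project.git",
            "https://example.com/team/project.git"):
    print(url, "-> SSH" if uses_ssh(url) else "-> not SSH")
```

If the URL is an SSH URL, remember to add your personal public SSH key to the Git hosting service first, as described in step 6.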

Adding Project Collaborators

If you want to work closely with colleagues on a particular project, use the following steps to add them to the project.
  1. Navigate to the project overview page.
  2. Click Team to open the Collaborators tab.
  3. Search for collaborators by either name or email address and click Add.

    For a project created under your personal account, anyone who belongs to your organization can be added as a collaborator. For a project created under a team account, you can only add collaborators that already belong to the team. If you want to work on a project that requires collaborators from different teams, create a new team with the required members, then create a project under that account. If your project was created from a Git repository, each collaborator will have to create the project from the same central Git repository.

    You can grant collaborators one of three levels of access:
    • Viewer - Can view code, data, and results.
    • Contributor - Can view, edit, create, and delete files and environment variables, run jobs, and execute code in running jobs.
    • Admin - Has complete access to all aspects of the project, including adding new collaborators and deleting the entire project.

For more information on collaborating effectively, see Sharing Projects and Analysis Results.

Modifying Project Settings

Project contributors and administrators can modify aspects of the project environment, such as the engine used to launch sessions and the environment variables, and can create SSH tunnels to access external resources. To make these changes:
  1. Switch context to the account where the project was created.
  2. Click the Projects tab.
  3. From the list of projects, select the one you want to modify.
  4. Click the Settings tab to open up the Project Settings dashboard.
    Options
    Modify the project name and its privacy settings on this page.
    Engine
    Cloudera Data Science Workbench ensures that your code is always run with the specific engine version you selected. You can select the version here. For advanced use cases, projects can use custom Docker images. Site administrators can whitelist images for use in projects, and project administrators can use this page to select which of these whitelisted images is installed for their projects. For an example, see Customizing Engine Images.

    Environment
    If there are any environment variables that should be injected into all the engines running this project, you can add them on this page. For more details, see Project Environment Variables.

    Tunnels
    In some environments, external databases and data sources reside behind restrictive firewalls. Cloudera Data Science Workbench provides a convenient way to connect to such resources using your SSH key. For instructions, see SSH Tunnels.
    Git
    This page lists a webhook that can be added to your Git configuration to ensure that your project files are updated with the latest changes from the remote repository.
    Delete Project
    This page can only be accessed by project administrators. Remember that deleting a project is irreversible. All files, data, sessions, and jobs will be lost.
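
Variables added on the Environment page are injected into the engines running the project, so code in a session can read them like any other environment variable. The following is a minimal Python sketch of that pattern; the helper name get_setting, the variable name DATABASE_HOST, and the fallback value are hypothetical.

```python
import os


def get_setting(name: str, default: str = "") -> str:
    """Read a setting from the process environment, with a fallback."""
    return os.environ.get(name, default)


# DATABASE_HOST is a hypothetical variable name used for illustration.
db_host = get_setting("DATABASE_HOST", "localhost")
print("Database host:", db_host)
```

Supplying a fallback keeps the script usable in environments where the variable has not been defined.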

Using the Workbench

The project workbench provides an interactive environment tailored for data science. It currently supports R, Python, and Scala engines. You can use these engines in isolation, as you would on your laptop, or connect to your CDH cluster using Cloudera Distribution of Apache Spark 2 and other libraries.

The workbench includes four primary components:

  • An editor where you can edit your scripts.
  • A console where you can track the results of your analysis.
  • A command prompt where you can enter commands interactively.
  • A terminal where you can use a Bash shell.

Launch a Session

To launch a session:

  1. Click Open Workbench in the project overview.
  2. Use Select Engine Kernel to choose your language.
  3. Use Select Engine Profile to select the number of CPU cores and memory.
  4. Click Launch Session.

The command prompt at the bottom right of your browser window turns green when the engine is ready. Sessions typically take between 10 and 20 seconds to start.

Execute Code

You can enter and execute code at the command prompt or the editor. The editor is best for code you want to keep, while the command prompt is best for quick interactive exploration.

If you want to enter more than one line of code at the command prompt, use Shift-Enter to move to the next line. Press Enter to run your code. The output of your code, including plots, appears in the console.

If you created your project from a template, there are code files in the editor. You can open a file in the editor by double-clicking the file name in the file list.

To run code in the editor:

  1. Select a code file in the list on the left.
  2. Highlight the code you want to run.
  3. Press Ctrl-Enter (Windows/Linux) or Command-Enter (macOS).

When doing real analysis, writing and executing your code from the editor rather than the command prompt makes it easy to iteratively develop your code and save it along the way.
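
As an illustration of the kind of code you might keep in the editor, here is a short, self-contained Python script that you could save in a project file and run in whole or in part. The sample values are made up for illustration and do not come from any template project.

```python
import statistics

# Hypothetical sample data; replace with values from your own analysis.
response_times_ms = [120, 135, 128, 160, 142]

mean_ms = statistics.mean(response_times_ms)
median_ms = statistics.median(response_times_ms)

print("mean:", mean_ms)
print("median:", median_ms)
```

Highlighting only the first two statements and running them defines the data without printing anything, which is a convenient way to build up an analysis step by step.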

If you require more space for your editor, you can collapse the file list by double-clicking between the file list pane and the editor pane. You can hide the editor using the editor's View menu.

Access the Terminal

Cloudera Data Science Workbench provides full terminal access to running engines from the web console. If you run klist you should see your authenticated Kerberos principal. If you run hdfs dfs -ls you will see the files stored in your HDFS home directory. You do not need to worry about Kerberos authentication.

Use the terminal to move files around, run Git commands, access the YARN and Hadoop CLIs, or install libraries that cannot be installed directly from the engine. You can access the Terminal from a running Session page by clicking the Terminal Access tab above the session log pane.
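
The same command-line tools can also be called from code running in the engine. The following Python sketch wraps a CLI call with the standard subprocess module. The helper name run_cli and the 30-second timeout are assumptions, and whether klist and hdfs are available on the PATH depends on your deployment.

```python
import subprocess
from typing import List, Optional


def run_cli(args: List[str], timeout: int = 30) -> Optional[str]:
    """Run a command and return its standard output.

    Returns None if the command is not installed, times out,
    or exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            args, capture_output=True, text=True, check=True, timeout=timeout
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout


# The two commands mentioned in this section.
for command in (["klist"], ["hdfs", "dfs", "-ls"]):
    output = run_cli(command)
    print(" ".join(command), "->",
          output if output is not None else "not available")
```

Returning None instead of raising keeps the loop going when a tool is missing, which is useful when the same script runs on differently configured engines.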

All of your project files are in /home/cdsw. Any modifications you make to this folder will persist across runs, while modifications to other folders are discarded.
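
In practice, this means results you want to keep should be written somewhere under the project folder. The following is a minimal Python sketch; the function name save_results, the results subfolder, and the file name summary.json are hypothetical, and the project directory is a parameter so the function can also be tried outside an engine.

```python
import json
from pathlib import Path


def save_results(results: dict, project_dir: str = "/home/cdsw") -> Path:
    """Write results as JSON under <project_dir>/results and return the path."""
    out_dir = Path(project_dir) / "results"  # hypothetical subfolder
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "summary.json"      # hypothetical file name
    out_file.write_text(json.dumps(results, indent=2))
    return out_file
```

Inside a session, calling save_results with a dictionary of your own values writes the file under /home/cdsw, where it persists across runs. A file written to a folder outside /home/cdsw would be discarded.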

By default, the terminal does not provide root or sudo access to the container. To install packages that require root access, see Customizing Engine Images.

Stop a Session

When you are done with the session, click Stop in the menu bar above the console, or use code to exit by typing the following command:

R

quit()

Python

exit

Scala

quit()

Sessions automatically stop after an hour of inactivity.

Accessing Cloudera Manager and Hue from Cloudera Data Science Workbench

Cloudera Data Science Workbench gives you a way to access your cluster's Cloudera Manager and Hue web UIs from within the Cloudera Data Science Workbench application. To access these applications, click in the upper right-hand corner of the Cloudera Data Science Workbench web application, and select the UI you want to visit from the drop-down menu.