Getting Started with Cloudera Data Science Workbench

Signing up

Sign up by opening the Cloudera Data Science Workbench console in a web browser. The first time you log in, you are prompted to create a username and password.

If your site administrator has configured your cluster to require invitations, you will need an invitation link to sign up.

(On Secure Clusters) Apache Hadoop Authentication with Kerberos

Cloudera Data Science Workbench users can authenticate themselves using Kerberos against the cluster KDC defined in the host's /etc/krb5.conf file. Cloudera Data Science Workbench does not assume that your Kerberos principal is always the same as your login information. Therefore, you will need to make sure Cloudera Data Science Workbench knows your Kerberos identity when you sign in.

Authenticate against your cluster’s Kerberos KDC by going to the top-right dropdown menu and clicking Account settings > Hadoop Authentication. Once you have authenticated successfully, Cloudera Data Science Workbench uses your stored keytab to authenticate you automatically whenever you run workloads.

After you authenticate with Kerberos, Cloudera Data Science Workbench stores your keytab. This keytab is then injected into any running engines so that you are automatically authenticated against the CDH cluster when using an engine. Type klist at the engine terminal to see your Kerberos principal. You should now be able to connect to Spark, Hive, and Impala without manually running kinit.
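The same check can be done from a Python session. The following is a minimal sketch, not part of Cloudera Data Science Workbench: it runs klist and extracts the principal, assuming the MIT Kerberos output format, which includes a line beginning with "Default principal:". The function names parse_principal and current_principal are hypothetical.

```python
import subprocess
from typing import Optional


def parse_principal(klist_output: str) -> Optional[str]:
    """Return the principal named in klist output, or None if there is none.

    Assumes the MIT Kerberos format, e.g. "Default principal: user@REALM".
    """
    for line in klist_output.splitlines():
        line = line.strip()
        if line.startswith("Default principal:"):
            # Split only on the first colon; the value follows it.
            return line.split(":", 1)[1].strip()
    return None


def current_principal() -> Optional[str]:
    """Run klist and return the principal, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["klist"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # klist is not installed in this environment.
        return None
    return parse_principal(result.stdout)


if __name__ == "__main__":
    print(current_principal())
```

If current_principal() returns None inside an engine on a kerberized cluster, revisit Account settings > Hadoop Authentication.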

If your cluster is not kerberized, your Hadoop username is set to your login username. To override this username you can set an alternative HADOOP_USER_NAME by going to Account settings > Hadoop Authentication.
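To illustrate the fallback order described above, here is a small sketch. It assumes the override reaches the engine as the HADOOP_USER_NAME environment variable, which is the variable Hadoop clients read on non-kerberized clusters. Whether Cloudera Data Science Workbench exposes it this way is an assumption, and effective_hadoop_user is a hypothetical helper.

```python
import getpass
import os


def effective_hadoop_user(env=None, login=None):
    """Return the Hadoop username a non-kerberized client would use.

    env   -- mapping of environment variables (defaults to os.environ)
    login -- login username to fall back on (defaults to the OS user)
    """
    env = os.environ if env is None else env
    # Assumption: an override, if set, appears as HADOOP_USER_NAME.
    override = env.get("HADOOP_USER_NAME", "").strip()
    if override:
        return override
    return login if login is not None else getpass.getuser()


if __name__ == "__main__":
    print(effective_hadoop_user())
```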

Create a Project from a Template

Cloudera Data Science Workbench is organized around projects. Projects hold all the code, configuration, and libraries needed to reproducibly run analyses. Each project is independent, ensuring users can work freely without interfering with one another or breaking existing workloads.

To get oriented in Cloudera Data Science Workbench, start by creating a template project in your programming language of choice. Using a template project is not required and does not limit you to a particular language, but it does contain example files to help you get started.

To create a Cloudera Data Science Workbench template project:
  1. Sign in to Cloudera Data Science Workbench.
  2. On the Project Lists page, click New Project.
  3. Enter a Project Name.
  4. In the Template tab, choose a programming language from the pop-up menu.
  5. Click Create Project.

After creating your project, you see your project files and the list of jobs defined in your project. These project files are stored on an internal NFS server, and are available to all your project sessions and jobs, regardless of the gateway nodes they run on. Any changes you make to the code or libraries you install into your project will be immediately available when running an engine.

Start Using the Workbench

The project workbench provides an interactive environment tailored for data science. It currently supports R, Python, and Scala engines. You can use these engines in isolation, as you would on your laptop, or connect to your CDH cluster using Cloudera Distribution of Apache Spark 2 and other libraries.

The workbench includes four primary components:

  • An editor where you can edit your scripts.
  • A console where you can track the results of your analysis.
  • A command prompt where you can enter commands interactively.
  • A terminal where you can use a Bash shell.

Launch a Session

To launch a session:

  1. Click Open Workbench in the project overview.
  2. Use Select Engine Kernel to choose your language.
  3. Use Select Engine Profile to select the number of CPU cores and memory.
  4. Click Launch Session.

The command prompt at the bottom right of your browser window turns green when the engine is ready. Sessions typically take between 10 and 20 seconds to start.

Execute Code

You can enter and execute code at the command prompt or in the editor. The editor is best for code you want to keep, while the command prompt is best for quick interactive exploration.

If you want to enter more than one line of code at the command prompt, use Shift-Enter to move to the next line. Press Enter to run your code. The output of your code, including plots, appears in the console.
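As an illustrative example for a Python session, the following multi-line snippet can be entered at the command prompt, using Shift-Enter after each line and Enter at the end. The function mean is a made-up example, not something shipped with the project templates.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


print(mean([2, 4, 6]))  # prints 4.0
```

The printed result appears in the console, just as plot output would.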

If you created your project from a template, there are code files in the editor. You can open a file in the editor by double-clicking the file name in the file list.

To run code in the editor:

  1. Select a code file in the list on the left.
  2. Highlight the code you want to run.
  3. Press Ctrl-Enter (Windows/Linux) or Command-Enter (OSX).

When doing real analysis, writing and executing your code from the editor rather than the command prompt makes it easy to iteratively develop your code and save it along the way.

If you require more space for your editor, you can collapse the file list by double-clicking between the file list pane and the editor pane. You can hide the editor using the editor's View menu.

Access the Terminal

Cloudera Data Science Workbench provides full terminal access to running engines from the web console. If you run klist, you should see your authenticated Kerberos principal. If you run hdfs dfs -ls, you will see the files stored in your HDFS home directory. You do not need to run kinit yourself, because the engine is already authenticated with your stored keytab.

Use the terminal to move files around, run Git commands, access the YARN and Hadoop CLIs, or install libraries that cannot be installed directly from the engine. You can access the Terminal from a running Session page by clicking the Terminal Access tab above the session log pane.

All of your project files are in /home/cdsw. Any modifications you make to this folder will persist across runs, while modifications to other folders are discarded.
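Because only /home/cdsw persists, output that you want to keep should be written under that folder. A minimal sketch follows, where the results subfolder and the save_result helper are hypothetical names chosen for illustration:

```python
from pathlib import Path

# Project files live here, according to the text above.
PROJECT_ROOT = Path("/home/cdsw")


def save_result(text, name, root=PROJECT_ROOT):
    """Write text to <root>/results/<name> and return the resulting path.

    The "results" subfolder is a hypothetical convention, created on demand.
    """
    out_dir = Path(root) / "results"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    path.write_text(text)
    return path
```

For example, save_result("accuracy=0.91", "run1.txt") writes to /home/cdsw/results/run1.txt, which remains available in later sessions. A file written to a location such as /tmp would be discarded.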

By default, the terminal does not provide root or sudo access to the container. To install packages that require root access, see Customizing Engine Images.

Stop a Session

When you are done with the session, click Stop in the menu bar above the console, or exit from code by typing the command for your engine's language:

R

quit()

Python

exit

Scala

quit()

Sessions automatically stop after an hour of inactivity.

Next Steps

Now that you have successfully run a sample workload with Cloudera Data Science Workbench, read the User, Administration, and Security guides to learn more about the types of users, how to collaborate on projects, how to use Spark 2 for advanced analytics, and how to secure your deployment.