Using the Workbench
The workbench console provides an interactive environment tailored for data science. It currently supports R, Python, and Scala engines. You can use these engines in isolation, as you would on your laptop, or connect to your CDH cluster using Cloudera Distribution of Apache Spark 2 and other libraries.
The workbench UI includes four primary components:
- An editor where you can edit your scripts.
- A console where you can track the results of your analysis.
- A command prompt where you can enter commands interactively.
- A terminal where you can use a Bash shell.
Launch a Session
- Navigate to your project's Overview page.
- Click Open Workbench.
Launch a New Session
- Use Select Engine Kernel to choose the programming language that your project uses.
- Use Select Engine Profile to select the number of CPU cores and memory to be used.
- Click Launch Session.
The command prompt at the bottom right of your browser window will turn green when the engine is ready. Sessions typically take between 10 and 20 seconds to start.
You can enter and execute code at the command prompt or the editor. The editor is best for code you want to keep, while the command prompt is best for quick interactive exploration.
Command Prompt - The command prompt functions largely like any other. Enter a command and press Enter to execute it. If you want to enter more than one line of code, use Shift+Enter to move to the next line. The output of your code, including plots, appears in the console.
If you created your project from a template, you should see project files in the editor. You can open a file in the editor by clicking the file name in the file navigation bar on the left.
Editor - To run code from the editor:
- Select a script from the project files on the left sidebar.
- To run the whole script, click the run button on the top navigation bar. To run only part of it, highlight the code you want to run and press Ctrl+Enter (Windows/Linux) or Cmd+Enter (macOS).
When doing real analysis, writing and executing your code from the editor rather than the command prompt makes it easy to iteratively develop your code and save it along the way.
If you require more space for your editor, you can collapse the file list by double-clicking between the file list pane and the editor pane. You can hide the editor using the editor's View menu.
The Python and R kernels include support for automatic code completion, both in the editor and the command prompt. Press Tab once to display suggestions and twice to autocomplete.
Project Code Files
All project files are saved to persistent storage within the respective project directory at /var/lib/cdsw/current/projects. You can access them within the project just as you would in a typical directory structure. For example, you can import functions from one file to another within the same project.
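As a minimal sketch of importing between project files, the Python example below creates a stand-in project folder with one helper file and imports a function from it. The file name cdsw_example_helpers.py and the function fahrenheit_to_celsius are hypothetical, chosen only for illustration. In a real session the helper file would already exist in your project, so only the final import would be needed.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for a project directory (assumption: a temporary folder is used
# here so the sketch runs anywhere; a real project already has its own folder).
project_dir = Path(tempfile.mkdtemp())

# Hypothetical helper file containing a single function.
(project_dir / "cdsw_example_helpers.py").write_text(
    "def fahrenheit_to_celsius(f):\n"
    "    return (f - 32) * 5 / 9\n"
)

# Put the stand-in directory on the import path so the helper file can be
# imported by name, the same way a sibling file in a project would be.
sys.path.insert(0, str(project_dir))
importlib.invalidate_caches()

from cdsw_example_helpers import fahrenheit_to_celsius

print(fahrenheit_to_celsius(212))  # prints 100.0
```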
Access the Terminal
Cloudera Data Science Workbench provides full terminal access to running engines from the web console. You can use the terminal to move files around, run Git commands, access the YARN and Hadoop CLIs, or install libraries that cannot be installed directly from the engine. To access the Terminal from a running session, click Terminal Access above the session log pane.
The terminal's default working directory is /home/cdsw, which is where all your project files are stored. Any modifications you make to this folder will persist across runs, while modifications to other folders are discarded.
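The persistence rule above can be expressed as a small check. The Python sketch below is illustrative only: the function name persists_across_runs is hypothetical, and the check is purely lexical (it does not follow symbolic links or touch the file system). It reports whether a path falls inside /home/cdsw, treating relative paths as relative to the terminal's default working directory.

```python
import posixpath
from pathlib import PurePosixPath

# Persistent project directory, as described in the text above.
PROJECT_ROOT = PurePosixPath("/home/cdsw")


def persists_across_runs(path, cwd=PROJECT_ROOT):
    """Return True if `path` lies inside the persistent project directory.

    Relative paths are interpreted against `cwd`, which defaults to the
    terminal's default working directory. Lexical check only.
    """
    joined = posixpath.join(str(cwd), str(path))
    absolute = PurePosixPath(posixpath.normpath(joined))
    return absolute == PROJECT_ROOT or PROJECT_ROOT in absolute.parents


for candidate in ("data/out.csv", "/home/cdsw/models/m.pkl", "/tmp/scratch.txt"):
    print(candidate, "->", persists_across_runs(candidate))
```

Anything the check reports as False, such as a file written under /tmp, should be treated as scratch space that is discarded with the session.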
If you are using Kerberos authentication, you can run klist to see your Kerberos principal. If you run hdfs dfs -ls you will see the files stored in your HDFS home directory.
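The same two commands can also be run from a Python session instead of the terminal. The sketch below uses only the standard library. The helper name run_if_available is hypothetical, and the helper returns None when a command is not installed so that the example can be tried on a machine without Kerberos or Hadoop clients.

```python
import shutil
import subprocess


def run_if_available(command, timeout=30):
    """Run `command` (a list of arguments) and return its combined text output.

    Returns None when the executable is not on PATH.
    """
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "(command timed out)"
    return result.stdout + result.stderr


# klist shows the Kerberos principal; hdfs dfs -ls lists the HDFS home directory.
for command in (["klist"], ["hdfs", "dfs", "-ls"]):
    output = run_if_available(command)
    print("$", " ".join(command))
    print(output if output is not None else "(command not found on this machine)")
```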
Note that the terminal does not provide root or sudo access to the container. To install packages that require root access, see Customizing Engine Images.
Stop a Session
When you are done with the session, click Stop in the menu bar above the console, or exit from code by entering the engine's exit command at the command prompt, such as quit() in R or exit in Python.
Sessions automatically stop after an hour of inactivity.