As of January 31, 2021, this tutorial references legacy products that no longer represent Cloudera’s current product offerings.
Please visit recommended tutorials:
- How to Create a CDP Private Cloud Base Development Cluster
- All Cloudera Data Platform (CDP) related tutorials
In this tutorial, we will provide an overview of Apache Spark, it's relationship with Scala, Zeppelin notebooks, Interpreters, Datasets and DataFrames. Finally, we will showcase Apache Zeppelin notebook for our development environment to keep things simple and elegant.
Zeppelin will allow us to run in a pre-configured environment and execute code written for Spark in Scala and SQL, a few basic Shell commands, pre-written Markdown directions, and an HTML formatted table.
To make things fun and interesting, we will introduce a film series dataset from the Silicon Valley Comedy TV show and perform some basic operations with Spark in Zeppelin.
Downloaded and deployed the Hortonworks Data Platform (HDP) Sandbox
- You will need 10GB of memory dedicated for the virtual machine, meaning that you should have at least 16GB of memory on your system.
Basic Scala syntax
Binded Zeppelin's Shell Interpreter
- Apache Spark in 5 Minutes Notebook Overview
- Import the Apache Spark in 5 Minutes Notebook
- Further Reading
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow developers to execute a variety of data intensive workloads.
Spark Datasets are strongly typed distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python.
Throughout this tutorial we will use basic Scala syntax.
Learn more about Scala, here’s an excellent introductory tutorial.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.
If you would like to learn more about Apache Spark visit:
Zeppelin Notebooks supports various interpreters which allow you to perform many operations on your data. Below are just a few of operations you can do with Zeppelin interpreters:
These are some of the interpreters that will be utilized throughout our various Spark tutorials.
|%spark2||Spark interpreter to run Spark 2.x code written in Scala|
|%spark2.sql||Spark SQL interpreter (to execute SQL queries against temporary tables in Spark)|
|%sh||Shell interpreter to run shell commands like move files|
|%angular||Angular interpreter to run Angular and HTML code|
|%md||Markdown for displaying formatted text, links, and images|
Note the % at the beginning of each interpreter. Each paragraph needs to start with % followed by the interpreter name.
Datasets and DataFrames are distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python. Key difference between the Dataset and the DataFrame is that Datasets are strongly typed.
There are complex manipulations possible on Datasets and DataFrames, however they are beyond this quick guide.
We will download and ingest an external dataset about the Silicon Valley Show episodes into a Spark Dataset and perform basic analysis, filtering, and word count.
After a series of transformations, applied to the Datasets, we will define a temporary view (table) such as the one below.
You will be able to explore those tables via SQL queries likes the ones below.
Once you have a handle on the data and perform a basic word count, we will add a few more steps for a more sophisticated word count analysis like the one below.
By the end of this tutorial, you should have a basic understanding of Spark and an appreciation for its powerful and expressive APIs with the added bonus of a developer friendly Zeppelin notebook environment.
Import the Apache Spark in 5 Minutes notebook into your Zeppelin environment. (If at any point you have any issues, make sure to checkout the Getting Started with Apache Zeppelin tutorial).
To import the notebook, go to the Zeppelin home screen.
1. Click Import note
2. Select Add from URL
3. Copy and paste the following URL into the Note URL
# Getting Started ApacheSpark in 5 Minutes Notebook https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/hands-on-tour-of-apache-spark-in-5-minutes/assets/Getting%20Started%20_%20Apache%20Spark%20in%205%20Minutes.json
4. Click on Import Note
Once your notebook is imported, you can open it from the Zeppelin home screen by:
5. Clicking Getting Started
6. Select Apache Spark in 5 Minutes
Once the Apache Spark in 5 Minutes notebook is up, follow all the directions within the notebook to complete the tutorial.
We hope that you've been able to successfully run this short introductory notebook and we've got you interested and excited enough to further explore Spark with Zeppelin.
- Hortonworks Apache Spark Tutorials are your natural next step where you can explore Spark in more depth.
- Hortonworks Community Connection (HCC) is a great resource for questions and answers on Spark, Data Analytics/Science, and many more Big Data topics.
- Hortonworks Apache Spark Docs - official Spark documentation.
- Hortonworks Apache Zeppelin Docs - official Zeppelin documentation.