Hands-on Tour of Apache Spark in 5 Minutes

NOTICE

As of January 31, 2021, this tutorial references legacy products that no longer represent Cloudera's current product offerings.

Please visit Cloudera's recommended tutorials instead.

Introduction

In this tutorial, we will provide an overview of Apache Spark and its relationship with Scala, Zeppelin notebooks, interpreters, Datasets, and DataFrames. We will use an Apache Zeppelin notebook as our development environment to keep things simple and elegant.

Zeppelin allows us to work in a pre-configured environment and execute Spark code written in Scala and SQL, a few basic shell commands, pre-written Markdown directions, and an HTML-formatted table.

To make things fun and interesting, we will introduce a dataset of episodes from the Silicon Valley comedy TV show and perform some basic operations on it with Spark in Zeppelin.

Prerequisites

Outline

Concepts

Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow developers to execute a variety of data intensive workloads.


Spark Datasets are strongly typed distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python.
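
For instance, here is a minimal sketch of creating one from a JSON file in a Zeppelin paragraph (the file path is hypothetical; `spark` is the SparkSession that Zeppelin provides):

```scala
%spark2
// Minimal sketch (hypothetical path): read a JSON file into a Dataset of
// rows (a DataFrame, i.e. Dataset[Row]) and inspect its inferred schema
val episodes = spark.read.json("/tmp/silicon-valley-episodes.json")
episodes.printSchema()
episodes.show(3)
```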

New to Scala?

Throughout this tutorial we will use basic Scala syntax.

To learn more about Scala, here’s an excellent introductory tutorial.

New to Zeppelin?

If you haven’t already, check out the Hortonworks Apache Zeppelin page as well as the Getting Started with Apache Zeppelin tutorial. You will find the official Apache Zeppelin page here.

New to Spark?

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

If you would like to learn more about Apache Spark, visit:

What are Interpreters?

Zeppelin notebooks support various interpreters that allow you to perform many operations on your data. Below are just a few of the operations you can perform with Zeppelin interpreters:

  • Ingestion
  • Munging
  • Wrangling
  • Visualization
  • Analysis
  • Processing

These are some of the interpreters that will be utilized throughout our various Spark tutorials.

| Interpreter | Description |
| --- | --- |
| `%spark2` | Spark interpreter to run Spark 2.x code written in Scala |
| `%spark2.sql` | Spark SQL interpreter to execute SQL queries against temporary tables in Spark |
| `%sh` | Shell interpreter to run shell commands, such as moving files |
| `%angular` | Angular interpreter to run Angular and HTML code |
| `%md` | Markdown interpreter for displaying formatted text, links, and images |

Note the % at the beginning of each interpreter name. Each paragraph must start with % followed by the name of the interpreter you want to use.
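
For example, a paragraph using the Spark 2.x interpreter might look like this minimal sketch (`spark` is the SparkSession that Zeppelin pre-configures):

```scala
%spark2
// The first line selects the interpreter; the rest of the paragraph is Scala
val nums = spark.range(1, 6)   // a small Dataset holding the numbers 1 through 5
nums.show()
```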

Learn more about Zeppelin interpreters.

What are Datasets and DataFrames?

Datasets and DataFrames are distributed collections of data created from a variety of sources: JSON and XML files, tables in Hive, external databases and more. Conceptually, they are equivalent to a table in a relational database or a DataFrame in R or Python. The key difference between a Dataset and a DataFrame is that Datasets are strongly typed.
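
To make the difference concrete, here is a minimal sketch (the `Episode` case class and the sample rows are hypothetical):

```scala
%spark2
import spark.implicits._

// Hypothetical case class describing one episode
case class Episode(title: String, season: Int)

// A DataFrame is untyped: its rows are generic Row objects
val df = Seq(("Minimum Viable Product", 1), ("The Cap Table", 1)).toDF("title", "season")

// A Dataset is typed: the compiler checks field names and types
val ds = df.as[Episode]
ds.filter(_.season == 1).show()
```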

More complex manipulations are possible on Datasets and DataFrames, but they are beyond the scope of this quick guide.

Learn more about Datasets and DataFrames.

Apache Spark in 5 Minutes Notebook Overview


We will download and ingest an external dataset about the Silicon Valley Show episodes into a Spark Dataset and perform basic analysis, filtering, and word count.

After a series of transformations applied to the Dataset, we will define a temporary view (table) like the one below.

[Image: DataFrame contents table]
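
A minimal sketch of that step, assuming the dataset has been downloaded to a hypothetical local path:

```scala
%spark2
// Read the episode data (hypothetical path) and register it as a
// temporary view so it can be queried with SQL
val episodes = spark.read.json("/tmp/silicon-valley-episodes.json")
episodes.createOrReplaceTempView("episodes")
```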

You will be able to explore those tables via SQL queries like the ones below.

[Image: complex SQL query results]
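
A query in that spirit might look like the following sketch (the `season` column is an assumption about the dataset's schema):

```scala
%spark2
// Hypothetical query: count episodes per season in the temporary view
spark.sql("""
  SELECT season, COUNT(*) AS episode_count
  FROM episodes
  GROUP BY season
  ORDER BY season
""").show()
```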

Once you have a handle on the data and have performed a basic word count, we will add a few more steps for a more sophisticated word count analysis, like the one below.

[Image: improved word count sample]
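
A basic version of the word count could look like this sketch (it assumes the `episodes` view from above and a hypothetical `summary` column of episode descriptions):

```scala
%spark2
import org.apache.spark.sql.functions.{col, explode, lower, split}

// Split each (hypothetical) summary into words, then count occurrences
val words = spark.table("episodes")
  .select(explode(split(lower(col("summary")), "\\s+")).as("word"))
  .filter(col("word") =!= "")

words.groupBy("word").count().orderBy(col("count").desc).show(10)
```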

By the end of this tutorial, you should have a basic understanding of Spark and an appreciation for its powerful and expressive APIs with the added bonus of a developer friendly Zeppelin notebook environment.

Import the Apache Spark in 5 Minutes Notebook

Import the Apache Spark in 5 Minutes notebook into your Zeppelin environment. (If at any point you have any issues, check out the Getting Started with Apache Zeppelin tutorial.)

To import the notebook, go to the Zeppelin home screen.

1. Click Import note

2. Select Add from URL

3. Copy and paste the following URL into the Note URL field:

# Getting Started: Apache Spark in 5 Minutes Notebook
https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/hands-on-tour-of-apache-spark-in-5-minutes/assets/Getting%20Started%20_%20Apache%20Spark%20in%205%20Minutes.json
4. Click on Import Note

Once your notebook is imported, you can open it from the Zeppelin home screen:

5. Click Getting Started

6. Select Apache Spark in 5 Minutes

Once the Apache Spark in 5 Minutes notebook is up, follow all the directions within the notebook to complete the tutorial.

Summary

We hope that you've been able to successfully run this short introductory notebook, and that it has made you interested and excited enough to explore Spark with Zeppelin further.

Further Reading
