Spark and Hadoop Integration

This section describes how to access various Hadoop ecosystem components from Spark.

Accessing HBase from Spark

To configure Spark to interact with HBase, you can specify an HBase service as a Spark service dependency in Cloudera Manager:

1. In the Cloudera Manager admin console, go to the Spark service you want to configure.
2. Go to the Configuration tab.
3. Enter hbase in the Search box.
4. In the HBase Service property, select your HBase service.
5. Enter a Reason for change, and then click Save Changes to commit the changes.

You can use Spark to process data that is destined for HBase. See Importing Data Into HBase Using Spark.

You can also use Spark in conjunction with Apache Kafka to stream data from Spark to HBase. See Importing Data Into HBase Using Spark and Kafka.

The host from which the Spark application is submitted or on which spark-shell or pyspark runs must have an HBase gateway role defined in Cloudera Manager and client configurations deployed.
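
As a rough illustration of what reading an HBase table from pyspark on such a host can look like, the sketch below uses SparkContext.newAPIHadoopRDD with the HBase TableInputFormat class. The ZooKeeper quorum, the table name, and the two converter classes (which ship with the Spark 1.x examples JAR and must be on the classpath) are assumptions rather than part of this documentation, and the helper names are hypothetical. Adjust them for your cluster.

```python
def hbase_read_conf(zk_quorum, table):
    """Build the Hadoop configuration that TableInputFormat reads."""
    return {
        "hbase.zookeeper.quorum": zk_quorum,      # assumed ZooKeeper hosts
        "hbase.mapreduce.inputtable": table,      # HBase table to scan
    }


def read_hbase_table(sc, zk_quorum, table):
    """Return an RDD of (row key, value) string pairs for an HBase table.

    `sc` is an existing SparkContext, such as the one pyspark creates.
    The converter classes below are assumed to come from the Spark
    examples JAR; they are not part of core Spark.
    """
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=(
            "org.apache.spark.examples.pythonconverters."
            "ImmutableBytesWritableToStringConverter"
        ),
        valueConverter=(
            "org.apache.spark.examples.pythonconverters."
            "HBaseResultToStringConverter"
        ),
        conf=hbase_read_conf(zk_quorum, table),
    )
```

In a pyspark session this might be called as read_hbase_table(sc, "zk1.example.com", "test_table").take(5), where both arguments are placeholders.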

Limitation with Region Pruning for HBase Tables

When SparkSQL accesses an HBase table through the HiveContext, region pruning is not performed. As a result, some SparkSQL queries against tables that use the HBase SerDes can run more slowly than they would if the same table were accessed through Impala or Hive.

Accessing Hive from Spark

The host from which the Spark application is submitted or on which spark-shell or pyspark runs must have a Hive gateway role defined in Cloudera Manager and client configurations deployed.

When a Spark job accesses a Hive view, Spark must have privileges to read the data files in the underlying Hive tables. Currently, Spark cannot use fine-grained privileges based on the columns or the WHERE clause in the view definition. If Spark does not have the required privileges on the underlying data files, a SparkSQL query against the view returns an empty result set instead of an error.
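
Because a missing privilege on the data files shows up as an empty result rather than an error, it can help to check for empty results explicitly. The following pyspark sketch assumes a Spark 1.x HiveContext. The view name and both helper functions are hypothetical and are not part of any Spark API.

```python
def empty_view_warning(view_name, row_count):
    """Return a warning string when a view query yields no rows, else None."""
    if row_count == 0:
        return (
            "Query against view %s returned no rows; verify that the Spark "
            "user can read the data files of the underlying Hive tables."
            % view_name
        )
    return None


def query_view(sc, view_name, limit=10):
    """Query a Hive view through HiveContext and flag empty results.

    `sc` is an existing SparkContext. `view_name` is interpolated into
    the SQL text as-is, so pass only trusted identifiers.
    """
    # Imported lazily so the helper above is usable without Spark installed.
    from pyspark.sql import HiveContext

    hive_ctx = HiveContext(sc)
    query = "SELECT * FROM %s LIMIT %d" % (view_name, limit)
    rows = hive_ctx.sql(query).collect()

    warning = empty_view_warning(view_name, len(rows))
    if warning:
        print(warning)
    return rows
```

A view that is legitimately empty triggers the same warning, so treat the message as a prompt to check file permissions, not as proof of a privilege problem.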

Running Spark Jobs from Oozie

You can invoke Spark jobs from Oozie using the Spark action. For information on the Spark action, see Oozie Spark Action Extension.

