Using Spark SQL

SQLContext and HiveContext

The entry point to all Spark SQL functionality is the SQLContext class or one of its descendants. You create a SQLContext from a SparkContext. With an SQLContext, you can create a DataFrame from an RDD, a Hive table, or a data source.

To work with data stored in Hive or Impala tables from Spark applications, construct a HiveContext, which inherits from SQLContext. With a HiveContext, you can access Hive or Impala tables represented in the metastore database.

If you use spark-shell, a HiveContext is already created for you and is available as the sqlContext variable.

If you use spark-submit, use code like the following at the start of the program:

Python:

from pyspark import SparkContext, HiveContext
sc = SparkContext(appName = "test")
sqlContext = HiveContext(sc)

The host from which the Spark application is submitted or on which spark-shell or pyspark runs must have a Hive gateway role defined in Cloudera Manager and client configurations deployed.

When a Spark job accesses a Hive view, Spark must have privileges to read the data files in the underlying Hive tables. Currently, Spark cannot use fine-grained privileges based on the columns or the WHERE clause in the view definition. If Spark does not have the required privileges on the underlying data files, a SparkSQL query against the view returns an empty result set, rather than an error.

Querying Files Into a DataFrame

If you have data files that are outside of a Hive or Impala table, you can use SQL to directly read JSON or Parquet files into a DataFrame:

  • JSON:
    df = sqlContext.sql("SELECT * FROM json.`input dir`")
  • Parquet:
    df = sqlContext.sql("SELECT * FROM parquet.`input dir`")

See Running SQL on Files.

Spark SQL Example

This example demonstrates how to use sqlContext.sql to create and load a table and select rows from the table into a DataFrame. The next steps use the DataFrame API to filter the rows for salaries greater than 150,000 and show the resulting DataFrame.
  1. At the command-line, copy the Hue sample_07 data to HDFS:
    $ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_07.csv /user/hdfs
    where HUE_HOME defaults to /opt/cloudera/parcels/CDH/lib/hue (parcel installation) or /usr/lib/hue (package installation).
  2. Start spark-shell:
    $ spark-shell
  3. Create a Hive table:
    scala> sqlContext.sql("CREATE TABLE sample_07 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
  4. Load data from HDFS into the table:
    scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_07.csv' OVERWRITE INTO TABLE sample_07")
  5. Create a DataFrame containing the contents of the sample_07 table:
    scala> val df = sqlContext.sql("SELECT * from sample_07")
  6. Show all rows with salary greater than 150,000:
    scala> df.filter(df("salary") > 150000).show()
    The output should be:
    +-------+--------------------+---------+------+
    |   code|         description|total_emp|salary|
    +-------+--------------------+---------+------+
    |11-1011|    Chief executives|   299160|151370|
    |29-1022|Oral and maxillof...|     5040|178440|
    |29-1023|       Orthodontists|     5350|185340|
    |29-1024|     Prosthodontists|      380|169360|
    |29-1061|   Anesthesiologists|    31030|192780|
    |29-1062|Family and genera...|   113250|153640|
    |29-1063| Internists, general|    46260|167270|
    |29-1064|Obstetricians and...|    21340|183600|
    |29-1067|            Surgeons|    50260|191410|
    |29-1069|Physicians and su...|   237400|155150|
    +-------+--------------------+---------+------+

Ensuring HiveContext Enforces Secure Access

To ensure that HiveContext enforces ACLs, enable the HDFS-Sentry plug-in as described in Synchronizing HDFS ACLs and Sentry Permissions. Column-level access control for access from Spark SQL is not supported by the HDFS-Sentry plug-in.

Interaction with Hive Views

When a Spark job accesses a Hive view, Spark must have privileges to read the data files in the underlying Hive tables. Currently, Spark cannot use fine-grained privileges based on the columns or the WHERE clause in the view definition. If Spark does not have the required privileges on the underlying data files, a SparkSQL query against the view returns an empty result set, rather than an error.