Using Spark SQL

Spark SQL lets you query structured data inside Spark programs using either SQL or the DataFrame API.

The entry point to all Spark SQL functionality is the SQLContext class or one of its descendants. You create a SQLContext from a SparkContext. With an SQLContext, you can create a DataFrame from an RDD, a Hive table, or a data source.

To work with data stored in Hive from Spark applications, construct a HiveContext, which inherits from SQLContext. With a HiveContext, you can access tables in the Hive Metastore and write queries using HiveQL. If you use spark-shell, a HiveContext is already created for you and is available as the sqlContext variable. To access Hive tables you must also perform the steps in Accessing Hive from Spark.

For detailed information on Spark SQL, see the Spark SQL and DataFrame Guide.

Spark SQL Example

This example demonstrates how to use sqlContext.sql to create and load a table and select rows from the table into a DataFrame. The next steps use the DataFrame API to filter the rows for salaries greater than 150,000 and show the resulting DataFrame.
  1. At the command-line, copy the Hue sample_07 data to HDFS:
    $ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_07.csv /user/hdfs
    where HUE_HOME defaults to /opt/cloudera/parcels/CDH/lib/hue (parcel installation) or /usr/lib/hue (package installation).
  2. Start spark-shell:
    $ spark-shell
  3. Create a Hive table:
    scala> sqlContext.sql("CREATE TABLE sample_07 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
  4. Load data from HDFS into the table:
    scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_07.csv' OVERWRITE INTO TABLE sample_07")
  5. Create a DataFrame containing the contents of the sample_07 table:
    scala> val df = sqlContext.sql("SELECT * from sample_07")
  6. Show all rows with salary greater than 150,000:
    scala> df.filter(df("salary") > 150000).show()
    The output should be:
    |   code|         description|total_emp|salary|
    |11-1011|    Chief executives|   299160|151370|
    |29-1022|Oral and maxillof...|     5040|178440|
    |29-1023|       Orthodontists|     5350|185340|
    |29-1024|     Prosthodontists|      380|169360|
    |29-1061|   Anesthesiologists|    31030|192780|
    |29-1062|Family and genera...|   113250|153640|
    |29-1063| Internists, general|    46260|167270|
    |29-1064|Obstetricians and...|    21340|183600|
    |29-1067|            Surgeons|    50260|191410|
    |29-1069|Physicians and su...|   237400|155150|

Ensuring HiveContext Enforces Secure Access

To ensure that HiveContext enforces ACLs, enable the HDFS-Sentry plug-in as described in Synchronizing HDFS ACLs and Sentry Permissions. Column-level access control for access from Spark SQL is not supported by the HDFS-Sentry plug-in.

Interaction with Hive Views

When a Spark job accesses a Hive view, Spark must have privileges to read the data files in the underlying Hive tables. Currently, Spark cannot use fine-grained privileges based on the columns or the WHERE clause in the view definition. If Spark does not the required privileges on the underlying data files, a SparkSQL query against the view returns an empty result set, rather than an error.