Accessing Parquet Files From Spark SQL Applications

Spark SQL supports loading and saving DataFrames from and to a variety of data sources and has native support for Parquet. For information about Parquet, see Using Apache Parquet Data Files with CDH.

To read Parquet files in Spark SQL, use the SQLContext.read.parquet("path") method.

To write Parquet files in Spark SQL, use the DataFrame.write.parquet("path") method.
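
As a rough illustration of the two calls used together, the following PySpark sketch writes a DataFrame to Parquet and reads it back. It assumes an already-created SQLContext is passed in; the function name parquet_round_trip, the sample rows, and the /tmp path in the usage comment are hypothetical and not part of the original documentation.

```python
def parquet_round_trip(sql_context, rows, columns, path):
    """Write `rows` as Parquet under `path`, then load them back.

    `sql_context` is assumed to be an existing pyspark.sql.SQLContext.
    """
    # Build a DataFrame from plain Python tuples and a list of column names.
    df = sql_context.createDataFrame(rows, columns)
    # DataFrame.write.parquet writes a directory of Parquet part files.
    # With the default save mode, the call fails if `path` already exists.
    df.write.parquet(path)
    # SQLContext.read.parquet returns a new DataFrame backed by those files.
    return sql_context.read.parquet(path)


# Example call (hypothetical data and path), given a live `sqlContext`:
# people = parquet_round_trip(
#     sqlContext, [(1, "alice"), (2, "bob")], ["id", "name"], "/tmp/people.parquet")
# people.show()
```

Passing the context in as an argument, rather than creating it inside the function, keeps the sketch independent of how the application configures Spark.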

To set the compression type, configure the spark.sql.parquet.compression.codec property:

sqlContext.setConf("spark.sql.parquet.compression.codec","codec")

The supported codec values are uncompressed, gzip, lzo, and snappy. The default is gzip.
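
One possible way to guard against typos in the codec name is to check it against the values listed above before calling setConf. The sketch below is illustrative only: the helper set_parquet_codec and the SUPPORTED_CODECS tuple are hypothetical names, and the function assumes it is given an existing SQLContext.

```python
# Codec names as listed in this section.
SUPPORTED_CODECS = ("uncompressed", "gzip", "lzo", "snappy")


def set_parquet_codec(sql_context, codec):
    """Validate `codec`, then set it as the Parquet compression codec.

    `sql_context` is assumed to be an existing pyspark.sql.SQLContext.
    Returns the normalized (lowercase) codec name that was applied.
    """
    name = codec.lower()
    if name not in SUPPORTED_CODECS:
        raise ValueError(
            "unsupported Parquet codec %r; expected one of %s"
            % (codec, ", ".join(SUPPORTED_CODECS))
        )
    # Same call as shown above, with the placeholder replaced by a real value.
    sql_context.setConf("spark.sql.parquet.compression.codec", name)
    return name


# Example call, given a live `sqlContext`:
# set_parquet_codec(sqlContext, "snappy")
```

The setting applies to Parquet files written after the call, so it would normally be made before DataFrame.write.parquet.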

Currently, Spark looks up column data in Parquet files by the column names stored within the data files. This differs from the default Parquet lookup behavior of Impala and Hive. If data files are produced with a different physical layout because columns were added or reordered, Spark still decodes the column data correctly. However, if the logical layout of the table is changed in the metastore database, for example through an ALTER TABLE CHANGE statement that renames a column, Spark still looks for the data using the now-nonexistent column name and returns NULLs when it cannot locate the column values.

To avoid behavior differences between Spark and Impala or Hive when modifying Parquet tables, either avoid renaming columns, or use Impala, Hive, or a CREATE TABLE AS SELECT statement to produce a new table and a new set of Parquet files whose embedded column names match the new layout.
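
The difference between name-based and position-based lookup can be pictured with a toy model in plain Python. This is not Spark's implementation. The dictionaries stand in for the columns embedded in a data file, None stands in for NULL, and the function names read_by_name and read_by_position, as well as the sample data, are invented for illustration.

```python
def read_by_name(file_columns, table_schema):
    """Resolve each table column by the name embedded in the data file.

    Columns the file does not contain come back as None (a stand-in for NULL).
    """
    return {name: file_columns.get(name) for name in table_schema}


def read_by_position(file_columns, table_schema):
    """Resolve each table column by its ordinal position in the data file."""
    stored = list(file_columns.values())  # dicts keep insertion order (Python 3.7+)
    return {
        name: stored[i] if i < len(stored) else None
        for i, name in enumerate(table_schema)
    }


# A data file written with its columns in a different order than the table.
reordered_file = {"name": ["alice", "bob"], "id": [1, 2]}
table_schema = ["id", "name"]
by_name = read_by_name(reordered_file, table_schema)
by_position = read_by_position(reordered_file, table_schema)
# by_name pairs each column with its own values; by_position hands the
# name values to "id" and the id values to "name".

# The table column "name" is renamed to "full_name"; the file is unchanged.
original_file = {"id": [1, 2], "name": ["alice", "bob"]}
renamed_schema = ["id", "full_name"]
renamed_by_name = read_by_name(original_file, renamed_schema)
# renamed_by_name["full_name"] is None because no file column has that name.
```

In this model, name-based lookup tolerates reordered columns but loses a renamed column, while position-based lookup does the opposite. That trade-off is why rewriting the data files after a rename brings the two approaches back into agreement.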

For an example of writing Parquet files to Amazon S3, see Reading and Writing Data Sources From and To Amazon S3.
