Accessing External Storage from Spark

Spark can access all storage sources supported by Hadoop, including a local file system, HDFS, HBase, and Amazon S3.

Spark supports many file formats, including text files, RCFile, SequenceFile, Hadoop InputFormat, Avro, and Parquet. Files in any of the supported formats can also be compressed.

For developer information about working with external storage, see External Storage in the Spark Programming Guide.

Accessing Compressed Files

You can read compressed files using one of the following methods:
  • textFile(path)
  • hadoopFile(path, inputFormatClass)
You can save compressed files using one of the following methods:
  • saveAsTextFile(path, compressionCodecClass="codec_class")
  • saveAsHadoopFile(path, outputFormatClass, compressionCodecClass="codec_class")
where codec_class is one of the classes in Compression Types.
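
The methods above can be sketched in PySpark as follows. This is a minimal illustration, not a reference implementation: the helper names (save_compressed, read_compressed, read_with_hadoop_file, demo), the gzip codec choice, and the temporary output directory are assumptions made for the example. The hadoopFile call also passes key and value classes, which the abbreviated signature in the list above omits.

```python
import os
import tempfile

# Fully qualified Hadoop codec class name (gzip is chosen here as an example).
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"


def save_compressed(rdd, path, codec=GZIP_CODEC):
    """Write an RDD of strings as compressed text part files under `path`."""
    rdd.saveAsTextFile(path, compressionCodecClass=codec)


def read_compressed(sc, path):
    """Read compressed text; the codec is chosen from the file extension."""
    return sc.textFile(path)


def read_with_hadoop_file(sc, path):
    """Read the same files through hadoopFile, naming the input format
    and the key/value classes explicitly, and keep only the line text."""
    pairs = sc.hadoopFile(
        path,
        "org.apache.hadoop.mapred.TextInputFormat",  # inputFormatClass
        "org.apache.hadoop.io.LongWritable",         # keyClass (byte offset)
        "org.apache.hadoop.io.Text",                 # valueClass (line text)
    )
    return pairs.values()


def demo():
    """Hypothetical end-to-end run; requires a local PySpark installation."""
    from pyspark import SparkContext

    sc = SparkContext("local[1]", "compressed-files-sketch")
    try:
        # saveAsTextFile needs a directory that does not exist yet.
        out_dir = os.path.join(tempfile.mkdtemp(), "lines_gz")
        save_compressed(sc.parallelize(["alpha", "beta", "gamma"]), out_dir)
        print(sorted(read_compressed(sc, out_dir).collect()))
        print(sorted(read_with_hadoop_file(sc, out_dir).collect()))
    finally:
        sc.stop()
```

Calling demo() writes three lines as gzip-compressed part files and reads them back with both read methods. To use a different codec, pass another class from Compression Types as the codec argument.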

For examples of accessing Avro and Parquet files, see Spark with Avro and Parquet.