Spark 2 Known Issues

The following sections describe the current known issues and limitations in Cloudera Distribution of Apache Spark 2. In some cases, a feature from the upstream Apache Spark project is not yet considered reliable enough to be supported by Cloudera. In other cases, a CDH integration feature that relies on Spark does not work with Cloudera Distribution of Apache Spark 2 because CDH components do not introduce dependencies on Spark 2.

Empty result when reading Parquet table created by saveAsTable()

After a Parquet table is created by the saveAsTable() function, Spark SQL queries against the table return an empty result set. The issue is caused by the "path" property of the table not being written to the Hive metastore during the saveAsTable() call.

Bug: SPARK-21994

Affects: Cloudera Distribution of Apache Spark 2.1 release 2, Cloudera Distribution of Apache Spark 2.2 release 1

Severity: High

Workaround: Set the path explicitly as a write option when you call saveAsTable():

// Pass the table location explicitly so that it is recorded with the table.
val options = Map("path" -> "/path/to/hdfs/directory/containing/table")
df.write.options(options).saveAsTable("db_name.table_name")

Alternatively, if the table already exists, add the path to the metastore and then refresh the table, for example:

spark.sql("alter table db_name.table_name set SERDEPROPERTIES ('path'='hdfs://host.example.com:8020/warehouse/path/db_name.db/table_name')")
spark.catalog.refreshTable("db_name.table_name")

Spark 2 Version Requirement for Clusters Managed by Cloudera Manager

All CDH clusters managed by a single Cloudera Manager instance must use exactly the same Spark 2 version. Make sure to install or upgrade the CSDs and parcels across all machines of all clusters at the same time.

Spark Standalone

Spark Standalone is not supported for Spark 2.

Spark2 On HBase is not Supported

Spark On HBase is a CDH component that has a dependency on Spark 1.6. Because CDH components do not have any dependencies on Spark 2, Spark On HBase does not work with Spark 2.

Dynamic allocation and Spark Streaming

For Spark Streaming applications, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false.
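As a minimal sketch, assuming the streaming application builds its own SparkConf, the property could be set in Scala as shown here. The application name is a hypothetical placeholder and is not part of the recommendation.

```scala
import org.apache.spark.SparkConf

// Hypothetical application name, for illustration only.
val conf = new SparkConf()
  .setAppName("streaming-example")
  // Disable dynamic allocation for the streaming application.
  .set("spark.dynamicAllocation.enabled", "false")

// Read the value back; getBoolean falls back to the supplied default if the key is unset.
val dynamicAllocationEnabled = conf.getBoolean("spark.dynamicAllocation.enabled", true)
```

The resulting conf can then be passed to the StreamingContext constructor. The same property can also be supplied at submit time with --conf spark.dynamicAllocation.enabled=false on the spark-submit command line.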

Structured Streaming is not Supported

Cloudera does not support the Structured Streaming API.

Oozie Spark2 Action is not Supported

The Oozie Spark action is a CDH component that has a dependency on Spark 1.6. Because CDH components do not have any dependencies on Spark 2, the Oozie Spark action does not work with Spark 2.

SparkR is not Supported

SparkR is not supported for Spark 2. (SparkR is also not supported in CDH with Spark 1.6.)

GraphX is not Supported

GraphX is not supported for Spark 2. (GraphX is also not supported in CDH with Spark 1.6.)

Thrift Server

The Thrift JDBC/ODBC server is not supported for Spark 2. (The Thrift server is also not supported in CDH with Spark 1.6.)

Spark SQL CLI is not Supported

The Spark SQL CLI is not supported for Spark 2. (The Spark SQL CLI is also not supported in CDH with Spark 1.6.)

Rolling Upgrades are not Supported

Rolling upgrades from the Spark 1.6 bundled with CDH to Cloudera Distribution of Apache Spark 2 are not possible.

Package Install is not Supported

Cloudera Distribution of Apache Spark 2 can be installed only as a parcel. Package-based installation is not supported.

Hardware Acceleration for MLlib is not Supported

Hardware acceleration for MLlib is part of the GPL Extras package for CDH and is not supported with Cloudera Distribution of Apache Spark 2.

Cost Based Optimization is not Supported

The Cost Based Optimization feature is not supported in Spark 2.2. Do NOT set the spark.sql.cbo.enabled configuration option to true.
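One way to guard against the option being turned on elsewhere, for example in a shared configuration file, is to pin it to false explicitly. The following is a minimal sketch, assuming the application builds its own SparkConf; the application name is a hypothetical placeholder.

```scala
import org.apache.spark.SparkConf

// Hypothetical application name, for illustration only.
val conf = new SparkConf()
  .setAppName("cbo-disabled-example")
  // Keep Cost Based Optimization off, as required for Spark 2.2.
  .set("spark.sql.cbo.enabled", "false")

// Fail fast if the option has somehow been set to true.
require(!conf.getBoolean("spark.sql.cbo.enabled", false),
  "spark.sql.cbo.enabled must not be set to true")
```

The conf can then be handed to SparkSession.builder().config(conf) when the session is created.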