CDS Powered by Apache Spark Known Issues

Spark 2 Version Requirement for Clusters Managed by Cloudera Manager

All CDH clusters managed by a single Cloudera Manager instance must use exactly the same version of CDS Powered by Apache Spark. Install or upgrade the CSDs and parcels on all machines of all clusters at the same time.
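
As a rough illustration of the requirement above, the following sketch checks a hand-maintained inventory for clusters that report different CDS versions. The function name, the inventory layout, and the cluster names and version strings are all hypothetical; the sketch does not call any Cloudera Manager API.

```python
from collections import defaultdict


def find_version_mismatches(cluster_versions):
    """Group cluster names by reported CDS version.

    Returns an empty dict when every cluster reports the same version,
    otherwise a dict mapping each version to the sorted clusters using it.
    """
    by_version = defaultdict(list)
    for cluster, version in cluster_versions.items():
        by_version[version].append(cluster)
    if len(by_version) <= 1:
        return {}
    return {version: sorted(names) for version, names in by_version.items()}


# Hypothetical inventory, written by hand for this example.
inventory = {
    "cluster-a": "2.2.0.cloudera1",
    "cluster-b": "2.2.0.cloudera1",
    "cluster-c": "2.1.0.cloudera2",
}

mismatches = find_version_mismatches(inventory)
if mismatches:
    print("Clusters are not on the same CDS version:")
    for version, names in sorted(mismatches.items()):
        print(f"  {version}: {', '.join(names)}")
```

An empty result means all listed clusters agree; any other result lists the clusters to bring onto a common version before they are managed by the same Cloudera Manager instance.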

Spark Standalone

Spark Standalone is not supported for Spark 2.

HiveOnSpark is not Supported with Spark 2

The HiveOnSpark module is a CDH 5 component that has a dependency on Apache Spark 1.6. Because CDH 5 components do not have any dependencies on Spark 2, the HiveOnSpark module does not work with CDS Powered by Apache Spark. You can still use Spark 2 with Hive using other methods.

SparkOnHBase is not Supported with Spark 2

The SparkOnHBase module is a CDH 5 component that has a dependency on Apache Spark 1.6. Because CDH 5 components do not have any dependencies on Spark 2, the SparkOnHBase module does not work with CDS Powered by Apache Spark. You can still use Spark 2 with HBase using other methods.

Using the JDBC Datasource API to access Hive or Impala is not supported

Structured Streaming is not supported

Cloudera does not support the Structured Streaming API because it is an experimental API.

Spark Streaming Direct Connector is not Supported

Spark 2 does not support the Spark Streaming direct connector that uses the new Kafka consumer API, which is available starting with Apache Kafka 0.9 (Cloudera Kafka 2.0) and is needed for secure clusters. Therefore, you cannot use Spark 2 to read data from Kafka using the new direct connector, and you cannot use Spark Streaming to read data from a secure cluster that uses Kerberos. You can still use the older Spark Streaming direct connector, which uses the old Kafka consumer API, to read data from Kafka in a non-secure cluster.
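
The combinations described above can be summarized as a small decision function. This is only a restatement of the paragraph in code form; the function name and the "old"/"new" labels are hypothetical and do not correspond to any Spark or Kafka setting.

```python
def kafka_read_supported(connector, cluster_is_secure):
    """Return True if Spark 2 can read from Kafka in this configuration.

    connector: "new" for the direct connector built on the new Kafka
        consumer API, "old" for the one built on the old consumer API.
    cluster_is_secure: True when the cluster uses Kerberos.
    """
    if connector not in ("old", "new"):
        raise ValueError(f"unknown connector: {connector!r}")
    if connector == "new":
        # The new direct connector is not supported with Spark 2 at all.
        return False
    # The old direct connector works only on non-secure clusters.
    return not cluster_is_secure


for connector in ("old", "new"):
    for secure in (False, True):
        supported = kafka_read_supported(connector, secure)
        print(f"connector={connector} secure={secure} supported={supported}")
```

The only supported combination is the old direct connector on a non-secure cluster.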

Oozie Spark2 Action is not Supported

The Oozie Spark action is a CDH component that has a dependency on Spark 1.6. Because CDH components do not have any dependencies on Spark 2, the Oozie Spark action does not work with Spark 2.

SparkR is not Supported

SparkR is not supported for Spark 2. (SparkR is also not supported in CDH with Spark 1.6.)

GraphX is not Supported

GraphX is not supported for Spark 2. (GraphX is also not supported in CDH with Spark 1.6.)

Thrift Server

The Thrift JDBC/ODBC server is not supported for Spark 2. (The Thrift server is also not supported in CDH with Spark 1.6.)

Spark SQL CLI is not Supported

The Spark SQL CLI is not supported for Spark 2. (The Spark SQL CLI is also not supported in CDH with Spark 1.6.)

Kudu is not Supported

The Kudu integration for Spark only works with Spark 1.6.

Rolling Upgrades are not Supported

Rolling upgrades from Spark 1.6 bundled with CDH to CDS 2 Powered by Apache Spark are not possible.

Package Install is not Supported

CDS 2 Powered by Apache Spark can be installed only as a parcel. Package installation is not available.

Spark Avro is not Supported

The spark-avro library is not integrated into the Spark 2.0 parcel.

Accessing Multiple Clusters Simultaneously Not Supported

Spark does not support accessing multiple clusters in the same application.

Hardware Acceleration for MLlib is not Supported

Hardware acceleration for MLlib, which is part of the GPL Extras package for CDH, is not supported with CDS 2 Powered by Apache Spark. It is supported for Spark 1.6.

Long-running apps on a secure cluster might fail if the driver is restarted

If you submit a long-running app on a secure cluster using the --principal and --keytab options in cluster mode, and a failure causes the driver to restart after 7 days (the default maximum HDFS delegation token lifetime), the new driver fails with an error similar to the following:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token <token_info> can't be found in cache

Workaround: None

Affected Versions: All CDS 2.0, 2.1, and 2.2 releases

Fixed Versions: CDS 2.3 Release 2

Apache Issue: SPARK-23361

Cloudera Issue: CDH-64865
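
The timing condition in this issue can be sketched as follows. This is a simplified model, not the actual Spark or HDFS logic: it assumes the delegation token is obtained when the application is submitted and that the cluster uses the default 7-day maximum lifetime. The function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Default maximum HDFS delegation token lifetime, as stated in the issue.
MAX_TOKEN_LIFETIME = timedelta(days=7)


def restart_hits_expired_token(submitted_at, restarted_at,
                               max_lifetime=MAX_TOKEN_LIFETIME):
    """Return True if the driver restarts after the token's maximum lifetime.

    Assumes (for illustration only) that the token was issued at
    submitted_at, so its age at restart is restarted_at - submitted_at.
    """
    return restarted_at - submitted_at > max_lifetime


# Hypothetical submission time for a long-running application.
submitted = datetime(2018, 3, 1, 9, 0)

early_restart = restart_hits_expired_token(submitted, submitted + timedelta(days=3))
late_restart = restart_hits_expired_token(submitted, submitted + timedelta(days=8))

print("restart after 3 days affected:", early_restart)
print("restart after 8 days affected:", late_restart)
```

Under these assumptions, only an application in an affected version whose driver restarts more than 7 days after submission would be expected to fail with the error shown above.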