Apache Spark Known Issues
Apache Spark experimental features are not supported unless specifically identified as supported
If an Apache Spark feature or API is identified as experimental, Cloudera does not, in general, provide support for it.
ADLS not Supported for All Spark Components
Microsoft Azure Data Lake Store (ADLS) is a cloud-based filesystem that you can access through Spark applications. Hive-on-Spark and Spark with Kudu are not currently supported for ADLS data.
IPython / Jupyter notebooks not supported
- The IPython notebook system (renamed to Jupyter as of IPython 4.0) is not supported.
Certain Spark Streaming features not supported
- The mapWithState method is not supported because it is an experimental API that is not yet stable.
Certain Spark SQL features not supported
- Thrift JDBC/ODBC server
- Spark SQL CLI
Spark Dataset API not supported
Cloudera does not support the Spark Dataset API.
GraphX not supported
Cloudera does not support GraphX.
SparkR not supported
Cloudera does not support SparkR.
Scala 2.11 not supported
Cloudera does not support Spark on Scala 2.11 because the Scala 2.11 build is binary incompatible with the Scala 2.10 build and is not yet feature-complete.
Spark Streaming cannot consume from secure Kafka until it moves to the Kafka 0.9 Consumer API
Tables saved with the Spark SQL DataFrame.saveAsTable method are not compatible with Hive
Writing a DataFrame directly to a Hive table creates a table that is not compatible with Hive; the metadata stored in the metastore can only be correctly interpreted by Spark. For example:
val hsc = new HiveContext(sc)
import hsc.implicits._
val df = sc.parallelize(data).toDF()
df.write.format("parquet").saveAsTable(tableName)
creates a table whose metastore metadata can be correctly interpreted only by Spark. This also occurs when using an explicit schema, such as:
val schema = StructType(Seq(...))
val data = sc.parallelize(Seq(Row(...), …))
val df = hsc.createDataFrame(data, schema)
df.write.format("parquet").saveAsTable(tableName)
Workaround: Explicitly create a Hive table to store the data. For example:
df.registerTempTable(tempName)
hsc.sql(s"""
  CREATE TABLE $tableName (
    -- field definitions
  ) STORED AS $format
""")
hsc.sql(s"INSERT INTO TABLE $tableName SELECT * FROM $tempName")
Cannot create Parquet tables containing date fields in Spark SQL
Exception in thread "main" org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.UnsupportedOperationException: Parquet does not support date.
        at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:433)
This is due to a limitation (HIVE-6384) in the version of Hive (1.1) included in CDH 5.5.0.
Spark SQL does not support the union type
Tables containing union fields cannot be read or created using Spark SQL.
Spark SQL does not respect size limit for the varchar type
Spark SQL treats varchar as string (that is, there is no size limit). Spark reads and writes these columns as regular strings; if an inserted value exceeds the size limit, no error occurs. The data is truncated when read from Hive, but not when read through Spark.
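The behavior can be illustrated with a sketch (table and column names are illustrative, not from the original document):

```sql
-- Hive DDL: the column is declared with a 10-character limit
CREATE TABLE varchar_demo (name VARCHAR(10));

-- Inserting an over-length value through Spark SQL succeeds without error.
-- Hive truncates the value to 10 characters on read;
-- Spark returns the full, untruncated string.
INSERT INTO TABLE varchar_demo VALUES ('a string well over ten characters');
```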
Spark SQL does not support the char type
Spark SQL does not support the char type (fixed-length strings). Like unions, tables with such fields cannot be created from or read by Spark.
Spark SQL does not support transactional tables
Spark SQL does not support Hive transactions ("ACID").
Spark SQL does not prevent you from writing key types not supported by Avro tables
Spark allows you to declare DataFrames with any key type, but Avro supports only string keys; trying to write any other key type to an Avro table fails.
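The restriction comes from the Avro specification itself: an Avro map schema declares only the value type, because map keys are always assumed to be strings. An illustrative schema fragment:

```json
{
  "type": "map",
  "values": "int"
}
```

There is no key-type field at all, which is why a DataFrame map column with a non-string key type cannot be represented in an Avro table.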
Spark SQL does not support timestamp in Avro tables
Spark SQL does not support all ANALYZE TABLE COMPUTE STATISTICS syntax
ANALYZE TABLE <table name> COMPUTE STATISTICS NOSCAN works. Both ANALYZE TABLE <table name> COMPUTE STATISTICS (without NOSCAN) and ANALYZE TABLE <table name> COMPUTE STATISTICS FOR COLUMNS return errors.
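In statement form (the table name is a placeholder):

```sql
-- Works:
ANALYZE TABLE sample_table COMPUTE STATISTICS NOSCAN;

-- Both return errors:
ANALYZE TABLE sample_table COMPUTE STATISTICS;
ANALYZE TABLE sample_table COMPUTE STATISTICS FOR COLUMNS;
```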
Spark SQL statements that can result in table partition metadata changes may fail
Because Spark does not have access to Sentry data, it may not know that a user has permission to execute an operation, and may fail the operation instead. SQL statements that can change table partition metadata, such as ALTER TABLE or INSERT, may fail.
Spark SQL does not respect Sentry ACLs when communicating with Hive metastore
Even if a user is configured via Sentry not to have read permission to a Hive table, a Spark SQL job running as that user can still read the table's metadata directly from the Hive metastore.
Dynamic allocation and Spark Streaming
If you are using Spark Streaming, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.
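For example, the property can be set in spark-defaults.conf (or passed with --conf on spark-submit):

```
spark.dynamicAllocation.enabled false
```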
Spark uses Akka version 2.2.3
The CDH 5.5 version of Spark 1.5 differs from the Apache Spark 1.5 release in using Akka version 2.2.3, the version used by Spark 1.1 and CDH 5.2. Apache Spark 1.5 uses Akka version 2.3.11.
Spark standalone mode does not work on secure clusters
Workaround: On secure clusters, run Spark applications on YARN.