Apache Spark Incompatible Changes and Limitations

  • If you have uploaded the Spark assembly JAR file to HDFS, you must upload the new version of the file each time you upgrade Spark to a new minor CDH release (for example, any CDH 5.2, 5.3, 5.4, or 5.5 release). You may also need to modify the configured path for the file; see the instructions under the CDH 5.2 section.
  • CDH 5.5
    • Dynamic allocation is enabled by default, but it is not compatible with streaming. For streaming jobs, specifying --num-executors implicitly disables dynamic allocation. To be safe, you can disable it explicitly by setting spark.dynamicAllocation.enabled to false.
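      For example, dynamic allocation can be turned off cluster-wide in spark-defaults.conf, or per job on the spark-submit command line (a sketch; adjust paths for your deployment):

      ```
      # In /etc/spark/conf/spark-defaults.conf (applies to all jobs):
      spark.dynamicAllocation.enabled  false

      # Or per job, on the spark-submit command line:
      #   spark-submit --conf spark.dynamicAllocation.enabled=false ...
      ```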
    • The CDH 5.5 version of Spark 1.5 differs from the Apache Spark 1.5 release in using Akka version 2.2.3, the version used by Spark 1.1 and CDH 5.2. Apache Spark 1.5 uses Akka version 2.3.11.
  • CDH 5.4
    • The CDH 5.4 version of Spark 1.3 differs from the Apache Spark 1.3 release in using Akka version 2.2.3, the version used by Spark 1.1 and CDH 5.2. Apache Spark 1.3 uses Akka version 2.3.4.
  • CDH 5.3
    • Spark 1.2, on which CDH 5.3 is based, does not expose a transitive dependency on the Guava library. As a result, projects that use Guava but do not explicitly add it as a dependency will need to be modified: the dependency must be added to the project and also packaged with the job.
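      For a Maven-based project, the fix is to declare Guava directly; the version below is illustrative, so pick whichever version your code actually compiles against:

      ```xml
      <!-- Guava is no longer pulled in transitively by Spark; declare it explicitly
           and package it with the job (for example, via the shade plugin). -->
      <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>14.0.1</version> <!-- illustrative version only -->
      </dependency>
      ```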
    • The CDH 5.3 version of Spark 1.2 differs from the Apache Spark 1.2 release in using Akka version 2.2.3, the version used by Spark 1.1 and CDH 5.2. Apache Spark 1.2 uses Akka version 2.3.4.
  • CDH 5.2
    • The configured paths for spark.eventLog.dir, spark.history.fs.logDirectory, and the SPARK_JAR environment variable have changed in a way that may not be backward-compatible. By default, those paths now refer to the local filesystem. To make sure everything works as before, modify the paths as follows:
      • For HDFS, if this is not a federated cluster, prepend hdfs: to the path.
      • For HDFS in a federated cluster, prepend viewfs: to the path.
      Alternatively, you can prepend the value of fs.defaultFS, set in core-site.xml in the HDFS configuration.
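      For example, for a non-federated cluster the resulting spark-defaults.conf entries might look like this (the directory path is a placeholder; use your existing log directory):

      ```
      # Prepend hdfs: (or viewfs: for a federated cluster) so the paths
      # no longer resolve to the local filesystem:
      spark.eventLog.dir            hdfs:///user/spark/applicationHistory
      spark.history.fs.logDirectory hdfs:///user/spark/applicationHistory
      ```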
    • The following changes may affect existing applications:
      • The default for I/O compression is now Snappy (changed from LZF).
      • PySpark now performs external spilling during aggregations.
    • The following Spark-related artifacts are no longer published as part of the Cloudera repository:
      • spark-assembly: The spark-assembly JAR is used internally by Spark distributions when running Spark applications and should not be referenced directly. Instead, projects should add dependencies for those parts of the Spark project that are being used, for example, spark-core.
      • spark-yarn
      • spark-tools
      • spark-examples
      • spark-repl
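      For example, a project that previously depended on the assembly JAR would instead declare the specific modules it uses, such as spark-core (artifact suffix and version are illustrative; match them to your Scala version and CDH release):

      ```xml
      <!-- Depend on the specific Spark modules you use, not spark-assembly. -->
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0-cdh5.2.0</version> <!-- illustrative version only -->
      </dependency>
      ```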
  • CDH 5.1
    • Before you can run Spark in standalone mode, you must set the spark.master property in /etc/spark/conf/spark-defaults.conf, as follows:

        spark.master=spark://MASTER_IP:MASTER_PORT

      where MASTER_IP is the IP address of the host on which the Spark master runs and MASTER_PORT is its port.

      This setting means that all jobs will run in standalone mode by default; you can override the default on the command line.
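      For example, a single job can override the standalone default on the spark-submit command line; a sketch only (the class name, JAR, and master value are placeholders):

      ```
      spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar
      ```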

    • CDH 5.1 includes changes that enable Spark to avoid breaking compatibility in the future. As a result, most applications require a recompile to run against Spark 1.0, and some require changes in source code. The details are as follows:
      • This release has two changes in the core Scala API:
        • The cogroup and groupByKey operators now return an Iterable over their values instead of a Seq. This change means that the set of values corresponding to a particular key need not all reside in memory at the same time.
        • SparkContext.jarOfClass now returns Option[String] instead of Seq[String].
      • Spark Java APIs have been updated to accommodate Java 8 lambdas. See Migrating from pre-1.0 Versions of Spark for more information. Note that CDH 5.1 does not support Java 8; Java 8 support begins with CDH 5.3.
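      A sketch of how the two Scala API changes above look in application code (assumes an existing SparkContext named sc; the data and class names are placeholders, and this requires a Spark runtime to execute):

      ```scala
      // groupByKey now yields a traversable collection of values per key,
      // not a materialized Seq, so iterate or fold rather than index:
      val counts = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
        .groupByKey()
        .mapValues(values => values.sum)

      // jarOfClass now returns Option[String] instead of Seq[String]:
      val jar: Option[String] = SparkContext.jarOfClass(this.getClass)
      jar.foreach(path => sc.addJar(path))
      ```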