Tuning Spark Applications

Optimizing Spark Performance

CDH 5.3 introduces a performance optimization that causes Spark to prefer scheduling tasks on hosts where an RDD's input data is already cached locally in HDFS. Using locally cached data is important enough that Spark waits a short time for the executors near these caches to become free. However, Spark does not take the cache locations into account when it starts executors, so executors are not necessarily running on the hosts that hold the cached data, and there is no further chance to select those hosts during the task-matching phase. This is not a problem for most workloads, because most workloads start executors on most or all hosts in the cluster. If the optimization does cause problems, the following constructor:
def this(config: SparkConf, preferredNodeLocationData: Map[String, Set[SplitInfo]]) = {
  this(config)
  this.preferredNodeLocationData = preferredNodeLocationData
}
allows you to explicitly specify the preferred locations in which to start executors. The following excerpt from examples/src/main/scala/org/apache/spark/examples/SparkHdfsLR.scala shows how to use this API:
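import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.scheduler.InputFormatInfo
val sparkConf = new SparkConf().setAppName("SparkHdfsLR")
val inputPath = args(0)
val conf = SparkHadoopUtil.get.newConfiguration()
// Convert the input description into the host -> splits map that the constructor expects.
val sc = new SparkContext(sparkConf,
  InputFormatInfo.computePreferredLocations(
    Seq(new InputFormatInfo(conf, classOf[org.apache.hadoop.mapred.TextInputFormat], inputPath))))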
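The second constructor argument, of type Map[String, Set[SplitInfo]], maps each hostname to the input splits stored on that host. The following minimal sketch illustrates that shape in plain Scala, without a cluster. SimpleSplit and groupByHost are hypothetical stand-ins rather than Spark APIs, and the hostnames and paths are made up. In a real application, Spark's InputFormatInfo.computePreferredLocations builds the map from the locations reported for the input splits.

```scala
// Hypothetical stand-in for Spark's SplitInfo: only the fields this sketch needs.
final case class SimpleSplit(hostLocation: String, path: String, length: Long)

object PreferredLocationsSketch {
  // Group splits by the host that stores them, yielding host -> splits on that host.
  def groupByHost(splits: Seq[SimpleSplit]): Map[String, Set[SimpleSplit]] =
    splits.groupBy(_.hostLocation).map { case (host, hostSplits) =>
      (host, hostSplits.toSet)
    }

  def main(args: Array[String]): Unit = {
    // Made-up hostnames and paths, for illustration only.
    val splits = Seq(
      SimpleSplit("host1.example.com", "/data/part-00000", 128L),
      SimpleSplit("host2.example.com", "/data/part-00000", 128L),
      SimpleSplit("host1.example.com", "/data/part-00001", 64L)
    )
    val preferred = groupByHost(splits)
    // Print how many splits each candidate host holds.
    preferred.foreach { case (host, hostSplits) =>
      println(s"$host -> ${hostSplits.size} split(s)")
    }
  }
}
```

A host that appears as a key in this map is a candidate location for an executor, because at least one split of the input is stored there.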