Spark settings can be configured in a number of ways:

  • $SPARK_HOME/conf/spark-defaults.conf – if this file does not exist, make a copy of the provided template (spark-defaults.conf.template) and edit it.
  • On the command line when submitting jobs or starting the shell.
  • Directly on the SparkConf object passed to the SparkContext.
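For example, a single setting such as the executor memory can be supplied through any of these channels (values here are illustrative). Properties set programmatically take the highest precedence, then flags passed to spark-submit, then spark-defaults.conf:

```scala
// 1. In $SPARK_HOME/conf/spark-defaults.conf:
//      spark.executor.memory   2g

// 2. On the command line at submit time:
//      spark-submit --conf spark.executor.memory=2g my-app.jar

// 3. Programmatically, on a SparkConf passed to the SparkContext:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ConfigExample")        // app name is illustrative
  .set("spark.executor.memory", "2g") // same key as in the conf file
val sc = new SparkContext(conf)
```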

Shuffling
Spark stores intermediate data on disk from a shuffle operation as part of its “under-the-hood” optimization. When Spark has to recompute a portion of an RDD graph, it may be able to truncate the lineage of an RDD if that RDD's data is already on disk as a side effect of an earlier shuffle. This can happen even if the RDD is not cached or explicitly persisted. Set spark.shuffle.spill=false to turn this off if it is not needed.
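A minimal sketch of disabling shuffle spill. Note this flag applies to older Spark versions; newer releases ignore it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Disable spilling of shuffle data to disk (older Spark versions only).
val conf = new SparkConf()
  .setAppName("NoSpillExample")
  .set("spark.shuffle.spill", "false")
val sc = new SparkContext(conf)
```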

Caching
The caching mechanism reserves a fraction of each executor's memory; the fraction is specified by spark.storage.memoryFraction (0.6 by default).
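Caching itself is requested per RDD; the memory fraction above only controls how much executor memory the cache may use. A minimal sketch, assuming an existing SparkContext sc (the path is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///some/path")  // path is illustrative

// Cache in memory only (the default for cache()):
data.cache()

// Or choose an explicit storage level, e.g. spill to disk
// when the cache's memory budget is exhausted:
data.persist(StorageLevel.MEMORY_AND_DISK)
```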

Partitioning
Use more partitions as data size increases; the Spark documentation suggests roughly 2–4 partitions per CPU core in the cluster.
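The partition count can be set when an RDD is created, or changed later. A sketch, assuming an existing SparkContext sc (paths and counts are illustrative):

```scala
// Ask for at least 100 partitions when reading (a minimum hint):
val lines = sc.textFile("hdfs:///big/file", 100)

// Increase the partition count later (incurs a full shuffle):
val more = lines.repartition(200)

// Decrease the partition count without a full shuffle:
val fewer = more.coalesce(50)
```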

Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
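A minimal sketch, assuming a small lookup table that many tasks need (sc is an existing SparkContext; the data is illustrative):

```scala
// Ship the table once per executor instead of once per task.
val countryCodes = Map("US" -> "United States", "DE" -> "Germany") // illustrative data
val bc = sc.broadcast(countryCodes)

val records = sc.parallelize(Seq("US", "DE", "US"))
// Access the broadcast value inside tasks via .value:
val expanded = records.map(code => bc.value.getOrElse(code, "Unknown"))
```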

Memory Leaks
Closing over objects in lambdas can cause memory leaks: any object referenced by a closure is serialized and shipped with every task, so accidentally capturing the enclosing object drags the whole object along. Check the size of the serialized Spark task to make sure nothing unintended is being captured.
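The classic case is referencing a field of the enclosing object, which makes the closure capture `this`. A sketch of the pitfall and the usual fix (class and field names are illustrative):

```scala
import org.apache.spark.rdd.RDD

class SearchFunctions(val query: String) {
  // Pitfall: `query` is really `this.query`, so the whole
  // SearchFunctions instance is serialized into every task.
  def findMatches(rdd: RDD[String]): RDD[String] =
    rdd.filter(line => line.contains(query))

  // Fix: copy the field into a local val so the closure
  // captures only the String, not `this`.
  def findMatchesSafe(rdd: RDD[String]): RDD[String] = {
    val q = query
    rdd.filter(line => line.contains(q))
  }
}
```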