# spark-submit

`spark-submit` is the command used to submit applications to a Spark cluster. It is a powerful tool that allows you to configure various settings for your Spark jobs, including memory and CPU allocation, cluster modes, and application-specific parameters. Properly configuring `spark-submit` is essential for optimizing Spark jobs for performance and resource usage.
## Key Configurations in spark-submit

Below are the most important configurations you can use with `spark-submit`, along with their purposes and examples:
### 1. Application Resource Configuration

- `--master`: Specifies the cluster manager to connect to. It can be `local` for local mode, `yarn` for Hadoop YARN, `mesos` for Apache Mesos, or `k8s` for Kubernetes.
  - Example: `--master yarn`
- `--deploy-mode`: Defines whether to launch the driver on the worker nodes (`cluster`) or locally on the machine submitting the application (`client`).
  - Example: `--deploy-mode cluster`
- `--num-executors`: Sets the number of executors to use for the job. This is applicable in cluster modes like YARN.
  - Example: `--num-executors 5`
- `--executor-cores`: Specifies the number of CPU cores per executor. Higher values increase parallelism.
  - Example: `--executor-cores 4`
- `--executor-memory`: Allocates memory for each executor process. Proper sizing can prevent out-of-memory errors.
  - Example: `--executor-memory 8G`
- `--driver-memory`: Sets the amount of memory allocated for the driver process.
  - Example: `--driver-memory 4G`
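Before sizing a cluster job, it can help to sanity-check these flags in local mode, where no cluster is needed. Below is a minimal sketch using the SparkPi example that ships with Spark; the exact examples jar name depends on your Spark and Scala versions, hence the glob.

```bash
# Run the bundled SparkPi example locally with 4 threads and
# 2 GiB of driver memory. Local mode has no separate executors,
# so --num-executors and --executor-memory do not apply here.
spark-submit \
  --master "local[4]" \
  --driver-memory 2G \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar \
  100
```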
### 2. Configuration for Dynamic Resource Allocation

- `--conf spark.dynamicAllocation.enabled=true`: Enables dynamic allocation of executors. Spark will scale the number of executors up and down based on workload.
  - Example: `--conf spark.dynamicAllocation.enabled=true`
- `--conf spark.dynamicAllocation.minExecutors`: Minimum number of executors to be allocated when dynamic allocation is enabled.
  - Example: `--conf spark.dynamicAllocation.minExecutors=2`
- `--conf spark.dynamicAllocation.maxExecutors`: Maximum number of executors Spark can allocate.
  - Example: `--conf spark.dynamicAllocation.maxExecutors=10`
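One prerequisite the list above leaves implicit: when an idle executor is removed, its shuffle files must remain available, so dynamic allocation is normally paired with the external shuffle service (or, on Spark 3.0+, with `spark.dynamicAllocation.shuffleTracking.enabled=true`). Here is a sketch assuming a YARN cluster with the shuffle service's YARN auxiliary service installed; the class name and jar path are the same placeholders used in the full example at the end of this page.

```bash
# Scale between 2 and 10 executors based on pending work; the
# external shuffle service keeps shuffle data available after
# an idle executor is released.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.shuffle.service.enabled=true \
  --class com.example.MySparkApp \
  /path/to/my-spark-app.jar
```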
### 3. Resource Management and Scheduling

- `--conf spark.yarn.executor.memoryOverhead`: Extra memory allocated per executor for JVM overheads, in MiB unless a unit is given. This is useful for managing memory more effectively. (Since Spark 2.3 this setting is deprecated in favor of `spark.executor.memoryOverhead`.)
  - Example: `--conf spark.yarn.executor.memoryOverhead=1024`
- `--conf spark.scheduler.mode`: Configures the scheduling mode (`FIFO` or `FAIR`). FAIR scheduling allows jobs to share resources more evenly.
  - Example: `--conf spark.scheduler.mode=FAIR`
- `--conf spark.locality.wait`: Adjusts how long Spark waits to launch a task on its preferred node before scheduling it elsewhere. Helps in managing data locality.
  - Example: `--conf spark.locality.wait=3s`
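It helps to know what the overhead setting actually changes: on YARN, each executor container is sized as executor memory plus overhead, and the overhead defaults to max(384 MiB, 10% of executor memory). With the values used on this page, `--executor-memory 8G` plus an explicit 1024 MiB overhead means a request of 8192 + 1024 = 9216 MiB per container. A sketch, with the same placeholder class and jar as before:

```bash
# Per-executor container request: 8192 MiB heap + 1024 MiB
# overhead = 9216 MiB; YARN rejects the request if this exceeds
# yarn.scheduler.maximum-allocation-mb.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 8G \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.scheduler.mode=FAIR \
  --class com.example.MySparkApp \
  /path/to/my-spark-app.jar
```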
### 4. Spark Logging and Debugging

- `--conf spark.eventLog.enabled=true`: Enables Spark event logging. This helps in monitoring and debugging by storing event information.
  - Example: `--conf spark.eventLog.enabled=true`
- `--conf spark.eventLog.dir`: Specifies the directory where the event logs should be stored.
  - Example: `--conf spark.eventLog.dir=hdfs:///logs/`
- `--conf spark.executor.logs.rolling.strategy=time`: Sets the rolling strategy for executor logs. Useful for managing log file sizes and retention.
  - Example: `--conf spark.executor.logs.rolling.strategy=time`
- `--conf spark.executor.logs.rolling.time.interval=daily`: Defines the interval for rolling executor logs.
  - Example: `--conf spark.executor.logs.rolling.time.interval=daily`
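Event logs are what the Spark History Server reads to reconstruct the web UI for finished applications. One way to wire the two together, assuming a standard Spark installation and an HDFS directory matching the `spark.eventLog.dir` value used above:

```bash
# Create the event log directory once, up front; Spark does not
# create it for you.
hdfs dfs -mkdir -p /logs

# Point the History Server at the same directory and start it;
# the UI is served on port 18080 by default.
echo "spark.history.fs.logDirectory hdfs:///logs/" >> "$SPARK_HOME"/conf/spark-defaults.conf
"$SPARK_HOME"/sbin/start-history-server.sh
```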
### 5. Application-Specific Configurations

- `--conf spark.sql.shuffle.partitions`: Configures the number of partitions to use when shuffling data during Spark SQL operations. Tuning this number can optimize shuffle performance.
  - Example: `--conf spark.sql.shuffle.partitions=200`
- `--conf spark.serializer`: Specifies the serializer for RDDs. The default is Java serialization, but Kryo serialization can be more efficient.
  - Example: `--conf spark.serializer=org.apache.spark.serializer.KryoSerializer`
- `--conf spark.executor.extraJavaOptions`: Passes additional JVM options for executors. Useful for setting system properties or managing JVM memory.
  - Example: `--conf spark.executor.extraJavaOptions="-XX:+UseG1GC"`
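Kryo is fastest when it knows your classes in advance. Spark exposes this via `spark.kryo.classesToRegister` (a comma-separated class list) and `spark.kryo.registrationRequired`, which turns serialization of an unregistered class into an error rather than a silent fallback to writing full class names. A sketch; `com.example.MyRecord` is a hypothetical application class:

```bash
# Register application classes with Kryo and fail fast if a
# serialized class was not registered.
spark-submit \
  --master yarn \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.classesToRegister=com.example.MyRecord \
  --conf spark.kryo.registrationRequired=true \
  --class com.example.MySparkApp \
  /path/to/my-spark-app.jar
```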
### 6. Security Configurations

- `--conf spark.authenticate=true`: Enables authentication for Spark communication to enhance security.
  - Example: `--conf spark.authenticate=true`
- `--conf spark.authenticate.secret`: Defines the secret key for Spark authentication.
  - Example: `--conf spark.authenticate.secret=mySecretKey`
- `--conf spark.ssl.enabled=true`: Enables SSL for all Spark communication.
  - Example: `--conf spark.ssl.enabled=true`
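A practical caveat: passing `spark.authenticate.secret` with `--conf` exposes it in your shell history and the process list, and on YARN and Kubernetes the secret is generated automatically anyway, so an explicit secret matters mainly for standalone clusters. One way to keep it off the command line is a properties file; note that `--properties-file` is used *instead of* `conf/spark-defaults.conf`, not merged with it. The standalone master URL below is a placeholder:

```bash
# Keep the shared secret in a file readable only by the
# submitting user instead of on the command line.
cat > secure.conf <<'EOF'
spark.authenticate        true
spark.authenticate.secret mySecretKey
EOF
chmod 600 secure.conf

spark-submit \
  --master spark://master-host:7077 \
  --properties-file secure.conf \
  --class com.example.MySparkApp \
  /path/to/my-spark-app.jar
```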
## Example spark-submit Command

Below is an example of a `spark-submit` command using several of these configurations:
```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 8G \
  --driver-memory 4G \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///logs/ \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
  --conf spark.authenticate=true \
  --conf spark.authenticate.secret=mySecretKey \
  --class com.example.MySparkApp \
  /path/to/my-spark-app.jar
```
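Because this submits in `cluster` deploy mode, the driver runs on the cluster and its output does not appear in your terminal. Assuming YARN log aggregation is enabled, a typical way to retrieve the logs after the run (the application ID below is a placeholder; use the one `spark-submit` prints, or find it in the ResourceManager UI):

```bash
# Fetch aggregated driver and executor logs for a finished run.
yarn logs -applicationId application_1700000000000_0042
```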