In this blog you can read a crisp and clear explanation of deploy modes in Spark.
Please give extra attention to this topic, as it is important not only for the Databricks Certified Associate Developer for Apache Spark 3 exam but also from an interview point of view. So, let's get started.
Apache Spark, an open-source distributed computing system, has gained immense popularity for its speed and ease of use in processing large-scale data. One of the critical aspects of deploying Spark applications is understanding and choosing the right deploy mode. Deploy mode refers to how a Spark application is executed on a cluster, and it significantly impacts the performance, resource utilization, and scalability of your Spark jobs.
In simple words,
" Deploy Mode in Spark specifies where exactly the driver program will run. "
The three deploy modes in Spark are:
- Spark Cluster Mode
- Spark Client Mode
- Local Mode
spark-submit --deploy-mode [client/cluster]
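To make the flag's semantics concrete, here is a minimal sketch in plain Python (an illustration of how `--deploy-mode` restricts the driver's location to one of two values; this is not Spark's actual implementation). Note that `client` is spark-submit's default when the flag is omitted:

```python
import argparse

# Illustrative sketch only: spark-submit accepts exactly two deploy modes,
# and defaults to "client" when --deploy-mode is not given.
parser = argparse.ArgumentParser(prog="spark-submit-sketch")
parser.add_argument(
    "--deploy-mode",
    choices=["client", "cluster"],
    default="client",
    help="Where the driver runs: on the submitting machine (client) "
         "or on a worker node inside the cluster (cluster).",
)

args = parser.parse_args(["--deploy-mode", "cluster"])
print(args.deploy_mode)  # -> cluster
```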
Spark Cluster Mode:
- Cluster mode is the go-to deploy mode for production scenarios where Spark applications need to process large datasets on a cluster of machines.
- In Spark cluster mode, the driver program runs on a worker node inside the cluster.
- Because the driver runs alongside the executors, driver-to-executor communication has low latency and minimal network overhead.
- In this mode, if the machine or user session that ran spark-submit terminates, the application keeps running, because the driver lives on the cluster. This reduces job failures caused by client disconnections.
- Therefore this mode is used for production jobs; it is not used for interactive work.
spark-submit --deploy-mode cluster --driver-memory....
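A fuller cluster-mode invocation might look like the sketch below. The master URL, resource sizes, and executor count are placeholder assumptions for illustration, not values prescribed by this post; tune them to your cluster:

```
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 10 \
  your_spark_application_name.py
```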
Spark Client Mode:
- Client mode is also called "driver mode".
- In Spark client mode, the driver program runs on the external client machine (the one that runs spark-submit).
- Here there is latency: driver-to-executor traffic must cross the network between the client and the cluster, so there is network overhead.
- In this mode, if the machine or user session running spark-submit terminates, the application terminates with it, so client disconnections can lead to job failures.
- Pressing Ctrl-C after running the spark-submit command will likewise terminate your application.
- Therefore this mode is not used for production jobs; it is used only for testing, that is, for interactive purposes.
spark-submit --deploy-mode client --driver-memory ....
Local Mode:
- Local mode is the simplest deploy mode and is primarily used for development, debugging, and testing purposes.
- It's an excellent choice for small datasets and allows developers to iterate quickly without the complexities of a distributed environment.
- In local mode, the application runs in a single JVM on one machine; there is no cluster.
- While local mode is convenient for development, it is not suitable for processing large datasets or leveraging the full power of distributed computing.
- To launch a Spark application in local mode, you simply set the master URL to "local" in your SparkConf or on the command line. For example:
spark-submit --master local[N] your_spark_application_name.py
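The `N` in `local[N]` controls how many worker threads Spark uses on the single machine, and `local[*]` means one thread per available CPU core. The following plain-Python sketch (illustrative only, not Spark's source code) shows how such a master URL maps to a thread count:

```python
import os
import re

def local_master_threads(master: str) -> int:
    """Map a local[...] master URL to a worker-thread count.

    Mirrors the meaning Spark gives these URLs: "local" = 1 thread,
    "local[N]" = N threads, "local[*]" = one thread per CPU core.
    """
    if master == "local":
        return 1
    m = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if m is None:
        raise ValueError(f"not a local master URL: {master!r}")
    return os.cpu_count() if m.group(1) == "*" else int(m.group(1))

print(local_master_threads("local[4]"))  # -> 4
```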
Choosing a Deploy Mode:
- For quick development cycles, use local mode.
- For testing against a multi-node environment, consider client mode, which gives you direct access to the driver program.
- For large-scale data processing, use cluster mode for optimal performance and resource utilization.
- If your client machine has sufficient resources and you need to interact with the driver program, consider client mode.
- Cluster mode allows better resource isolation and scalability across multiple nodes.
- Cluster mode provides automatic recovery in case of node failures.
- Client mode relies on the client machine to run the driver and may require manual intervention, such as resubmitting the job, after a failure.
- Cluster mode keeps the driver close to the data, and Spark distributes tasks to nodes that hold local copies of that data.
- Client mode involves data transfer between the client and the cluster, which can hurt performance.