In this blog you will get a crisp and clear explanation of deploy modes in Spark.

Please give extra importance to this topic: it matters not only for the Databricks Certified Associate Developer for Apache Spark 3 exam but also from an interview point of view. So, let's get started.

Apache Spark, an open-source distributed computing system, has gained immense popularity for its speed and ease of use in processing large-scale data. One of the critical aspects of deploying Spark applications is understanding and choosing the right deploy mode. Deploy mode refers to how a Spark application is executed on a cluster, and it significantly impacts the performance, resource utilization, and scalability of your Spark jobs.

In simple words,

" Deploy Mode in Spark specifies where exactly the driver program will run. "

 

The three deploy modes in Spark are:

  1. Spark Cluster Mode
  2. Spark Client Mode
  3. Local Mode

Of these, Spark Cluster Mode and Spark Client Mode are the important ones from the examination point of view. You will almost certainly get one or two questions on this topic in the Databricks Certified Associate Developer for Apache Spark 3 exam.


The default deploy mode is Spark Client Mode. You can specify the deploy mode in the spark-submit command; let us see the command:

            spark-submit --deploy-mode [client/cluster]                                

In the above command, in place of [client/cluster], we give the deploy mode that matches our requirement. The correct way of using this command, as well as when to use which deploy mode, is covered in the topics below.
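
Note that --deploy-mode is usually paired with --master, which selects the cluster manager. Here is a minimal sketch of how the two flags fit together; the standalone master URL spark://host:7077 and the script name my_app.py are placeholders for illustration, not required values:

            # --master picks the cluster manager; --deploy-mode picks where the driver runs.
            spark-submit --master spark://host:7077 --deploy-mode client my_app.py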

Spark Cluster Mode: 

  • Cluster mode is the go-to deploy mode for production scenarios where Spark applications need to process large datasets on a cluster of machines.

  • In Spark Cluster Mode, the driver program runs on a worker node inside the cluster.


              [Diagram: Spark Cluster Mode]



  • Since the driver runs inside the cluster, close to the executors, driver-executor communication has minimal latency, i.e., little network overhead.
  • In this mode, if the machine or the user session that ran spark-submit terminates, the application does not terminate, since the driver is running on the cluster. This reduces the risk of job failure due to disconnection.

  • Therefore, this mode is used for production jobs; it is not used for interactive work.
Now let us see the spark-submit command where you can specify cluster mode:


        spark-submit --deploy-mode cluster --driver-memory....                        
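
To make the elided flags above concrete, here is a hedged sketch of a full cluster-mode submission. The YARN master, the resource sizes, and the script name etl_job.py are assumptions for illustration, not required values:

            # Cluster mode on YARN: the driver runs on a worker node inside the cluster,
            # so the job keeps running even if this terminal session is closed.
            spark-submit \
              --master yarn \
              --deploy-mode cluster \
              --driver-memory 2g \
              --executor-memory 4g \
              --num-executors 3 \
              etl_job.py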


Spark Client Mode:

  • Client mode is also sometimes called "driver mode".

  • In Spark Client Mode, the driver program runs on the external client machine, i.e., the machine from which the job is submitted.


              [Diagram: Spark Client Mode]



  • Here, there is latency between the driver (on the client machine) and the executors (on the cluster), i.e., network overhead.
  • In this mode, if the machine or the user session running spark-submit terminates, the application also terminates. Disconnection issues can therefore lead to job failure.

  • Pressing Ctrl-C after running the spark-submit command also terminates your application.

  • Therefore, this mode is not used for production jobs. It is used only for testing and interactive purposes.
Now let us see the spark-submit command where you can specify client mode:

            spark-submit --deploy-mode client --driver-memory ....                         
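
As a hedged illustration (the YARN master and the script name are again assumed for the example), a client-mode submission looks almost identical to the cluster-mode one; only the --deploy-mode value changes:

            # Client mode: the driver runs in this terminal session on the submitting machine.
            # Closing the session (or pressing Ctrl-C) kills the driver, and with it the job.
            spark-submit \
              --master yarn \
              --deploy-mode client \
              --driver-memory 2g \
              interactive_test.py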


Local Mode:

  • Local mode is the simplest deploy mode and is primarily used for development, debugging, and testing purposes.

  • It's an excellent choice for small datasets and allows developers to iterate quickly without the complexities of a distributed environment.

  • In local mode, the application runs on a single machine; there is no cluster.

  • While local mode is beneficial for development, it is not suitable for processing large datasets or leveraging the full power of distributed computing.

  • To launch a Spark application in local mode, you simply set the master URL to "local" in your SparkConf or through the command line. For example:

            spark-submit --master local[number] your_spark_application_name.py

 

The number in square brackets denotes the number of worker threads (CPU cores) to use. Writing local[*] tells Spark to use as many threads as there are logical cores on the machine, while plain local runs with a single thread. Also, please note that local mode is not so important from the examination point of view; give more importance to client mode and cluster mode.
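
For example, the following commands (the script name my_app.py is a placeholder) run the same application with 4 worker threads and with all available cores, respectively:

            # local[4]: driver and executors run inside one JVM using 4 worker threads
            spark-submit --master local[4] my_app.py

            # local[*]: use as many worker threads as there are logical cores
            spark-submit --master local[*] my_app.py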


The choice of deploy mode depends on the specific requirements of your Spark application. Here are some considerations to help you decide which mode to use:

1) Development and Testing:

  • For quick development cycles, use local mode.

  • For testing in a multi-node environment, consider client mode to have more control over the driver program.

2) Production Workloads:

  • For large-scale data processing, use cluster mode for optimal performance and resource utilization.

  • If your client machine has sufficient resources and you need to interact with the driver program, consider client mode.

3) Resource Utilization:

  • Cluster mode allows better resource isolation and scalability across multiple nodes.

4) Fault Tolerance:

  • Cluster mode provides automatic recovery in case of node failures.

  • Client mode relies on the client machine for driver program execution and may require manual intervention in case of failures.

5) Data Locality:

  • Cluster mode ensures data locality by distributing tasks to nodes with local data.

  • Client mode may involve data transfer between the client and cluster, potentially impacting performance.

In conclusion, understanding Spark deploy modes is crucial for optimizing the performance and scalability of your Spark applications. Whether you're in the development phase or preparing for production, choosing the right deploy mode ensures that your Spark jobs run efficiently and effectively on your cluster. By considering factors such as data size, resource availability, and fault tolerance, you can make informed decisions on whether to deploy in local, cluster, or client mode. Remember, the right deploy mode can make a significant difference in the success of your Spark-based big data projects.

From the above topic, you will very likely get a question in the Databricks Certified Associate Developer for Apache Spark 3 exam, and now you are ready to answer it. Hope this information is helpful.


All the Best!!!