In this Blog you can read a Crisp and Clear concept of Query Planning of Spark Sql Engine. This is an very important topic in the point of view of Databricks Certified Associate Developer for Apache Spark 3 Exam Certification . And it is also an important topics in Data Engineering job role interviews. So kindly don't skip this topic. In this blog we are also going to see the "Explain" command and its importance and how it is useful. Whatever may be the code, whether its pyspark or sparksql it will go through the spark sql Engine and the planning of the query will be taken place.
Now lets, get started.
When we are talking about Query Planning of Spark Sql Engine , we need to discuss about 4 Stages
- Analysis
- Logical Optimization
- Physical Planning
- Code Generation
where DAG, represents execution plan of Spark Job.
Now you wondering what is DAG. It is a separate topic but it is always better to know about it. So before going to see about Flow of Query Planning. Let us quickly see about DAG.
DAG stands for Directed Acyclic Graph. In Spark, this is used for the visual representation of RDDs and the operations being performed on them. Apache spark is basically a tree representation it's a set of vertices and edges where the vertices represents the RDD's and the edges represent the operations to be applied. Fault tolerance is achieved by this. Now let us see about the flow Diagram and the explanation in brief manner.
The Flow of Query Planning goes as shown below:
- Column Name and the datatypes of the query were checked at the Analysis stage with the help of CATALOG
- Therefore now we have corrected code or valid code. Thus Unresolved Logical plan in converted to Logical plan , and that is happened in Analysis Stage.
- Optimized Logical plan is obtained from logical plan by Logical Optimization stage, where it apply logical optimization like Predicate Pushdown, Null propagation etc
- Next stage is physical plan , here it creates multiple physical plans and applies cost based optimization
- Only one physical plan is selected based on the less cost.
- The example for selecting one physical plan from multiple physical plans is like, which join gives less cost for the query that is, broadcast join or Sort Merge join. And only one plan is chosen which has less cost
- In the Code Generation stage , the Engine will create required Java Byte Code for RDD operation.
We can see that in the above topic we can see that we mentioned RDD. What is RDD, let us see that as a last topic in this blog. RDD stands for Resilient Distributed Datasets. In Apache spark, to perform any kind of data processing your data must be represent in RDD or Dataframe. And RDD is the fundamental data structure of spark. RDD is an immutable distribution of collection of objects.
We can see the execution plan by executing below command
Explain Formatted select col1,col2 from table_name
This command will be useful when we are going in to the optimization part. Because after knowing about the execution plan only we can understand how this code will behave and in which part we need to consider for the optimization.
Like if we are doing multiple joins in a query. And we know that if a table size is less than 10 MB (that is 10 MB is the default value of broadcast join threshold) that table will be broadcasted. And these details will be showed in the "Explain" command. So we can go with cache some of the table, so that the spark will read the data from cache instead of direct read of the table files. This will increase the read performance. So we will do a cache on some table.
As we talked about cache lets see a quick note on cache. Its is nothing but storing the data in-memory so that the next time spark needs that specific data it need not read from the file, rather it can read the data from the memory. We can perform cache using below command.
CACHE TABLE table_name
Let assume the table may have many columns and out of which we need very few columns then there is no need to cache the entire table. Hence we can create a temporary view and cache that view.
CREATE OR REPLACE GLOBAL TEMPORARY VIEW its_a_view AS
SELECT col1, col2 FROM table_name ;
CACHE TABLE global_temp.its_a_view;
We used global_temp as we created a global temporary view. We might simply go for temporary view but that cannot be used in another notebook that is running in parallel as the scope of the temporary view is within the notebook . So, are going to global temporary view as that can be used in another databricks notebook that is running in parallel or in that session.
Now coming back to the point, we can confirm the cache is done properly or not via spark UI in the storage tab. But to confirm that the spark reads from the cache or directly from the table file, at that time we can use the Explain command. This will clearly shows whether the data is read from the cache . It will show like " In memory" read has been happened, that indicates its a cache read.
From the above topic , definitely you will be getting question on Databricks Certified Associate Developer for Apache Spark 3 Exam. So now you are ready to Answer that Question. As well as you can give a very clear answer in Data Engineer job role interview as well. Hope this information is helpful.
If you like this blog, you can follow the blog by clicking on follow button on the right side of this window. Feel free to follow my linked in profile and other social media as well for the updates on important topics that you need to know in crisp and clear manner.
All the Best !!!
%3CmxGraphModel%3E%3Croot%3E%3CmxCell%20id%3D%220%22%2F%3E%3CmxCell%20id%3D%221%22%20parent%3D%220%22%2F%3E%3CmxCell%20id%3D%222%22%20value%3D%22Logical%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22120%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%223%22%20value%3D%22Optimized%20Logical%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22200%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%224%22%20value%3D%22Physical%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22280%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%225%22%20value%3D%22selected%20physical%20plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22360%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3C%2Froot%3E%3C%2FmxGraphModel%3E
%3CmxGraphModel%3E%3Croot%3E%3CmxCell%20id%3D%220%22%2F%3E%3CmxCell%20id%3D%221%22%20parent%3D%220%22%2F%3E%3CmxCell%20id%3D%222%22%20value%3D%22Un%20resolved%20Logic%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%2240%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%223%22%20value%3D%22Logical%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22120%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%224%22%20value%3D%22Optimized%20Logical%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22200%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%225%22%20value%3D%22Physical%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22280%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%226%22%20value%3D%22selected%20physical%20plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22360%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%227%22%20value%3D%22RDD%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22440%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3C%2Froot%3E%3C%2FmxGraphModel%3E
%3CmxGraphModel%3E%3Croot%3E%3CmxCell%20id%3D%220%22%2F%3E%3CmxCell%20id%3D%221%22%20parent%3D%220%22%2F%3E%3CmxCell%20id%3D%222%22%20value%3D%22Un%20resolved%20Logic%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%2240%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%223%22%20value%3D%22Logical%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22120%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%224%22%20value%3D%22Optimized%20Logical%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22200%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%225%22%20value%3D%22Physical%20Plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22280%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%226%22%20value%3D%22selected%20physical%20plan%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22360%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3CmxCell%20id%3D%227%22%20value%3D%22RDD%22%20style%3D%22rounded%3D1%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3B%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20x%3D%22100%22%20y%3D%22440%22%20width%3D%22120%22%20height%3D%2260%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3C%2Froot%3E%3C%2FmxGraphModel%3E
0 Comments