In this blog you will find a crisp and clear explanation of cache vs persist in Spark. It is an important topic for Data Engineer interviews, and it also matters for the Databricks Certified Associate Developer for Apache Spark 3 exam, where you will very likely get one or two questions on it. So give this topic some extra attention. Let's get started.
- In simple words, both cache and persist are used to save an RDD or DataFrame, but persist lets you save it at a user-defined storage level.
- Put another way, cache and persist are optimization mechanisms that store the intermediate computation of a DataFrame or RDD (Resilient Distributed Dataset) so that it can be reused by upcoming actions.
- The advantage is cost efficiency: Spark computations are expensive, so reusing them is often the best solution.
- It saves job execution time, which means we can run more jobs on the same cluster.
- Note that cache is called as cache(); it takes no parameters.
- We can also say that, for a DataFrame, cache() = persist(StorageLevel.MEMORY_AND_DISK), as the quick sketch below shows.
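Here is a minimal, self-contained PySpark sketch of that equivalence (the DataFrame contents are just an illustration). Note that both calls are lazy: nothing is actually stored until an action runs.

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
df = spark.range(1000)

df.cache()     # for DataFrames, the same as df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()     # an action materializes the cache
print(df.storageLevel)

df.unpersist()
df.persist(StorageLevel.MEMORY_AND_DISK)  # the explicit equivalent

Spark SQL offers the same facility for tables: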
CACHE TABLE table_name
The above code caches the entire table. In some cases the table may be huge, and if so we usually should not cache it outright; we can still do it, but only if the cluster size and memory allocation are ample. So what is the solution? A huge table typically has many columns, and if only a few of them need to be cached, we can create a view containing just those columns and cache the view instead.
CREATE OR REPLACE TEMPORARY VIEW its_a_temp_view AS
SELECT col1, col2, col3 FROM table_name;
CACHE TABLE its_a_temp_view;
In the code above, we created a temporary view holding only the selected columns and then cached it. There is one more scenario, shown below:
CREATE OR REPLACE GLOBAL TEMPORARY VIEW its_a_global_view AS
SELECT col1, col2, col3 FROM table_name;
CACHE TABLE global_temp.its_a_global_view;
We used the global_temp prefix because we created a global temporary view. We could simply use a temporary view, but a temporary view cannot be reached from another notebook running in parallel, since its scope is limited to the current session (notebook). So we go for a global temporary view, which can also be accessed from another Databricks notebook attached to the same cluster.
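The same column-pruning idea works in the DataFrame API. Here is a small sketch, continuing with the spark session from the earlier sketch (table_name, col1, col2 and col3 are the placeholders from the SQL above):

# Select only the needed columns, then cache the narrower DataFrame
pruned = spark.table("table_name").select("col1", "col2", "col3")
pruned.cache()
pruned.count()  # action materializes the cache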
When a cached table is no longer needed, we can release it:

UNCACHE TABLE table_name

The same caching operations are available directly on DataFrames.
cache_value = df.cache()

In the above example, df represents the DataFrame, and we can reuse cache_value in subsequent actions.
persist_value = df.persist()
In the above example, df represents the DataFrame, and we can reuse persist_value in subsequent actions. By default, it is stored at the MEMORY_AND_DISK storage level.
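We can confirm which level is being used by inspecting the DataFrame's storageLevel property, continuing the df example above:

print(df.cache().storageLevel)    # MEMORY_AND_DISK for a DataFrame
df.unpersist()
print(df.persist().storageLevel)  # same default: MEMORY_AND_DISK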
I know that by now you will be eager to know which storage levels are available:
- MEMORY_ONLY: RDD stored as de-serialized Java objects in JVM memory.
- MEMORY_AND_DISK: RDD stored as de-serialized Java objects in JVM memory and on disk.
- MEMORY_ONLY_SER: RDD stored as serialized Java objects in JVM memory only.
- MEMORY_AND_DISK_SER: RDD stored as serialized Java objects in JVM memory and on disk.
- DISK_ONLY: RDD stored only on disk, in serialized form.
For example, persisting with an explicit storage level (shown in Scala, where the serialized _SER levels are exposed):

import org.apache.spark.storage.StorageLevel
val dfPersist = df.persist(StorageLevel.MEMORY_AND_DISK_SER)
- A serializer performs serialization, that is, it converts a JVM object into a stream of bytes.
- For data to travel from one node to another during a shuffle, it needs to be serialized.
- After reaching the other node, it must be de-serialized.
- Some examples of serializers in PySpark are MarshalSerializer and PickleSerializer; a small configuration sketch follows.
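Here is a minimal sketch of configuring a serializer for RDD work in PySpark (MarshalSerializer is faster but supports fewer data types than the default pickle-based serializer):

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# Assumes no other SparkContext is already running in this process
sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())
print(sc.parallelize(range(5)).map(lambda x: x * 2).collect())  # [0, 2, 4, 6, 8]
sc.stop()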
persist_value = persist_value.unpersist()
In the above example, persist_value is a DataFrame that was persisted earlier; unpersist() removes its blocks from memory and disk.
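By default unpersist() is non-blocking; passing blocking=True makes the call wait until all cached blocks are actually removed:

persist_value.unpersist(blocking=True)  # wait until every block is dropped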
- For RDD cache() -> the default storage level is "MEMORY_ONLY".
- For Dataset and DataFrame cache() -> the default storage level is "MEMORY_AND_DISK".
- Similarly, for Dataset and DataFrame persist() -> the default storage level is "MEMORY_AND_DISK" (an RDD's persist() also defaults to "MEMORY_ONLY").
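A quick PySpark sketch to verify these defaults, reusing the spark session from earlier:

rdd = spark.sparkContext.parallelize(range(10))
print(rdd.cache().getStorageLevel())  # MEMORY_ONLY

df2 = spark.range(10)
print(df2.cache().storageLevel)       # MEMORY_AND_DISK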