
Cache vs Persist in Spark - Databricks Certification

   In this blog you can read a crisp and clear explanation of Cache vs Persist in Spark. It is an important topic for Data Engineer interviews, and it is also an important topic for the Databricks Certified Associate Developer for Apache Spark 3 exam, so give it extra attention: you will almost surely get one or two questions on it in the certification exam. Let's get started.

  • In simple words, both cache and persist are used to save an RDD or dataframe, but persist lets you save it at a user-defined storage level.

  • Put another way, cache and persist are optimization mechanisms that store the intermediate computation of a dataframe or RDD (Resilient Distributed Dataset) so that it can be reused by upcoming actions.

  • The advantage is cost efficiency: Spark computations are expensive, so reusing them is often the best solution.

  • It saves job execution time, which means we can run more jobs on the same cluster.
  • Note that cache is written as cache(), that is, it takes no parameters.
  • We can also say that, for dataframes, cache() = persist(StorageLevel.MEMORY_AND_DISK)
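
To make that last point concrete, here is a minimal PySpark sketch (the session setup and the spark.range demo data are assumptions for illustration) showing that, for a dataframe, cache() and a no-argument persist() land on the same default storage level:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
df = spark.range(100)        # demo dataframe

df.cache()                   # no arguments: uses the default storage level
print(df.storageLevel)       # a memory-and-disk level for dataframes

df.unpersist()
df.persist()                 # no argument: same default as cache()
print(df.storageLevel)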

Example:

Now for fuller examples, let's see both the Spark SQL and the PySpark way of coding.

First, let us see the Spark SQL way:

CACHE TABLE table_name

The above statement caches the entire table. In some cases the table may be huge, and if the size is large we usually should not cache the whole thing. We can still do it, but only if the cluster has ample resources, i.e. enough memory allocated. So what is the solution? Since a huge table usually has many columns, and we often need only a few of them cached, we can create a view that selects just those columns and then cache the view.

CREATE OR REPLACE TEMPORARY VIEW its_a_temp_view AS

SELECT col1, col2, col3 FROM table_name;


CACHE TABLE its_a_temp_view;


In the above code we created a temporary view, loaded the selected column values into it, and then cached it. There is one more scenario, shown below,


CREATE OR REPLACE GLOBAL TEMPORARY VIEW its_a_global_view AS

SELECT col1, col2, col3 FROM table_name;


CACHE TABLE global_temp.its_a_global_view;


We used the global_temp prefix because we created a global temporary view. We could simply go for a plain temporary view, but that cannot be used in another notebook running in parallel, as the scope of a temporary view is limited to the session that created it. So we go for a global temporary view, which can be used from another Databricks notebook running in parallel on the same cluster.
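
For completeness, the same column-pruning idea can be sketched in PySpark (df and the column names are hypothetical placeholders): instead of caching the whole dataframe, select only the needed columns and cache that projection.

needed_cols = df.select("col1", "col2", "col3")   # prune to the columns we actually reuse
needed_cols.cache()                               # cache only the pruned projection
needed_cols.count()                               # an action materializes the cache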


Now let's see how to uncache a cached table, using the command below.

UNCACHE TABLE table_name

As shown above, it is as simple as that.
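
If you want to drop every cached table and view at once, Spark SQL also provides a single statement for that (the PySpark equivalent is spark.catalog.clearCache()):

CLEAR CACHE;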

Now let's see the PySpark way:
cache_value = df.cache()

In the above example, df represents the dataframe, and cache() returns that same dataframe, so we can reuse cache_value.
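
One thing worth remembering: cache() is lazy, so nothing is actually stored until an action runs. A small sketch (df is the dataframe from above):

cache_value = df.cache()   # marks the dataframe for caching; nothing stored yet
cache_value.count()        # the first action materializes the cache
print(df.is_cached)        # True once the dataframe is marked as cached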


persist_value = df.persist()  

In the above example, df represents the dataframe, and we can reuse persist_value. By default it is stored at the MEMORY_AND_DISK storage level.

I know that now you will be eager to know what storage levels are available.


The various storage levels are:

  1. MEMORY_ONLY:
        RDD stored as deserialized Java objects in JVM memory only.

  2. MEMORY_AND_DISK:
        RDD stored as deserialized Java objects in JVM memory; partitions that do not fit in memory spill to disk.

  3. MEMORY_ONLY_SER:
        RDD stored as serialized Java objects in JVM memory only.

  4. MEMORY_AND_DISK_SER:
        RDD stored as serialized Java objects in JVM memory, spilling to disk when memory is insufficient.

  5. DISK_ONLY:
        RDD stored only on disk, in serialized form.
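
Under the hood, each of these levels is just a combination of a few flags. A minimal PySpark sketch (the flag order follows the StorageLevel constructor):

from pyspark import StorageLevel

# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
level = StorageLevel(True, True, False, False)
print(level.useDisk, level.useMemory, level.replication)
print(StorageLevel.MEMORY_AND_DISK)    # predefined constant with the same flags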

Let's see one more example of how to use the above storage levels, to get a clearer picture. As we know, reading with examples makes us remember a concept for longer.

Example:
from pyspark import StorageLevel

dfPersist = df.persist(StorageLevel.DISK_ONLY)

As shown above, this is a single example using DISK_ONLY, one of the storage levels discussed. One caveat for the list above: on the Python side data is always stored serialized, so PySpark does not expose the _SER constants; levels such as MEMORY_AND_DISK_SER belong to the Scala/Java API.

Now you may have a doubt: what is serialization? Let us discuss it.

  • A serializer performs serialization: it converts an object into a stream of bytes.
  • For data to travel from one node to another on a shuffle, it needs to be serialized.
  • After reaching the other node, it must be deserialized.
  • Some examples of PySpark serializers are MarshalSerializer and PickleSerializer.
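
As a sketch of how a serializer is chosen in the classic RDD API (the master and app name are placeholders), PySpark lets you pass one when building the SparkContext; MarshalSerializer is faster than the default pickle-based serializer but supports fewer data types:

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())
print(sc.parallelize(range(5)).map(lambda x: x * 2).collect())   # [0, 2, 4, 6, 8]
sc.stop()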

We can also unpersist a persisted dataframe to remove it from memory and disk. This is achieved with unpersist().

Example:

persist_value = persist_value.unpersist()  

In the above example, persist_value is the dataframe that was persisted earlier; unpersist() marks it for removal and returns the same dataframe.
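
By default unpersist() is non-blocking. If you need to be sure the cached blocks are actually gone before continuing (for example, to free memory before a big job), you can pass the blocking flag:

persist_value.unpersist(blocking=True)   # wait until all blocks are removed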

Spark automatically monitors all the persist and cache calls you make, and it drops persisted data that is not used, evicting old partitions in a least-recently-used (LRU) fashion.

At last, remember this in a simple way (see the snippet after this list):

  • For RDD cache() -> the default storage level is "MEMORY_ONLY"

  • For Dataset and dataframe cache() -> the default storage level is "MEMORY_AND_DISK"

  • Similarly, for dataframe persist() with no argument -> the default storage level is "MEMORY_AND_DISK" (RDD persist() with no argument defaults to "MEMORY_ONLY")
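
A small sketch to verify these defaults yourself (the session setup and demo data are assumptions; the exact printed names can vary between Spark versions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("default-levels").getOrCreate()

rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd.cache()
print(rdd.getStorageLevel())   # a memory-only level for RDDs

df = spark.range(10)
df.cache()
print(df.storageLevel)         # a memory-and-disk level for dataframes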

Cache and persist help us optimize our code: instead of reading the data from files every time, Spark reads it from memory (or local disk), so the time spent on reading is reduced. However, cache and persist are not the only tools for optimization; the right choice depends on the scenario. But they are a good first thing to try when it comes to performance improvement.

Thus, in this blog we looked at cache and persist in detail, with examples throughout, which will help in your certification exam. We also noted that caching helps with performance optimization; note that down as an important point.

You will very likely get a question on this topic in the Databricks Certified Associate Developer for Apache Spark 3 exam, and now you are ready to answer it. If you like this blog, you can follow it by clicking the follow button at the top right of this screen.

All the Best!!!
