In this blog we will look at Delta Lake in a simple manner. So what is Delta Lake? It is an open source storage layer that brings reliability to data lakes.
It runs on top of existing data lakes and is fully compatible with the Apache Spark APIs.
Its features are:
- ACID transactions on Spark
- Time travel
- Log or audit history
In a plain data lake, when we overwrite an existing file and an exception occurs midway, the original file may already have been deleted, so the old data is lost and the new data is incomplete. Therefore there is no atomicity in a plain data lake.
But in Delta Lake, the old file is still present after an overwrite.
Along with it, a new file is created with the updated data, and a log file records the change.
Because of that log, a read returns only the latest committed version of the data.
The log itself is maintained in JSON format, inside the table's _delta_log folder.
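The overwrite behaviour described above can be pictured with a small toy model in plain Python. This is only a sketch of the idea, not Delta Lake's actual implementation: the folder name `_log`, the file names `part-N.json`, and the functions `commit_overwrite` and `read_table` are all made up for illustration. The point it shows is that the new data file is written first and only becomes visible once a JSON log entry is committed, so a failure in between leaves the old data readable.

```python
import json
import os
import tempfile


def commit_overwrite(table_dir, rows, fail_before_commit=False):
    """Write rows to a NEW data file, then record it in a JSON log entry."""
    log_dir = os.path.join(table_dir, "_log")  # hypothetical log folder name
    os.makedirs(log_dir, exist_ok=True)
    version = len(os.listdir(log_dir))  # next version = number of commits so far
    data_name = f"part-{version}.json"
    with open(os.path.join(table_dir, data_name), "w") as f:
        json.dump(rows, f)
    if fail_before_commit:
        # Simulate a crash after the data was written but before the commit.
        raise RuntimeError("simulated failure before commit")
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as f:
        json.dump({"version": version, "add": data_name}, f)
    return version


def read_table(table_dir):
    """Return the rows named by the latest committed log entry."""
    log_dir = os.path.join(table_dir, "_log")
    latest_entry = sorted(os.listdir(log_dir))[-1]
    with open(os.path.join(log_dir, latest_entry)) as f:
        entry = json.load(f)
    with open(os.path.join(table_dir, entry["add"])) as f:
        return json.load(f)


table = tempfile.mkdtemp()
commit_overwrite(table, [{"id": 1, "amount": 100}])

# An overwrite that fails before its log entry is written.
try:
    commit_overwrite(table, [{"id": 1, "amount": 999}], fail_before_commit=True)
except RuntimeError:
    pass
after_failure = read_table(table)  # still the original rows

# A successful overwrite: the old data file stays on disk.
commit_overwrite(table, [{"id": 1, "amount": 250}])
latest = read_table(table)
old_file_kept = os.path.exists(os.path.join(table, "part-0.json"))
```

A real system also has to make the log write itself atomic and handle concurrent writers, which this sketch leaves out.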
We have already said that Delta Lake gives us ACID transactions in Spark. So what are the ACID properties? Let us have a quick look at them. This will be helpful in many interviews.
- A stands for Atomicity
- C stands for Consistency
- I stands for Isolation
- D stands for Durability
- Atomicity:
It follows the rule of all success or all failure. The example is the same scenario explained above: when an update does not complete successfully, the old data must remain as it was. There should not be any loss of data due to an update failure.
- Consistency:
After any process, the data should remain consistent. That is, all constraints must still hold. For example, a savings bank account balance should not be less than zero.
- Isolation:
Processes running at the same time should not interfere with each other. A simple way to picture it is locking a table for one process so that other processes cannot update it simultaneously. Delta Lake itself achieves isolation with optimistic concurrency control rather than table locks.
- Durability:
As the word itself suggests, the data is durable. That is, once a transaction is committed, the stored data will not be lost or change on its own.
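To make Atomicity and Consistency a little more concrete, here is a minimal sketch in plain Python using the savings account example from above. It is not how Delta Lake enforces constraints; the function `apply_transaction` and the account names are made up for illustration. All changes are applied to a copy, the constraint is checked, and only then is the result handed back, so a rejected transaction changes nothing.

```python
def apply_transaction(balances, changes):
    """Apply all changes or none; reject any result with a negative balance."""
    staged = dict(balances)  # work on a copy so the original stays untouched
    for account, delta in changes:
        staged[account] = staged.get(account, 0) + delta
    for account, amount in staged.items():
        if amount < 0:
            raise ValueError(f"constraint violated: {account} would be {amount}")
    return staged


balances = {"savings": 100, "current": 50}

# A valid transfer: both changes are applied together.
balances = apply_transaction(balances, [("savings", -30), ("current", 30)])

# An invalid transfer: savings would go below zero, so nothing is applied.
try:
    balances = apply_transaction(balances, [("savings", -200), ("current", 200)])
    rejected = False
except ValueError:
    rejected = True
```

After the rejected transfer, `balances` still holds the result of the first, valid transfer.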
The next keyword is Time Travel. Let us have a very quick look at it.
Time Travel:
We can read the old versions of the data as they were before an update. It is like travelling back to the past, so it is called time travel. In SQL it is done with TIMESTAMP AS OF or VERSION AS OF, and in the Spark DataFrame reader with the options timestampAsOf or versionAsOf.
Let us see some examples of time travel.
SELECT * FROM table TIMESTAMP AS OF '2023-02-14 12:00:00'
The above query returns the data of the table as it was at the given timestamp. It is like time travel: it goes to the past and gets the data for us.
SELECT * FROM table VERSION AS OF 2
The above query returns the data of the table as it was at version number 2. At present you may be at the 5th version, but if you request the second version, it travels back in time and gives you an exact replica of version 2.
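How a version number or a timestamp picks a snapshot can be sketched with a toy commit history in plain Python. This is an illustration only, not Delta Lake code: the `history` list and the functions `read_version_as_of` and `read_timestamp_as_of` are hypothetical, and the sketch assumes that a timestamp resolves to the most recent version committed at or before it.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit history: (commit time, table contents at that version).
history = [
    (datetime(2023, 2, 10, 9, 0), [("apple", 1)]),                # version 0
    (datetime(2023, 2, 12, 9, 0), [("apple", 1), ("pear", 2)]),   # version 1
    (datetime(2023, 2, 14, 18, 0), [("apple", 5), ("pear", 2)]),  # version 2
]


def read_version_as_of(history, version):
    """Return the table contents exactly as they were at the given version."""
    if not 0 <= version < len(history):
        raise ValueError(f"version {version} does not exist")
    return history[version][1]


def read_timestamp_as_of(history, timestamp):
    """Return the contents of the latest version committed at or before timestamp."""
    commit_times = [committed_at for committed_at, _ in history]
    index = bisect_right(commit_times, timestamp) - 1
    if index < 0:
        raise ValueError("timestamp is earlier than the first commit")
    return history[index][1]


at_version_2 = read_version_as_of(history, 2)
# Noon on 14 Feb falls before the 18:00 commit, so version 1 is returned.
at_noon = read_timestamp_as_of(history, datetime(2023, 2, 14, 12, 0))
```

With a real Delta table in PySpark, the same two lookups are normally written as reader options, along the lines of spark.read.format("delta").option("versionAsOf", 2).load(path), where path is the location of your table.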
Hence, from this blog we have a crisp overview of Delta Lake and its features, along with a brief look at ACID transactions and Time Travel.
Thank you!!!