
Delta Live Table with Example | Databricks Steps - Datacloudy


 

In this blog we are going to look at a fairly new concept in Databricks called Delta Live Tables. We will also walk through how to create the pipeline, step by step, in a crisp and clear way along with examples. Hope you are excited; let us get into the topic.

Delta Live Table:

    It is a framework for building reliable, maintainable and testable data processing pipelines. It manages task orchestration, cluster management, monitoring, data quality and error handling for you.

Now, let us see the step-by-step process we have to follow in Databricks.

Since Delta Live Tables is a framework, we have to follow the multi-hop (medallion) architecture: Bronze → Silver → Gold.

Step 1:

Declare Bronze Table:

        Bronze tables only hold the raw data. Here we are going to see an example where we create two Bronze tables.

CREATE OR REFRESH STREAMING LIVE TABLE bronze_table_1_name
AS SELECT * FROM
cloud_files("path", "json", map("cloudFiles.schemaHints", "time DOUBLE"))

In the first Bronze table of this example, we load the files incrementally. That is achieved by using the cloud_files() function (Auto Loader). Here we consider the files to be in JSON format, so we mention "json" in the query, and we use the cloudFiles.schemaHints option to declare the time column as DOUBLE.
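As a side note, cloudFiles.schemaHints takes a comma-separated, DDL-style list, so more than one column type can be pinned at once. Below is a minimal sketch of the same Bronze table with an extra hint (device_id is a hypothetical column added only for illustration):

CREATE OR REFRESH STREAMING LIVE TABLE bronze_table_1_name
AS SELECT * FROM
cloud_files("path", "json",
    map("cloudFiles.schemaHints", "time DOUBLE, device_id BIGINT"))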

As mentioned before, we are going to create the second Bronze table,

CREATE OR REFRESH STREAMING LIVE TABLE bronze_table_2_name
AS SELECT * FROM
cloud_files("path", "csv", map("cloudFiles.inferColumnTypes", "true"))

Here we consider the files to be in CSV format, so we enable the cloudFiles.inferColumnTypes option to let Auto Loader infer the column types automatically.
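A common variation at the Bronze layer is to stamp each record with its ingestion time, which helps with debugging the later hops. A minimal sketch, assuming the same hypothetical path (ingest_time is a column name introduced here for illustration):

CREATE OR REFRESH STREAMING LIVE TABLE bronze_table_2_name
AS SELECT *, current_timestamp() AS ingest_time FROM
cloud_files("path", "csv", map("cloudFiles.inferColumnTypes", "true"))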

Step 2:

Declare Silver Table:

Remember, here we can give CONSTRAINT, EXPECT and ON VIOLATION clauses. These are the key points. The Silver table contains the cleaned data derived from the raw data; the source for the Silver zone is the Bronze zone.

CREATE OR REFRESH STREAMING LIVE TABLE silver_table_name
(
  CONSTRAINT constraint_name EXPECT (col_name > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(live.bronze_table_1_name) a
INNER JOIN STREAM(live.bronze_table_2_name) b ON a.key_col = b.key_col

From this example we can understand that we give the constraint a name and the expected condition via the EXPECT keyword. On violation of the EXPECT rule we gave DROP ROW, which means the particular record violating the rule will be dropped. (One practical note: since both Bronze tables share key_col, SELECT * would emit two columns with the same name and the write can fail; in practice, list the columns you need explicitly.)

The types of violation rule we can provide are (a combined sketch follows this list):

  • FAIL UPDATE:

                The pipeline run fails when the constraint is violated.

  • DROP ROW:

                Discards the records that violate the constraint.

  • Omitted:

                When we don't give any keyword (the default), the records violating the constraint will be included in the target table, but the violations will be reported in the pipeline's data quality metrics.
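As promised, here is a minimal sketch that declares one expectation of each kind in a single table. The constraint names and columns (id, amount, event_date) are hypothetical and only for illustration:

CREATE OR REFRESH STREAMING LIVE TABLE silver_table_name
(
  CONSTRAINT id_not_null EXPECT (id IS NOT NULL) ON VIOLATION FAIL UPDATE,
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT recent_record EXPECT (event_date >= '2020-01-01')
)
AS SELECT * FROM STREAM(live.bronze_table_1_name)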

Step 3:

Declare Gold Table:


Gold tables consist of aggregated values, and the statistics computed here flow downstream to the presentation layer, i.e. they are given to visualization tools like Power BI, Tableau, etc.

CREATE OR REFRESH STREAMING LIVE TABLE gold_table_name
AS SELECT col1, col2, SUM(col3) AS col3_total
FROM STREAM(live.silver_table_name)
GROUP BY col1, col2
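A quick design note: because the Gold table above is streaming, the aggregation is maintained incrementally as new Silver rows arrive. A common alternative is to declare the Gold aggregate as a (non-streaming) live table, which Delta Live Tables recomputes from the Silver table on each pipeline update. A minimal sketch, reusing the same hypothetical names:

CREATE OR REFRESH LIVE TABLE gold_table_name
AS SELECT col1, col2, SUM(col3) AS col3_total
FROM live.silver_table_name
GROUP BY col1, col2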

 

Step 4:

Follow the steps below in Databricks,

  • Go to the Jobs section
  • Click the Delta Live Tables tab at the top of the screen
  • Click Create Delta Live Table
  • There you can add the notebook that we created in the above steps.

     Thus, in this blog we saw what Delta Live Tables are, how to create the pipeline step by step, and the types of violation rule we can provide. Hope this content was helpful and gave you a clear-cut idea about Delta Live Tables.

Thank You!!!
