Mount ADLS to Databricks | Steps - Datacloudy


 

In this blog we will see how to mount ADLS to Databricks. There are multiple ways to do this, but we will look at the easiest one, since the motto of this blog is "crisp and clear".

Overview of this Page:

SL.No | Topic | Short Description
1 | ADLS | Overview of ADLS
2 | Databricks | Overview of Databricks
3 | The Synergy of ADLS and Databricks | Synergy details
4 | Steps | Steps to be followed for mounting ADLS with Databricks
5 | Benefits | A few benefits of mounting ADLS with Databricks

Before going into that, let us see what ADLS and Databricks are.

ADLS:

  •  ADLS stands for Azure Data Lake Storage. Azure Data Lake Storage is a scalable and secure data lake solution offered by Microsoft Azure. It provides a centralized repository for storing structured and unstructured data at any scale. ADLS is designed to handle massive amounts of data and allows users to perform analytics on it seamlessly. With features such as hierarchical namespace, fine-grained access control, and deep integration with other Azure services, ADLS becomes an ideal choice for enterprises seeking a comprehensive solution for their data storage needs.

  • It is a secure and scalable data lake that supports high-performance analytical workloads. In general we use ADLS Gen2, as it has a hierarchical namespace, which brings significant performance and security advantages for analytical workloads.
Databricks:

  •      Databricks, on the other hand, is a cloud-based platform for big data analytics and machine learning. Built on Apache Spark, Databricks offers a collaborative environment that brings together data engineers, data scientists, and business analysts. 

  • The platform simplifies the process of building, training, and deploying machine learning models at scale. With Databricks, organizations can accelerate their analytics workflows, derive meaningful insights, and make informed decisions based on data-driven intelligence.

The Synergy of ADLS and Databricks
  • The integration of ADLS with Databricks creates a powerful synergy, enabling organizations to seamlessly analyze and derive insights from their data. One of the key features that facilitates this integration is the ability to mount ADLS directly onto Databricks clusters.

  •  This process involves connecting ADLS as an external storage source to Databricks, allowing users to access and process data stored in ADLS using Databricks notebooks and clusters.

So, mounting means that we can access the files in ADLS from Databricks through a mount point, as if the ADLS container were a folder in the Databricks file system.
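To make the idea concrete, a mount can be thought of as an alias: a path under the mount point stands for a location inside the storage container. The sketch below is plain Python and not a Databricks API; the function name resolve_mount_path and the sample container and storage account names are assumptions chosen only for illustration.

```python
def resolve_mount_path(path: str, mount_point: str, source: str) -> str:
    """Translate a path under a mount point into the storage URI it stands for."""
    mount_point = mount_point.rstrip("/")
    if path != mount_point and not path.startswith(mount_point + "/"):
        raise ValueError(f"{path!r} is not under mount point {mount_point!r}")
    # Keep whatever follows the mount point and append it to the source URI.
    relative = path[len(mount_point):]
    return source.rstrip("/") + relative


# Hypothetical container and storage account names, for illustration only.
source = "wasbs://mycontainer@mystorage.blob.core.windows.net"
print(resolve_mount_path("/mnt/mounted_dir/abc.csv", "/mnt/mounted_dir", source))
```

Databricks does this translation for us once the mount exists, so notebooks only need the short /mnt path.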

So let us see the process step by step. As mentioned earlier, there are many methods to do this. Here we will see how to do it with the account key.


                        


Before proceeding with the steps, let us see the prerequisites.

  • An active Microsoft Azure subscription.
  • Azure Data Lake Storage Gen2 account.
  • Azure Databricks Workspace.
  • Azure Key Vault (used to store the account key as a secret).
Now let us see the steps.

Step 1: 

In the first step, we go to the storage account and note down the account key from the "Access keys" option.

  1. Navigate to Storage Account:
        -- In the left sidebar, click on "All services."
        -- In the search bar, type "Storage accounts" and select it.
        -- Find and click on your Azure Storage Account.

  2. Access Keys:
        -- In the left sidebar, under "Settings," select "Access keys."
        -- You will see a list of keys associated with your storage account.

  3. Retrieve the Storage Account Key:
        -- Copy the value shown in the "Key" field of either key. The "Connection string" field is not needed for this mount.
        -- The keys are labeled key1 and key2. It's common to use key1, but both keys are interchangeable.

Step 2:

  • In step two we write a simple piece of code in a Databricks notebook, using one of its utilities, "fs", which stands for File System.

The code is shown below:


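# container_name, Storage_name, mount_point and key must be defined beforehand.
dbutils.fs.mount(
    # Container and storage account to mount
    source=f"wasbs://{container_name}@{Storage_name}.blob.core.windows.net",
    # Directory under /mnt where the container will appear
    mount_point=f"/mnt/{mount_point}",
    # Account key passed as a "config name": value pair
    extra_configs={f"fs.azure.account.key.{Storage_name}.blob.core.windows.net": key}
)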

 

  • In the above code, container_name is the name of your ADLS container, Storage_name is the name of your ADLS storage account, mount_point is the name of the mount directory, and key is the account key fetched in step 1.
  • The key should not be written in the notebook in plain text. Usually it is stored in Azure Key Vault and fetched through a Databricks secret scope using the secrets utility:


  dbutils.secrets.get(scope=scope_name,key=key_name)  
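The mount call and the secret lookup can be combined into one small helper. This is a minimal sketch, not the only way to do it: the function name mount_with_account_key and all argument values in the usage note are hypothetical, and dbutils is passed in as a parameter so the function can also be exercised outside a notebook. It checks dbutils.fs.mounts() first, on the assumption that mounting an already used mount point would fail.

```python
def mount_with_account_key(dbutils, container_name, storage_name, mount_name,
                           scope_name, key_name):
    """Mount a container using an account key read from a secret scope.

    Returns True if a new mount was created, False if it already existed.
    """
    mount_point = f"/mnt/{mount_name}"

    # Skip the call when the mount point is already in use.
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False

    # Read the account key from the secret scope instead of hard-coding it.
    key = dbutils.secrets.get(scope=scope_name, key=key_name)

    endpoint = f"{storage_name}.blob.core.windows.net"
    dbutils.fs.mount(
        source=f"wasbs://{container_name}@{endpoint}",
        mount_point=mount_point,
        extra_configs={f"fs.azure.account.key.{endpoint}": key},
    )
    return True
```

In a notebook this could be called as mount_with_account_key(dbutils, "mycontainer", "mystorage", "mounted_dir", "my_scope", "storage-key"), where every name is a placeholder for your own values.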


 

Step 3:

  • Now we can access the files in ADLS from Databricks by directly mentioning the mount point folder.

  • Let us assume there is a file named abc.csv in the ADLS container, and the container is mounted as mounted_dir. Now let us see how to access it.

df1 = spark.read.option("header", True).csv("/mnt/mounted_dir/abc.csv")
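When the mounted directory holds several files, it can help to list them before reading. The sketch below assumes the files sit directly under the mount directory; the helper names list_csv_files and read_csv_files are hypothetical, and dbutils and spark are passed in as parameters rather than taken from the notebook globals.

```python
def list_csv_files(dbutils, mount_dir):
    """Return the paths of the .csv files directly under a mounted directory."""
    return [f.path for f in dbutils.fs.ls(mount_dir)
            if f.name.lower().endswith(".csv")]


def read_csv_files(spark, paths):
    """Read each CSV file with its first row as the header, keyed by path."""
    return {p: spark.read.option("header", True).csv(p) for p in paths}
```

In a notebook, read_csv_files(spark, list_csv_files(dbutils, "/mnt/mounted_dir")) would return one DataFrame per CSV file found.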

 

Benefits:

  • Unified Data Environment: The integration creates a unified environment where data engineers, data scientists, and analysts can collaborate seamlessly, breaking down silos and accelerating the data analytics lifecycle.

  • Scalability and Performance: Leveraging the scalability of both ADLS and Databricks, organizations can handle large datasets and complex analytics workloads with ease, ensuring optimal performance and faster time-to-insight.

  • Security and Compliance: ADLS provides robust security features, and the integration with Databricks ensures that these security measures extend to the analytics environment. This combination allows organizations to maintain compliance with data governance standards.

  • Cost Efficiency: With data stored in ADLS and processed in Databricks, organizations can optimize costs by leveraging the pay-as-you-go model offered by cloud services. This flexibility ensures that resources are allocated efficiently, aligning with the specific needs of the analytics workload.

  • Streamlined Data Pipelines: Mounting ADLS with Databricks simplifies the creation of end-to-end data pipelines. Users can ingest, process, and analyze data seamlessly, facilitating the development of comprehensive data workflows.
Disadvantages:
  1. Security Concerns:
        -- Key Exposure Risk: Account keys provide full control over the storage account. If the keys are exposed or compromised, an attacker could gain unauthorized access to the entire storage account.
        -- Limited Security Scope: Using account keys gives broad access, and it's challenging to limit access to specific operations or subsets of data. This lack of granularity might be a security concern.
  2. Key Rotation Complexity:  Regularly rotating storage account keys is a security best practice, but it can be operationally challenging. Changing keys requires updating configurations in all services using the keys, which can cause downtime if not managed carefully.

  3. Limited Auditing: Account keys provide access without detailed auditing capabilities. It can be challenging to trace and monitor specific activities at a granular level, making it difficult to identify potential security incidents or anomalies.
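The key rotation point can be made less painful with a small remount helper. This is a sketch under assumptions: the function name remount_with_new_key is hypothetical, dbutils is passed in as a parameter, and the new key would normally come from dbutils.secrets.get rather than being typed in. It unmounts the existing mount point, mounts it again with the rotated key, and then calls dbutils.fs.refreshMounts() so that other running clusters pick up the change.

```python
def remount_with_new_key(dbutils, container_name, storage_name, mount_name, new_key):
    """Replace an existing mount so that it uses a rotated account key."""
    mount_point = f"/mnt/{mount_name}"
    endpoint = f"{storage_name}.blob.core.windows.net"

    # Drop the old mount, which still carries the previous key.
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.unmount(mount_point)

    dbutils.fs.mount(
        source=f"wasbs://{container_name}@{endpoint}",
        mount_point=mount_point,
        extra_configs={f"fs.azure.account.key.{endpoint}": new_key},
    )

    # Ask other running clusters to reload the mount table.
    dbutils.fs.refreshMounts()
```

Jobs that read from the mount while it is being replaced may fail, so a remount like this is best run in a quiet window.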

In this blog we saw how to mount ADLS to Databricks and access the files in ADLS, step by step. We also covered an overview of ADLS and Databricks, the synergy between them, and the benefits and disadvantages of mounting ADLS with Databricks.

We hope this information is helpful.


Thank you !!!

