Certification details in short:
- 60 questions, 2 hours
- At least 70% (42 questions) to pass
- Architecture: 17 questions
- DataFrame API applications: 43 questions
Syllabus:
Understanding the basics of the Spark architecture, including Adaptive Query Execution
Apply the Spark DataFrame API to complete individual data manipulation tasks, including:
selecting, renaming and manipulating columns
filtering, dropping, sorting, and aggregating rows
joining, reading, writing and partitioning DataFrames
working with UDFs and Spark SQL functions
- As more weightage is given to the DataFrame API, make sure you are confident with select, withColumn, withColumnRenamed, filter, drop, sort, groupBy, agg, and join.
- The second part is the Spark architecture, which we need to learn as well.
Make sure you are comfortable using the documentation for reference, because the documentation will be provided during the exam. And that is cool, right!
Apart from these, you need to focus on the points below; they are important topics that will surely come up in the exam, and I have also given some crisp details for some of them.
Topics to focus on for the exam:
1. Make yourself comfortable with the different ways of creating a DataFrame, as in the sketch below.
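A minimal sketch of three common ways (the column names and file path here are just placeholders):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("prep").getOrCreate()

# From a list of tuples with explicit column names
df1 = spark.createDataFrame([(1, "abc"), (2, "bcd")], ["id", "name"])

# From a list of Row objects
df2 = spark.createDataFrame([Row(id=1, name="abc"), Row(id=2, name="bcd")])

# From a file
df3 = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)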
2. explode:
explode is not a method of DataFrame, it is a function, hence we need to import it from pyspark.sql.functions.
It is useful when a column's datatype is an array, i.e., the column holds a list of elements and you need to make each element a separate row.
eg: input
id | names
1  | ["abc","bcd"]
output:
id | names
1  | abc
1  | bcd
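A minimal sketch of the above (the DataFrame is made up for illustration):

from pyspark.sql.functions import explode

df = spark.createDataFrame([(1, ["abc", "bcd"])], ["id", "names"])
df.select("id", explode("names").alias("names")).show()
# +---+-----+
# | id|names|
# +---+-----+
# |  1|  abc|
# |  1|  bcd|
# +---+-----+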
3. Narrow and Wide Transformations:
Narrow: does not result in shuffling, eg: select, filter
Wide: results in shuffling, eg: join, groupBy
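For example (assuming a DataFrame df with columns id and name):

narrow = df.select("id", "name").filter(df.id > 0)   # no shuffle needed
wide = df.groupBy("name").count()                    # shuffles rows with the same key to the same partition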
4. Difference between Transformations and Actions:
Transformations are lazy, eg: orderBy, filter, groupBy
Actions trigger execution, eg: show(), count(), take()
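For example, transformations only build up the plan; an action makes Spark actually run it:

result = df.filter(df.id > 0).orderBy("id")   # nothing runs yet
result.show()       # action: triggers execution
result.count()      # action: triggers execution again
result.take(2)      # action: returns the first 2 rows to the driver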
5. sample:
df.sample(withReplacement, fraction, seed)
eg: a new DataFrame with roughly 50 percent of the records of DataFrame df, chosen at random without replacement:
df.sample(False, 0.5, 5) - here 5 is the seed; fixing it makes the sample reproducible
6. unix_timestamp and from_unixtime:
unix_timestamp("date", "yyyy-MM-dd") - note that the month is uppercase MM; lowercase mm means minutes
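A minimal round-trip sketch (the date value is made up):

from pyspark.sql.functions import unix_timestamp, from_unixtime

df = spark.createDataFrame([("2021-06-15",)], ["date"])
df = df.withColumn("epoch", unix_timestamp("date", "yyyy-MM-dd"))       # string -> seconds since epoch
df = df.withColumn("date_again", from_unixtime("epoch", "yyyy-MM-dd"))  # seconds since epoch -> formatted string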
7. Usage of lit:
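lit() wraps a literal Python value so it can be used where a column is expected, for example:

from pyspark.sql.functions import lit

df = df.withColumn("country", lit("US"))   # adds a constant column for every row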
8. UDF:
Make sure you are clear on registering a UDF and calling it, as in the sketch below.
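A minimal sketch covering both styles (the function and names here are made up for illustration):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_upper(s):
    return s.upper() if s is not None else None

# For the DataFrame API
upper_udf = udf(to_upper, StringType())
df.select(upper_udf("name").alias("upper_name")).show()

# For Spark SQL queries
spark.udf.register("to_upper_sql", to_upper, StringType())
spark.sql("SELECT to_upper_sql('abc')").show()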
9. Execution/Deployment Mode:
Learn in detail about:
1. Spark cluster mode
2. Spark client mode
Click here to get details of the deployment modes.
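In short: in cluster mode the driver runs inside the cluster, while in client mode the driver runs on the machine that submitted the application. A typical spark-submit invocation (the script name is a placeholder):

spark-submit --master yarn --deploy-mode cluster my_app.py
spark-submit --master yarn --deploy-mode client my_app.py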
10. Cache and Persist:
Learn about the various storage levels,
and check the default storage level of persist().
Note: cache() takes no parameters.
Click here to get short and clear details of cache and persist.
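For example (for DataFrames the default storage level of persist() is MEMORY_AND_DISK):

from pyspark import StorageLevel

df.cache()                                  # no parameters; uses the default storage level
df.persist(StorageLevel.MEMORY_AND_DISK)    # storage level passed explicitly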
11. unpersist:
df.unpersist()
12. Query Planning:
Know the query planning flow: unresolved logical plan -> analyzed logical plan -> optimized logical plan -> physical plan.
Click here to get a crisp and clear idea of the query planning of the Spark SQL engine.
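You can inspect the plans Spark generates with explain(); passing True prints the parsed, analyzed, and optimized logical plans along with the physical plan:

df.filter(df.id > 0).explain(True)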
13. pow:
Note, both of these representations are correct:
pow("col",2)
pow("col", lit(2))
14. Architecture:
Learn about tasks, stages, and jobs: an action triggers a job, a job is split into stages at shuffle boundaries, and each stage runs as a set of parallel tasks.
15. Adaptive Query Execution:
Properties:
Coalescing post-shuffle partitions
Converting sort-merge join to broadcast join
Optimizing skew joins
Click here to get a clear idea about Adaptive Query Execution for the exam.
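These features are controlled by configuration properties, for example:

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # turn AQE on
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # coalesce post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # optimize skew joins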
16. Partition Pruning and Predicate Pushdown:
Spark reads the data only from the partition folders that match the filter value: the filter is applied first, and only then are the matching files scanned.
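A minimal sketch (the path and column names are placeholders):

df.write.partitionBy("country").parquet("/tmp/out")
pruned = spark.read.parquet("/tmp/out").filter("country = 'US'")   # only the country=US folder is scanned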
17. Repartition and Coalesce:
repartition can be used to increase or decrease the number of partitions, but it causes a full shuffle.
coalesce can only decrease the number of partitions; it merges existing partitions, so it avoids a full shuffle.
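For example:

df10 = df.repartition(10)            # full shuffle; can increase or decrease the partition count
df2 = df10.coalesce(2)               # merges existing partitions; avoids a full shuffle
print(df2.rdd.getNumPartitions())    # 2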
18. Responsibility of the Executor:
The executors accept tasks from the driver, execute those tasks, and return results to the driver.
19. Broadcast Variables:
Broadcast variables are immutable and lazily replicated across all nodes in the cluster when an action is triggered.
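A minimal sketch (the lookup data is made up):

lookup = spark.sparkContext.broadcast({"US": "United States"})
print(lookup.value["US"])   # read-only copy available on every node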
20. Make yourself comfortable with writing out a DataFrame.
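For example (the output paths are placeholders):

df.write.mode("overwrite").parquet("/tmp/out_parquet")
df.write.mode("append").option("header", True).csv("/tmp/out_csv")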
I have given all the details as very short notes, and I am sure this will help you prepare for and clear the exam. If you need a detailed explanation of any of these points, mention it in the comment box.
All the Best!!!