Certification details in short:
- 60 questions, 2 hours
- At least 70% (42 questions) to pass
- Architecture: 17 questions
- DataFrame API applications: 43 questions
Syllabus:
Understanding the basics of the Spark architecture, including Adaptive Query Execution
Apply the Spark DataFrame API to complete individual data manipulation tasks, including:
selecting, renaming and manipulating columns
filtering, dropping, sorting, and aggregating rows
joining, reading, writing and partitioning DataFrames
working with UDFs and Spark SQL functions
- As more weightage is given to the DataFrame API, make sure you are confident with select, withColumn, withColumnRenamed, filter, drop, sort, groupBy, agg, and join.
- The second part is the Spark architecture, which we need to learn as well.
Make sure you are comfortable using the documentation for reference, because the documentation will be provided during the exam. And that is cool, right!
Apart from these, you need to focus on the points below; they are important topics that will surely come up in the exam, and I have also given some crisp details for some of them.
Topics to focus on for the exam:
1. Make yourself comfortable with the different ways of creating a DataFrame, as in the sketch below.
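A minimal sketch of three common ways (the column names and file path here are just placeholders):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("prep").getOrCreate()

# From a list of tuples with explicit column names
df1 = spark.createDataFrame([(1, "abc"), (2, "bcd")], ["id", "name"])

# From a list of Row objects
df2 = spark.createDataFrame([Row(id=1, name="abc"), Row(id=2, name="bcd")])

# From a file
df3 = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)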
2. explode:
explode is not a method of DataFrame, it is a function, hence we need to import it from pyspark.sql.functions.
It is useful when a column's datatype is an array, i.e., the column holds a list of elements and you need to make each element a separate row.
eg: input
id | names
1  | ["abc","bcd"]
output:
id | names
1  | abc
1  | bcd
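A minimal sketch of the above (the DataFrame is made up for illustration):

from pyspark.sql.functions import explode

df = spark.createDataFrame([(1, ["abc", "bcd"])], ["id", "names"])
df.select("id", explode("names").alias("names")).show()
# +---+-----+
# | id|names|
# +---+-----+
# |  1|  abc|
# |  1|  bcd|
# +---+-----+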
3. Narrow and Wide Transformations:
Narrow: does not result in shuffling, eg: select, filter
Wide: results in shuffling, eg: join, groupBy
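For example (assuming a DataFrame df with columns id and name):

narrow = df.select("id", "name").filter(df.id > 0)   # no shuffle needed
wide = df.groupBy("name").count()                    # shuffles rows with the same key to the same partition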
4. Difference between Transformations and Actions:
Transformations are lazy, eg: orderBy, filter, groupBy
Actions trigger execution, eg: show(), count(), take()
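For example, transformations only build up the plan; an action makes Spark actually run it:

result = df.filter(df.id > 0).orderBy("id")   # nothing runs yet
result.show()       # action: triggers execution
result.count()      # action: triggers execution again
result.take(2)      # action: returns the first 2 rows to the driver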
5. sample:
df.sample(withReplacement, fraction, seed)
eg: a new DataFrame with roughly 50 percent of the records of DataFrame df, chosen at random without replacement:
df.sample(False, 0.5, 5) - here 5 is the seed; fixing it makes the sample reproducible
6. unix_timestamp and from_unixtime:
unix_timestamp("date", "yyyy-MM-dd") - note that the month is uppercase MM; lowercase mm means minutes
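A minimal round-trip sketch (the date value is made up):

from pyspark.sql.functions import unix_timestamp, from_unixtime

df = spark.createDataFrame([("2021-06-15",)], ["date"])
df = df.withColumn("epoch", unix_timestamp("date", "yyyy-MM-dd"))       # string -> seconds since epoch
df = df.withColumn("date_again", from_unixtime("epoch", "yyyy-MM-dd"))  # seconds since epoch -> formatted string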
7. Usage of lit:
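lit() wraps a literal Python value so it can be used where a column is expected, for example:

from pyspark.sql.functions import lit

df = df.withColumn("country", lit("US"))   # adds a constant column for every row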
8. UDF:
Make sure you are clear on registering a UDF and calling it, as in the sketch below.
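A minimal sketch covering both styles (the function and names here are made up for illustration):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_upper(s):
    return s.upper() if s is not None else None

# For the DataFrame API
upper_udf = udf(to_upper, StringType())
df.select(upper_udf("name").alias("upper_name")).show()

# For Spark SQL queries
spark.udf.register("to_upper_sql", to_upper, StringType())
spark.sql("SELECT to_upper_sql('abc')").show()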
9. Execution/Deployment Mode:
Learn in detail about:
1. Spark cluster mode
2. Spark client mode
Click here to get details of the deployment modes.
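In short: in cluster mode the driver runs inside the cluster, while in client mode the driver runs on the machine that submitted the application. A typical spark-submit invocation (the script name is a placeholder):

spark-submit --master yarn --deploy-mode cluster my_app.py
spark-submit --master yarn --deploy-mode client my_app.py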
10. Cache and Persist:
Learn about the various storage levels,
and check the default storage level of persist().
Note: cache() takes no parameters.
Click here to get short and clear details of cache and persist.
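For example (for DataFrames the default storage level of persist() is MEMORY_AND_DISK):

from pyspark import StorageLevel

df.cache()                                  # no parameters; uses the default storage level
df.persist(StorageLevel.MEMORY_AND_DISK)    # storage level passed explicitly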
11. unpersist:
df.unpersist()
12. Query Planning:
Know the query planning flow: unresolved logical plan -> analyzed logical plan -> optimized logical plan -> physical plan.
Click here to get a crisp and clear idea of the query planning of the Spark SQL engine.
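You can inspect the plans Spark generates with explain(); passing True prints the parsed, analyzed, and optimized logical plans along with the physical plan:

df.filter(df.id > 0).explain(True)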
13. pow:
Note, both of these representations are correct:
pow("col",2)
pow("col", lit(2))
14. Architecture:
Learn about tasks, stages, and jobs: an action triggers a job, a job is split into stages at shuffle boundaries, and each stage runs as a set of parallel tasks.
15. Adaptive Query Execution:
Properties:
Coalescing post-shuffle partitions
Converting sort-merge join to broadcast join
Optimizing skew joins
Click here to get a clear idea about Adaptive Query Execution for the exam.
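These features are controlled by configuration properties, for example:

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # turn AQE on
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # coalesce post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # optimize skew joins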
16. Partition Pruning and Predicate Pushdown:
Spark reads the data only from the partition folders that match the filter value: the filter is applied first, and only then are the matching files scanned.
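A minimal sketch (the path and column names are placeholders):

df.write.partitionBy("country").parquet("/tmp/out")
pruned = spark.read.parquet("/tmp/out").filter("country = 'US'")   # only the country=US folder is scanned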
17. Repartition and Coalesce:
repartition can be used to increase or decrease the number of partitions, but it causes a full shuffle.
coalesce can only decrease the number of partitions; it merges existing partitions, so it avoids a full shuffle.
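For example:

df10 = df.repartition(10)            # full shuffle; can increase or decrease the partition count
df2 = df10.coalesce(2)               # merges existing partitions; avoids a full shuffle
print(df2.rdd.getNumPartitions())    # 2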
18. Responsibility of the Executor:
The executors accept tasks from the driver, execute those tasks, and return results to the driver.
19. Broadcast Variables:
Broadcast variables are immutable and lazily replicated across all nodes in the cluster when an action is triggered.
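A minimal sketch (the lookup data is made up):

lookup = spark.sparkContext.broadcast({"US": "United States"})
print(lookup.value["US"])   # read-only copy available on every node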
20. Make yourself comfortable with writing out a DataFrame.
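For example (the output paths are placeholders):

df.write.mode("overwrite").parquet("/tmp/out_parquet")
df.write.mode("append").option("header", True).csv("/tmp/out_csv")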
I have given all the details as very short notes, and I am sure this will help you prepare for and clear the exam. If you need a detailed explanation of any of these points, mention it in the comment box.
All the Best!!!