Certification details in short:

  • 60 questions, 2 hours
  • At least 70% (42 questions) to pass
  • Architecture - 17 questions
  • DataFrame API applications - 43 questions

Syllabus:

Understanding the basics of the Spark architecture, including Adaptive Query Execution

Apply the Spark DataFrame API to complete individual data manipulation tasks, including: 

        selecting, renaming and manipulating columns

        filtering, dropping, sorting, and aggregating rows

        joining, reading, writing and partitioning DataFrames

working with UDFs and Spark SQL functions


  1. Since more weightage is given to the DataFrame API, make sure you are confident with select, withColumn, withColumnRenamed, filter, drop, sort, groupBy, agg, and join.
  2. The second part is learning the Spark architecture.

Make sure you are comfortable using the documentation for reference, because it will be provided during the exam. And that is cool, right!

Apart from these, focus on the points below. They are important topics that will surely come up in the exam, and I have added some crisp details for several of them.

       Topics to focus on for the exam:

1. Make yourself comfortable with the different ways of creating a DataFrame

2. Explode:
    It is actually not a DataFrame method, so we need to import it:

    from pyspark.sql.functions import explode

    This is useful when a column's datatype is an array, i.e. the column holds a list of elements and you need to make each element a separate row.

               eg: input

               id | names
               ---|--------------
               1  | ["abc","bcd"]

                     output:

               id | names
               ---|------
               1  | abc
               1  | bcd


3. Narrow and Wide Transformations:

    Narrow: does not result in shuffling,  eg: select, filter

    Wide:   results in shuffling,          eg: join, groupBy

4. Difference between Transformation and Action:

    Transformations - eg: orderBy, filter, groupBy (lazy; they only build the plan)

    Actions         - eg: show(), count(), take() (trigger execution)

5. Sample:

    df.sample(withReplacement, fraction, seed)

    eg: a new DataFrame with roughly 50 percent of random records from DataFrame df, without replacement:

    df.sample(False, 0.5, 5)  - here 5 is the seed; fixing it makes the sample reproducible

6. unix_timestamp and from_unixtime:

    unix_timestamp("date", "yyyy-MM-dd")

    Note the pattern is case sensitive: MM is the month; mm would mean minutes.

7. Usage of lit

8. UDF: 

    Make yourself clear about registering a UDF and calling it.

9. Execution/Deployment Mode:

    Learn in detail about:

      1. Spark Cluster Mode

      2. Spark Client Mode


10. Cache and Persist:

    Learn about the various storage levels.
    Check the default storage level of persist() (for DataFrames it is MEMORY_AND_DISK).
    Note: cache() takes no parameters.


11.Unpersist:

    df.unpersist()

12. Query Planning:

      Know the query planning flow of the Spark SQL engine: unresolved logical plan → analyzed logical plan → optimized logical plan (Catalyst optimizer) → physical plan.


13. pow:

    Note, both representations are correct (after from pyspark.sql.functions import pow, lit):
      pow("col", 2)
      pow("col", lit(2))

14. Architecture:

    Learn about - Job, Stage, Task (an action triggers a job, the job is split into stages at shuffle boundaries, and each stage runs as tasks, one per partition)

15. Adaptive Query Execution (AQE):
    Properties:
      Coalescing post-shuffle partitions
      Converting sort-merge join to broadcast join
      Optimizing skew joins


16. Partition Pruning and Predicate Pushdown:

      Only the partition folders that match the filter value are read. The filter comes into the picture first, then the scan, so Spark avoids reading irrelevant data.

17. Repartition and Coalesce:

       repartition can increase or decrease the number of partitions, but it causes a full shuffle.
       coalesce can only decrease the number of partitions; it merges existing partitions and avoids a full shuffle.
       

18.Responsibility of Executor: 

      The executors accept tasks from the driver, execute those tasks, and return results to the driver.

19. Broadcast variable

       Broadcast variables are immutable and lazily replicated across all nodes in the cluster when an action is triggered.

20. Make yourself comfortable with writing DataFrames

I have kept every point as a very short note, and I am sure this will help you prepare for and clear the exam. If you need a detailed explanation of any of the points, mention it in the comment box.


All the Best!!!