Getting the AWS Big Data Certification

“Why am I so nervous? I haven’t felt like this since… I took the first one.”

After five AWS exams (including the two pro exams) and a handful of other certification exams, I didn’t really expect to be nervous taking this one. Oddly enough, nervous is exactly how I felt going in, during the exam, and in that interminable “moment” between clicking the Submit Exam button and actually seeing the result on the page. After some reflection, I believe (for me, at least) that two key factors contributed to this:

  • The sheer amount of information encompassed by the exam
  • My unfamiliarity with many of the services. For the pro exams, most of the covered services were ones I used on a day-to-day basis (some less frequently, but still with a somewhat regular amount of usage and exposure).

Of those two factors, the second was the weightier. In a lot of ways, my study for the pro exams really only focused on the services I didn’t use frequently. Being more of a “DevOps” than a “Big Data” person at the moment, my exposure to the services this exam covers is quite limited. For me, the best way to study is hands-on usage, so in addition to Sanjay Kotecha’s excellent A Cloud Guru course (which I would consider a bedrock foundational resource for this exam), I would suggest playing around with as many of these services as is realistic in the lead-up to the exam. Some examples:

  • Create an ML regression model based on Litecoin pricing to get an idea of what the market might look like in the days to come
  • Set up a Firehose delivery stream just to see how it works
  • Spin up a quick EMR cluster with a Hive script step
  • Buy an IoT button and do something cool with it (and understand how it interacts with the IoT service in the process)
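
The EMR suggestion above is easy to script. Here’s a minimal sketch, assuming boto3 on the submission side; every name in it (bucket, script path, cluster name, IAM roles) is a hypothetical placeholder, and the release label will likely be outdated by the time you read this — substitute your own:

```python
# Sketch: a transient EMR cluster that runs one Hive script step, then
# terminates. All names (bucket, script path, roles) are hypothetical.

def build_hive_step(script_s3_uri, output_s3_uri):
    """Step definition that runs a Hive script via command-runner.jar."""
    return {
        "Name": "Run Hive script",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script",
                "--args", "-f", script_s3_uri,
                "-d", f"OUTPUT={output_s3_uri}",
            ],
        },
    }

def build_cluster_config(step):
    """Keyword arguments for boto3's emr run_job_flow() call."""
    return {
        "Name": "study-cluster",            # hypothetical cluster name
        "ReleaseLabel": "emr-5.30.0",       # pick a current release label
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # False = transient cluster: it shuts down once the steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [step],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

if __name__ == "__main__":
    config = build_cluster_config(
        build_hive_step("s3://my-bucket/scripts/query.q",
                        "s3://my-bucket/output/"))
    # With boto3 installed and credentials configured, this submits it:
    # response = boto3.client("emr").run_job_flow(**config)
    print(config["Steps"][0]["Name"])
```

The `KeepJobFlowAliveWhenNoSteps: False` line is the part worth noticing for the exam: it’s what makes the cluster transient rather than long-running.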

As mentioned before, Sanjay Kotecha’s A Cloud Guru course is a definitive resource for studying for the exam. I thought his content was spot-on, informative, and pertinent. He covers concepts that are relevant to day-to-day usage while not necessarily appearing on the exam, and he’s very clear in delineating between these two categories of content. Nonetheless, I’ll still provide my own list of concepts that I encountered on the exam. From top to bottom, I’ve ordered the concepts from most important from a scoring perspective to least (i.e. more questions to fewer questions) as I encountered them; within each service, the same principle applies.

  • Redshift

    • Distribution key design
    • How encryption is achieved and what the implications are of encrypting or not encrypting at creation time
    • Distribution key design
    • WLM (workload management)
    • Efficient data loading practices
    • Distribution key design
    • (I hope you see a pattern here)
  • EMR

    • Use-cases for transient vs non-transient clusters
    • EMRFS: when it makes sense and when it doesn’t
    • Hive
      • General usage patterns
      • How to store data in S3 to achieve efficient partition schemes
    • Spark(SQL) vs Hive vs Presto
    • MLlib vs Amazon ML: scenarios for when one is appropriate vs another
  • Kinesis

    • What the KCL and KPL are and when they are leveraged
    • How they leverage DynamoDB
    • Ingestion patterns/use-cases (vs something perhaps provided by another service, e.g. an S3 event notification)
    • Differences in use-case between Streams and Firehose
    • Which services integrate with Kinesis directly
    • Differences in available security features between Streams and Firehose
  • Machine Learning

    • Binary classification vs multi-class classification vs regression: when to use each model based on data set and type of decision outcome needed
  • DynamoDB

    • Implications of key construction on performance
    • GSIs vs LSIs
    • Implications of changing WCUs/RCUs post-creation on partitioning and performance
    • Streams
  • IoT

    • Rules and actions: how the two differ and when to use each
  • QuickSight

    • Appropriate visualization choice for a given dataset or analytic requirement
  • Lambda

    • Integration patterns with other services
    • Usage patterns for Lambda-specific workflows (e.g. non-analytical file and data processing)
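
Given how often distribution key design came up above, it’s worth having the three Redshift distribution styles down cold. A minimal sketch (table and column names are hypothetical; run the DDL through any Postgres-compatible driver connected to a Redshift cluster):

```python
# Sketch: Redshift CREATE TABLE DDL for the three distribution styles.
# Table and column names are hypothetical examples.

# DISTSTYLE KEY: rows sharing a value of the DISTKEY column land on the
# same node slice, so joins on customer_id avoid cross-node shuffling.
ORDERS_DDL = """
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    total_cents BIGINT
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_id);
"""

# DISTSTYLE ALL: replicate a small dimension table on every node,
# so any join against it is always local.
REGIONS_DDL = """
CREATE TABLE regions (
    region_id INT,
    name      VARCHAR(64)
)
DISTSTYLE ALL;
"""

# DISTSTYLE EVEN: round-robin the rows when there is no dominant join
# key; this avoids skew, but joins may require redistribution at runtime.
EVENTS_DDL = """
CREATE TABLE events (
    event_id BIGINT,
    payload  VARCHAR(256)
)
DISTSTYLE EVEN;
"""

if __name__ == "__main__":
    for ddl in (ORDERS_DDL, REGIONS_DDL, EVENTS_DDL):
        print(ddl.strip(), "\n")
```

If you can explain when you’d reach for each of the three styles (and what a bad DISTKEY does to node skew), you’ve covered the pattern I kept running into.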

As I’ve taken these exams and achieved these certifications, people have often asked me what my “secret” is. While I don’t really think I have any insider knowledge here (just study until you know the material), I guess my one trick is this: stop talking about it and simply sign up for the exam. In other words: commit to taking the exam. And don’t sign up for a date two months from now. Sign up for a date at most two weeks out from the day you actually sit down and register. This is what has always separated the times when I just talk about taking an exam from the times when I actually prepare for it seriously.
