AWS for Data Engineers: Conquer the Cloud in 90 Days
A 30-60-90 plan to kickstart your cloud journey with key AWS tools
The cloud has transformed how we handle data, and AWS has been at the forefront of this change. With a market share of more than 30% and a suite of 200+ services, AWS has become the go-to platform for data engineers.
But where do you begin? How do you navigate this ecosystem and harness its power?
I crafted a comprehensive 30-60-90 day plan to help you conquer AWS and dive into the modern cloud.
This is not a course, not even a tutorial. What you will read here is a guide. A game plan if you wish. I am leaving all the reading and practice to you. After all, there's no sense in bragging about a project developed by somebody else.
That said, please let me know how it is going for you. Ask questions, and share your progress. I want you to succeed in your learning.
Reading time: 7 minutes
🗺️ The Plan
Now, let's be clear. You cannot become an AWS expert in only 90 days. But I can promise you can become competent. You will need to dedicate a significant amount of your time. I recommend spending at least 10 hours per week.
But wait, there is more! I prepared a FREE Notion template for you to duplicate and track your learning progress.
Also, remember that not every service in AWS has a free tier. It might cost you some cash to run everything. There are free tools like LocalStack that emulate AWS well. Yet, I recommend using the real thing and shutting down your services when you finish using them for the day.
🎒 Prerequisites
There's nothing you need to know upfront. Still, it helps if you know how to use a text editor and have some understanding of Python and SQL.
Also, it would help if you already know about a project you want to build. In a perfect world, you'd have a way to produce some data. You can use IFTTT to track your GitHub activity, phone calls, or even how often you go to the bathroom.
If you don't want to spend time producing data or wish to avoid tracking personal data, you can rely on AWS. Browse the Registry of Open Data on AWS, pick some sources, and think of an exciting project to build.
Now, let's discuss the plan.
🥚 Days 1-30: AWS Basics
The first month of your project is about understanding the basics. Here, you will learn some core services and understand how AWS works.
Project Part #1: Build a Data Lake on S3 🪣
The first and most important service you must learn is S3. S3 is not just another file storage service: its API is the de facto standard for cloud object storage. No matter what you do in your career, you will keep running into it.
Here is what you need to do to learn S3:
Learn S3 basics - buckets, objects, storage classes, permissions
Create an S3 bucket and upload sample datasets
Define a folder structure to organise data like a data lake
Configure lifecycle policies to transition data to cheaper storage
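To make those steps concrete, here is a minimal boto3 sketch covering the list above. The bucket name, region, prefixes, and lifecycle rule are placeholder assumptions, so swap in your own:

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# Create the bucket (names are globally unique, pick your own)
s3.create_bucket(
    Bucket="my-data-lake-bucket",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Upload a sample dataset into a "landing" zone prefix
s3.upload_file("sample.csv", "my-data-lake-bucket", "landing/sales/sample.csv")

# Transition landing-zone objects to cheaper storage after 30 days, expire after a year
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-landing-zone",
                "Filter": {"Prefix": "landing/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```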
Project Part #2: Analyse Data with Athena 🔎
Storing data is only half the job; you also need to analyse what you put in S3. You will begin with Athena for the analyses. Later, you will transition to QuickSight to make a pretty dashboard.
The second part of your 30 days is a bit easier from a technical point of view but more complex from a UX standpoint:
Learn Athena basics - databases, tables, querying data
Create an Athena database and define table schemas on top of the S3 data
Write SQL queries to analyse the data and output results back to S3
Visualise query results in QuickSight
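If you prefer scripting your queries instead of clicking through the console, a minimal boto3 sketch could look like this. The database, table, and output location are hypothetical placeholders:

```python
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Run a query against the external table defined on top of the S3 data;
# Athena writes the result files back to the S3 location you specify.
response = athena.start_query_execution(
    QueryString="""
        SELECT event_date, COUNT(*) AS events
        FROM my_data_lake.landing_sales
        GROUP BY event_date
        ORDER BY event_date
    """,
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-bucket/athena-results/"},
)

print("Query execution id:", response["QueryExecutionId"])
```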
Once you finish the first 30 days of your project, you will have a functioning data platform. You will have a way to store data and a way to analyse it. Let's meet at the intermediate level.
🐣 Days 31-60: Intermediate Skills
During the second month of your AWS journey, you will move deeper into data engineering territory. You will learn everything you need to know about AWS ETL and warehousing services.
Project Part #3: Data Processing Pipelines with Glue and Kinesis 🚰
Here, you will learn about catalogues and transformations with Glue. You will also learn how to stream data and process events with Kinesis.
Here's the plan for Part 3:
Learn Glue basics - crawlers, jobs, workflows
Use Glue Crawlers to discover schemas and create a Data Catalog
Write Glue ETL jobs in Python to transform data from landing to processed zone
Learn Kinesis basics - streams, shards, producers, consumers
Ingest streaming data into Kinesis and use Kinesis Firehose to land it in S3
Trigger Glue jobs to process the streaming data using Kinesis Analytics
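To give you a feel for a Glue ETL job, here is a minimal sketch of a script that reads a crawled table from the Data Catalog and writes partitioned Parquet to the processed zone. The database, table, path, and partition key are placeholder assumptions, and the script runs inside a Glue job rather than on your laptop:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name and initialise the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler registered in the Data Catalog
landing = glue_context.create_dynamic_frame.from_catalog(
    database="my_data_lake",
    table_name="landing_sales",
)

# Write it to the processed zone as partitioned Parquet
glue_context.write_dynamic_frame.from_options(
    frame=landing,
    connection_type="s3",
    connection_options={
        "path": "s3://my-data-lake-bucket/processed/sales/",
        "partitionKeys": ["event_date"],
    },
    format="parquet",
)

job.commit()
```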
Project Part #4: Data Warehousing with Redshift 🗄️
Part 4 is all about warehousing. With Redshift, you can make your reports faster and cheaper. It's not the best warehouse for larger organisations, but it is great for smaller projects.
Check the plan for Part 4:
Learn Redshift basics - clusters, nodes, distribution styles
Provision a Redshift cluster and create tables in it
Use Glue to ETL data from S3 into Redshift
Write efficient SQL to query and join large tables
Connect QuickSight to Redshift for reporting and dashboarding
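One lightweight alternative to wiring up a full Glue connection is loading the processed files with a plain COPY statement through the Redshift Data API. Here is a minimal sketch; the cluster, database, user, table, and IAM role are placeholders for your own setup:

```python
import boto3

redshift_data = boto3.client("redshift-data", region_name="eu-west-1")

# Load the processed Parquet files from S3 into a Redshift table
response = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="analytics",
    DbUser="awsuser",
    Sql="""
        COPY analytics.sales
        FROM 's3://my-data-lake-bucket/processed/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
        FORMAT AS PARQUET;
    """,
)

print("Statement id:", response["Id"])
```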
Finishing the second 30 days of your AWS learning means you understand AWS well. You can be confident in your skills and showcase your project.
🐥 Days 61-90: Advanced Topics
You will step into advanced areas during the last month of your learning. Most of what you will do here is to improve efficiency and decrease costs.
Project Part #5: Big Data Processing with EMR ✨
In Part 5 of your AWS learning project, you must focus on EMR and Spark. As mentioned, you can run Spark with Glue, but EMR gives you much more control over the infrastructure.
Let's check your learning plan:
Learn EMR basics - clusters, nodes, steps
Provision an EMR cluster with Spark
Write PySpark to process data in S3
Optimise Spark for performance through partitioning, file formats, caching
Output aggregated results to S3 and visualise in QuickSight
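Here is a minimal PySpark sketch of the kind of script you would submit to the cluster with spark-submit. The input and output paths, column names, and aggregation are placeholder assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

# Read the processed Parquet data straight from S3
sales = spark.read.parquet("s3://my-data-lake-bucket/processed/sales/")

# Aggregate daily totals and order counts
daily_totals = sales.groupBy("event_date").agg(
    F.sum("amount").alias("total_amount"),
    F.count("*").alias("orders"),
)

# Write a small, partitioned result set back to S3 for QuickSight
(
    daily_totals.write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://my-data-lake-bucket/aggregated/daily_sales/")
)

spark.stop()
```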
Project Part #6: Cost Optimisation 💸
You know how people say the devil is in the details? This saying applies to the last part of your project. This part is about taking small steps to reduce costs and show your AWS mastery.
Learn to track and analyse costs using Cost Explorer and Budgets
Identify the most significant cost drivers and implement optimisations
Right-size Redshift and EMR clusters based on workload
Leverage Spot instances and Auto Scaling to reduce compute costs
Delete unused resources and automate cost-saving practices
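As a starting point for the cost analysis, here is a minimal boto3 sketch that breaks one month of spend down by service with Cost Explorer. The dates are placeholders:

```python
import boto3

# Cost Explorer is served from the us-east-1 endpoint
ce = boto3.client("ce", region_name="us-east-1")

# Break a month's spend down by service to find the biggest cost drivers
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-03-01", "End": "2024-04-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service}: ${amount:,.2f}")
```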
🏁 Summary
By the end of the 90 days, you will have built an end-to-end data platform on AWS. You will have everything from ingestion to storage to processing to visualisation. You'll have hands-on experience with the core services. You will also know how to optimise for cost and performance.
But this is the beginning of your AWS journey. The cloud evolves, and there's more to learn and explore. Keep pushing yourself to stay up-to-date with the latest services and best practices.
Take the skills and knowledge you gain over these 90 days and put them into practice. Build something unique, solve real-world problems, and showcase your expertise. The cloud is your playground, and the possibilities are endless.
📚 Picks of the Week
I’ve been planning an article about setting up your data engineering dev environment. This piece by
is exceptionally good. (link)

Also, the latest article by
is focused on DataOps. This is all we talk about here, too. (link)

And, as we are talking about learning roadmaps, here is an outstanding text by
. It’s a must-read if you want to break into data in 2024. (link)
😍 How Am I Doing?
I love hearing from readers and am always looking for feedback. How am I doing with Data Gibberish? Is there anything you’d like to see more or less of? Which aspects of the newsletter do you enjoy the most?
Hit reply and say hello. I’d love to hear from you!