Demystifying Data Flow: ETL and ELT Explained Simply
Uncover the Key Differences and Pick The Right Approach for Your Next Data Product
You're building a new data product. Should you pick ETL or ELT for your data pipelines? Don't worry if you're unsure—I've got you covered!
Many data professionals struggle with these concepts, but mastering them is crucial for effective data management and analysis.
In this two-part series, I'll explain ETL and ELT in a clear and understandable way. By the end of this article, you'll have a clear grasp of these essential data processing methods and know precisely when to use each one.
In Part One, we'll cover the basics:
What ETL and ELT actually mean
The key differences between the two approaches
Real-world use cases for both ETL and ELT
For part two, I have prepared an extensive comparison of these two techniques.
Let's get started!
Reading time: 8 minutes
📣 Do you want to advertise in Data Gibberish? Book here
📖 What's the Difference? ETL vs ELT Explained
ETL and ELT are ways to move and process data, but they do it differently. Let me break it down for you.
ETL stands for Extract, Transform, Load. ETL works in three steps:
Extract: You pull data from different places like databases, APIs, spreadsheets, or log files.
Transform: You clean up the data and get it in the right shape for your needs.
Load: You put the cleaned-up data into the data warehouse.
Now, let's look at ELT: Extract, Load, Transform. The steps are the same, but the order is different:
Extract: Just like in ETL, you pull data from your sources.
Load: You put the raw data straight into your database.
Transform: Only after you've loaded the data do you start cleaning it up in the warehouse.
The main difference is where and when you clean up the data. In ETL, you do it before you load the data. In ELT, you do it after in your database.
Which way is better? Well, it depends on what you need, what resources you have, and what you're trying to do. That's precisely what you and I explore next.
🧭 When to Use ETL vs ELT: A Practical Guide
Now that you know the basics, let's talk about when you should use ETL and when ELT might be better.
Some people think ETL is old-fashioned, but it still has its uses. You need to think about what you need, what systems you have, and what you want to do with your data.
ETL Use Cases 🐢
Let's start with ETL. People have been using this approach for a long time, and it's still popular. My first experience in data was building ETL jobs with PySpark. Here are some times when ETL works well:
When your database isn't very powerful: If your data warehouse can't handle a lot of work (Did I hear Redshift?), ETL can help by doing the heavy lifting before the data gets there.
When you need to be really careful with your data: If you must ensure that your data is clean and follows all the rules before entering your system, ETL allows you to check everything thoroughly.
When storage costs matter: Storing all your data as is can be expensive when using cloud solutions. ETL can lower this cost significantly.
One of our main user-facing products relies on ETL. We pull tons of data from third-party APIs, transform it, run some analysis on top of this data, and then store the results in a MariaDB.
One requirement is to do that not only on a schedule but also when the customer proactively wants to recalculate these numbers.
ELT Use Cases ☁️
Now, let's look at ELT. This newer approach is more popular, especially since the modern data stack hype started around 2018. Here's when you might want to use ELT:
When using cloud-based data warehouses: If you're using platforms like Snowflake or BigQuery, they are built to handle ELT well.
When you want to keep all your raw data: If you want to keep your options open, ELT lets you store all your data and clean it up later in different ways as your needs change.
When you need up-to-the-minute information: ELT can be faster for getting raw data into a datalake or lakehouse.
One of my favourite use cases for ELT is batching event data. When we started our journey with events, we used ETL. This process was working very well until it didn't.
Streaming your events, transforming them into data frames, and loading them in tables is easy. However, we occasionally found that the events were changing, or we wanted to extract more fields from them.
With ETL, you have no choice. You leave existing data as is and update the behaviour for the new data.
That is why we moved our event processing to ELT. Now, we load raw JSON data into the warehouse. This allows us to reprocess old events and provides schema evolution history.
🐸 Hybrid Approaches: The Best of Both Worlds
Feeling torn between ETL and ELT? The good news is, you don't have to choose!
Many organisations are finding success with hybrid approaches that offer the best of both worlds. This practical solution reassures you that there's a way to navigate the complexities of data processing.
A hybrid approach is what I like to call "ETLT" (Extract, Transform, Load, Transform). Here's how it typically works:
Pull data from your source systems.
Do some initial cleaning (like removing sensitive data or fixing apparent errors).
Load this partially cleaned data into your database.
Do more cleaning within the warehouse as needed.
This approach has some significant advantages. It ensures critical cleaning happens before data enters your system, allowing you to keep flexibility for later cleaning. The work is split between ETL tools and your data warehouse.
I've found this hybrid technique effective. It's given me great results in my data projects. And to be completely transparent, the team I am a part of uses a lot of ELTT (yes, two transformation layers!) nowadays.
💡 Tips for Beginners: Getting Started with ETL and ELT
If you're new to data engineering, jumping into ETL and ELT can feel overwhelming. But don't worry – everyone starts somewhere. Here are some tips to help you get going:
Start small: Don't build a complex data pipeline immediately. Begin with a simple project, maybe moving data from one source to a target and doing basic cleaning.
Try out free tools: Many free tools are available for both ETL and ELT. Give Apache NiFi a shot for ETL or dbt for ELT transformations.
Get good at SQL: Strong SQL skills are crucial in data engineering, regardless of your chosen approach. Make sure you're comfortable with complex queries and data manipulations.
Know your data: Before you start building pipelines, take time to really understand your data sources. What kind of data are you dealing with? How often does it change? What cleaning does it need?
Follow good data practices from the start: Learn about data quality, privacy, and security best practices. Building these in from the beginning is easier than adding them later.
Getting good at ETL and ELT takes time and practice. Be patient with yourself, and celebrate your wins (even small ones). Don't be afraid to make mistakes – that's often where the best learning happens!
🏁 Summary
You learned the basics of ETL and ELT. But this is just the beginning of your journey.
Remember these key points:
There's no one "right" choice between ETL and ELT. The best approach depends on what you need, your resources, and your goal.
Both ETL and ELT have their strengths and challenges. Understanding these will help you make smart decisions.
Start small, but think big. Begin with simple data projects, but always keep your long-term data strategy in mind.
You'll face challenges whether you choose ETL, ELT, or a mix of both. But you'll also have the chance to build systems that turn raw data into valuable insights for your organisation.
So, what's your next step? Maybe it's trying out an open-source ETL tool or experimenting with ELT transformations in your data warehouse. Whatever it is, I encourage you to dive in and start exploring.
Until next time,
Yordan
📚 Picks of the Week
This piece on data quality may sound controversial initially. But as always, Ergest is completely right. (link)
- has some awesome memes. One of his spicier ones was too spicy for LinkedIn. (link)
What it takes to become an analytics engineer at Airbnb?
has the answer. (link)
Did you enjoy this article? Hit the ❤️ button or share it with a friend or coworker. 🙏🏻