Mastering No-Fail dbt Models With Unit Testing and TDD
Transform Your Data Projects into Error-Resistant Masterpieces with This In-Depth Guide
In 2018, we decided the classic ETL process with PySpark wasn't working for us. It was too cumbersome for a small team that wanted to move fast. Not to mention how hard it was to support the entire infrastructure.
So we started a migration to the modern data stack (was that a term back then?) with Snowflake and dbt.
We knew what we needed and had a solid execution plan, but here's the truth:
Your stakeholders don't care what process you follow as long as you provide them with the correct data on time.
Now put yourself in my shoes:
Would you risk breaking existing processes and losing people's trust?
Your stakeholders rely on you to deliver accurate data on time. And you don't want to be the one explaining that they can't do their jobs just because you don't like PySpark.
We started with some research: we read the dbt documentation and every other resource we could find online. We learned how data teams work on their dbt projects and how they make sure they deliver quality work.
Our research showed dbt tests were not enough. We needed a solid process to guarantee we would provide top-notch quality.
As people with a software engineering background, we opted for unit tests and a test-driven development process known as TDD.
This is the first of a two-part article on unit testing with dbt. In this issue, you’ll learn the theory behind unit testing, and in the next one, you’ll get your hands dirty.
Here’s what you’ll learn today:
Your Current Process
You Are Doing It Wrong
Unit Testing 101
Benefits of Unit Testing
Test-Driven Development
Reading time: 8 minutes
Your Current Process
Picture this:
You've got a coffee machine. You expect it to produce the exact same latte macchiato as long as you use the same ingredients every time.
If you change the coffee brand, your latte will taste different. If you use expired milk, you'll need to throw away your daily dose of delight. And if you forget the water, you may break the entire coffee machine.
The same applies to your dbt models:
Your models should produce exactly the same table every time you feed them the same source data. And they must fail if you feed them broken data.
And how do you build your models to ensure they do what they need to?
You probably do something like this:
Write some SQL code
Test manually
Iterate on the first two steps until the result table looks fine
Add some data tests to check for unique and not-null values (sketched after this list)
Repeat the whole cycle until everything's green
Deploy the model
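For reference, here's roughly what those data tests look like in a schema file. This is a minimal sketch; the model and column names are made up for illustration.

```yaml
# models/marts/orders.yml (hypothetical file, model, and column names)
version: 2

models:
  - name: orders_enriched
    columns:
      - name: order_id
        tests:
          - unique    # every order appears exactly once
          - not_null  # no order is missing its key
```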
And if you’ve read any of the best practices promoted by the community, you know that’s the exact process most data and analytics engineers follow. It’s straightforward and makes sense.
But here’s the crux:
You Are Doing It Wrong
Although the process we just outlined is the de facto standard, that doesn't make it right. Whoever suggested it assumed you don't have control over what data you get or that you are just too mediocre to follow a better process.
And before I tell you what your process should look like, here's a short list of the most significant issues you currently have:
Design Flaws
Let's be honest:
How many times have you been sucked into the code and forgotten its purpose?
Me too! I can't count how many times I added features my stakeholders didn't need just because I thought I might need them in the future.
So, instead of focusing on requirements and building something simple and valuable, I focused on code and built complex and confusing models.
If you are like me, you've spent days or even weeks identifying the flaws in your designs.
Rigid Logic
Let's face it:
While you work on a model, only you and the gods know how this model works. After you deploy it, the gods remain the only ones who know how this model works.
During the development process, you test everything thoroughly and make sure you provide the best possible value. But more often than not, the result is a gigantic piece of SQL that's too hard to understand.
You feel the pain in your gut whenever somebody requests a change in that model. It's not just the logic but all the edge cases you need to test.
In my experience, manual spot-checking and basic dbt tests guarantee you'll mess things up when you modify complex models.
But wait, there's more!
Testing Too Late
Let's get back to our coffee machine example.
What if the manufacturer sent somebody to sip every cup of coffee you make?
They want to be sure you use the right coffee beans and the perfect water temperature. The machine itself hasn't changed; they just want to taste every single cup to make sure your experience is ideal.
I mean, that's what you probably do with your dbt models now!
You don't really test your model. Instead of ensuring your model does what it is supposed to, you test the incoming data.
But testing the data after it lands in your warehouse is too late. The approach is expensive, and the problems it catches are hard to fix. Not to mention, you are focusing on the wrong problem.
Enough problems; let's talk about the solution.
Unit Testing 101
If there's one thing you need to remember from this article, this is it:
Don't test your raw data. Instead, test your models.
To circle back to the coffee machine example, you don't want the vendor to test your ingredients. You want them to test the unit before it leaves the factory. You also want them to quickly repair it if needed and give you instructions on using that appliance.
You want to act as a coffee machine maker. You need a reproducible testing process, good documentation, and a cheap and easy way to build new product versions.
To do that, you need to eliminate all variable components and test as early as possible. This approach is straightforward and requires just two components.
Input Data
Instead of testing the results of your transformations after each run, you should put a gate in front of them. We call this concept data contracts nowadays, and we can talk more about it in another issue.
You also don't want to test your model's behaviour with production data, not only for security reasons but also because this data is non-deterministic: it's always different.
Instead, create some dummy raw data manually. Think about all the scenarios your model needs to handle. Reuse the same mock data every time you test your models. Make sure your fixtures evolve as your model's expected input does.
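To make this concrete, here's a minimal sketch of what fixtures can look like, using dbt's built-in unit testing syntax (available since dbt Core 1.8) as one possible implementation; the model and column names are hypothetical:

```yaml
# models/marts/orders_unit_tests.yml (hypothetical names, dbt Core 1.8+ syntax)
unit_tests:
  - name: orders_enriched_handles_unknown_customers
    model: orders_enriched
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, customer_id: 10, amount: 25.0}
          - {order_id: 2, customer_id: null, amount: 13.5}  # edge case: order without a customer
      - input: ref('stg_customers')
        rows:
          - {customer_id: 10, segment: retail}
```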
Output Data
You know exactly what your input mock data looks like. You also know how your model needs to work. That means you know precisely what the output will look like when you feed your model with your static data, also known as fixtures.
You can build your models using your dummy data and compare the results with your expectations.
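Continuing the sketch from above, the expectation is just another set of static rows attached to the same test. dbt runs the model against the given fixtures and diffs the result against the expected rows; any mismatch fails the test. The output columns here are, again, hypothetical:

```yaml
    # appended to the unit test sketched above
    expect:
      rows:
        - {order_id: 1, customer_id: 10, segment: retail, amount: 25.0}
        - {order_id: 2, customer_id: null, segment: unknown, amount: 13.5}  # missing customers default to 'unknown'
```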
This concept is powerful, and here's why:
Benefits of Unit Testing
Let me start by breaking a misconception. Most people believe unit testing will slow them down. You do need time to collect all the requirements and craft good unit tests, but that time is not wasted.
In fact, your development process speeds up a lot after you build the tests. At the end of the day, those tests run extremely fast and eliminate manual labour.
But speed is not everything. You may need to change your model's internal behaviour entirely. And because your fixtures and expectations are static and can handle all scenarios, you can be sure you didn't break anything.
Even better, you can change any upstream dbt models and be confident you won't break the entire dbt DAG.
The research backs this up: a study conducted at IBM reported a 40% drop in defect density, and Microsoft reported a whopping 60–90% drop after adopting test-driven development.
On top of that, your tests can act as documentation. Everybody in the team can check what data the model works with and what data it outputs.
Also, because your fixtures and expectations are way smaller than your production data, you can run your tests as often as you want without worrying about the bills.
In my experience, the most significant benefit is that you don't need to break your production data to find out you have messed up. This doesn't just boost your confidence; it also increases your stakeholders' trust in you.
And these are just some of the benefits of unit testing. There are so many more!
Instead of turning this newsletter issue into a unit testing appraisal, let's talk about how unit testing can improve your development process.
Test-Driven Development
Earlier in this article, we outlined the standard development process most data and analytics engineers worldwide use. It's cumbersome and resembles the classic waterfall software engineering process.
Unit testing allows you to implement a more agile process with a much shorter feedback loop:
Craft your test's fixtures and expectations
Write some SQL code
Run the unit test
Iterate on the previous two steps until the test passes
Deploy the model
This process is called test-driven development, or TDD. The process lets you focus on the requirements upfront and confidently build your code according to those requirements.
And because the process is so fast, engineers often set up their IDEs to run the tests automatically every time they save a file.
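To make the loop tangible, here's a rough sketch of a single TDD iteration, reusing the hypothetical orders_enriched model and the dbt Core 1.8+ unit test syntax from earlier:

```yaml
# 1. Describe the new requirement as a unit test first; it should fail.
unit_tests:
  - name: orders_enriched_flags_refunds
    model: orders_enriched
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 3, customer_id: 10, amount: -5.0}  # refunds arrive as negative amounts
      - input: ref('stg_customers')
        rows:
          - {customer_id: 10, segment: retail}
    expect:
      rows:
        - {order_id: 3, customer_id: 10, segment: retail, amount: -5.0, is_refund: true}
# 2. Run `dbt test --select orders_enriched` and watch the new test fail.
# 3. Adjust the model's SQL until the test passes, then deploy.
```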
As you can guess, following TDD requires a mindset change. You need to stop thinking of testing as a tedious task you'd rather skip. Instead, treat your tests as a lighthouse that guides you towards data excellence.
Summary
This week, you learned what process most data and analytics engineers follow when working with dbt. You also learned why this process is terrible for your productivity and how it impacts stakeholder relationships.
Then I told you what unit tests are, why they beat the classic dbt testing process, and what components you need to create a unit test.
At the end, I introduced you to the powerful concept of test-driven development. I explained how you can embrace it and iterate on your dbt models faster and with more confidence.
Stick with me, because next week I'll give you the exact blueprint on how to implement unit tests into your dbt project.
Did you enjoy that piece? Follow me on LinkedIn for daily updates.