The Ultimate Guide to Unit Testing With dbt
How to become a DataOps unicorn with test-driven development
Do your stakeholders come to you and tell you there’s something wrong with their reports?
Do you feel anxious when you need to modify the logic of a complex dbt model you haven’t touched in months?
Do you often find that your new model doesn’t meet the requirements after you’ve spent hours working on it?
Believe it or not, most analytics and data engineers would immediately answer “Yes” to these questions. Chances are your answer is “Yes”, too.
Let’s be honest: you have a serious problem!
You work slowly and produce buggy code that doesn’t align with colleagues’ needs. People can’t rely on you to provide quality work and don’t trust you. You are a liability.
But fret not! There’s a well-known process explicitly designed to target these issues. Test-driven development with unit tests is a practice widely used by software engineers but not popular among data professionals.
By implementing unit tests, you will improve communication with your business partners. You will also be more confident and iterate faster on your tasks. You will be among the few world-class data professionals who address issues with their models upfront.
In the last issue of Data Gibberish, you learned what unit tests and TDD are. In this article, we discuss:
Short recap of the previous article;
What components unit tests need;
A step-by-step tutorial on implementing unit tests with dbt;
Improving your DataOps process.
Reading time: 10 minutes
Recap
As usual, I strongly recommend reading my last article. Yet, here's the essence:
Like most dbt community members, you only test data in production. This means you learn about mistakes after you load faulty data into your warehouse. This practice is slow and expensive, but most importantly, it erodes people’s trust in you.
Unit tests are a specific kind of test that doesn’t exist natively in dbt. They ensure that you comply with your requirements and that your models’ behaviour doesn’t break in the future.
You can reverse your process and write tests before you start writing code. This process is called test-driven development, or TDD for short. It builds on unit tests but also helps you clarify specifics and avoid design flaws upfront.
I wrapped up my last article with a promise to show you how you can implement unit tests in your dbt projects. And here we are.
Overview
I want to spread the word about unit tests as much as possible and give you a complete step-by-step plan for implementing unit tests with dbt. This is not yet another theoretical talk; this is a story. It’s precisely what allowed us to move from an ETL to an ELT stack and process in under six months.
But wait, there's more! Random snippets of wisdom and code are not enough here. I added unit tests to the famous Jaffle Shop project from dbt Labs, so you can try everything out directly without copying and pasting the code from the article.
https://github.com/ivanovyordan/jaffle_shop
Here's what you'll do if you want to steal my unit test blueprint today:
Create a CI dbt profile;
Install dbt Utils;
Add Dummy Data;
Add Expectations;
Run the Tests.
Now, let’s go a step further. Consider this issue of Data Gibberish a hands-on workshop on unit testing with dbt.
Let's get this show on the road!
Milestone 1: Create a CI dbt Profile
Most dbt projects I've seen have two targets:
dev for your local development environment;
prod to build your production data and make it accessible to the world.
Effective unit tests require an isolated environment. This isolation means no external factors affect your models' behaviour. They need a lab environment, if you will.
To create that environment, open your profiles.yml file (usually ~/.dbt/profiles.yml) and add a ci target to your project. You can find the sample profiles file in my fork of the Jaffle Shop project, and this is roughly what the CI target looks like.
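A minimal sketch with dbt-duckdb; the database paths below are illustrative, so check the fork for the exact file:

```yaml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: jaffle_shop.duckdb      # local development database
      threads: 4
    ci:
      type: duckdb
      path: jaffle_shop_ci.duckdb   # separate file, so tests never touch dev or prod data
      threads: 4
```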
In this case, I work with a DuckDB database. This means you can run the tests for free using a single file. If you work with a proper warehouse, say Snowflake, you can create a personal testing schema next to your dev schema.
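For reference, a Snowflake ci target could look something like this; every value below is a placeholder, not something copied from my project:

```yaml
    ci:
      type: snowflake
      account: <your_account>
      user: <your_user>
      # plus your usual authentication settings
      role: TRANSFORMER
      database: ANALYTICS
      warehouse: TRANSFORMING
      schema: dbt_ci_<your_name>   # personal testing schema, next to your dev schema
      threads: 4
```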
Milestone 2: Install dbt utils
Before working on your unit tests, you have to install the dbt utils plugin.
dbt utils is a package with some handy macros and tests. For this step, you need the equality test. However, I have never seen a dbt project without this plugin, and I firmly believe it's a must-have no matter what.
To install it, add this line to your packages.yml file. Here's roughly what my Jaffle Shop packages file looks like.
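A minimal packages.yml with dbt_utils; the version range below is an assumption, so pin the release you actually use:

```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]  # assumption: pin the release you actually test against
```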
Once you save that file, open your terminal and run this command in your project's folder:
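dbt's standard dependency command installs everything listed in packages.yml:

```bash
dbt deps
```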
Now, buckle up because this may be pretty challenging.
Milestone 3: Add Dummy Data
You can now start generating your mock input. But this is not just a random subset of your data. You need to be strategic about this piece: collect the requirements and understand what your data means.
As I told you, you should not work with your production data. Instead, you need just a handful of handcrafted examples. Think about every corner case and manually create your version of the input data.
For example, your Jaffle Shop customers can split an order across multiple payments. Here's the dummy data, with order 9 split into two payments.
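An illustrative excerpt, following the raw_payments columns from the original Jaffle Shop seeds; the actual fixture in the repo may use different rows:

```csv
id,order_id,payment_method,amount
1,1,credit_card,1000
2,2,coupon,500
3,3,bank_transfer,1700
4,9,credit_card,1500
5,9,gift_card,700
```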
You need to create a test directory in your data folder (or wherever you store seeds) and save your fixtures there as CSV files. But remember: your seeds must have the same names as your raw tables. And again, you can find the actual files on GitHub.
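One possible layout, assuming the classic Jaffle Shop seed names (adjust the folder names to your own project):

```
data/
└── test/
    ├── raw_customers.csv
    ├── raw_orders.csv
    └── raw_payments.csv
```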
But we are not through yet. There's one problem you need to handle with your mock data.
If you run dbt seed, you will create tables in your data warehouse with the names of those files. You want this data to be available only when you run your tests, not to mess with your production data. To do that, you'll benefit from the ci profile from the previous section.
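One way to keep the fixtures isolated, assuming the ci target from Milestone 1, is to load the seeds only against that target:

```bash
# Load the fixture seeds into the isolated CI environment only
dbt seed --target ci
```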