Unlocking dbt: Everything Data and Analytics Engineers Need to Know
Learn how dbt can transform your data workflows, boost efficiency, and drive business impact. Discover its applications, best practices, and future trends to elevate your data engineering skills.
Hi there,
What is dbt, and what does it have to do with data engineering?
As a regular visitor of the r/dataengineering subreddit, I see this question every few months.
Most online sources will give you a one-line explanation. Unfortunately, this is not enough.
Although its concept is simple, dbt can solve a lot of complex problems. It unlocks short development cycles and good software engineering practices in an area where these never existed.
Now, brace yourself, as you and I will discuss dbt in a rather lengthy article. Today, you will learn what dbt is, why it matters for organisations, some good and bad practices and much more.
Also, my mates and I created something free for you. Read along to learn more.
Reading time: 14 minutes
Introduction to dbt
ETL and ELT
If you know anything about data engineering, you must have heard the term "ETL". That's the process most data engineers use to empower the business to make data-driven decisions.
Now, in the last few years, companies have started shifting towards ELT. This means you load raw data directly into your data lake/warehouse/lakehouse. At some point later, you transform this data within the same system.
Do you want to learn more about the ETL and ELT processes? Check out my miniseries, where I explain the topic extensively:
dbt’s Role in ELT
dbt is a command-line tool that transforms data in your warehouse by compiling and running SQL select statements. It's designed to bring software engineering practices to analytics, introducing version control, testing, and documentation to data transformations.
dbt is not an ELT tool, though. It's responsible only for the final T (transform) step in this process.
Consider this typical scenario: You have raw sales data in your warehouse, but you need to calculate monthly revenue by product category. Traditionally, you might write a complex SQL query or, God forbid, a stored procedure, and I hope you remember to update it when needed. With dbt, you'd create a model (a SQL file) that defines this transformation.
dbt handles the execution, testing, and documentation of this model. This approach leverages the power of modern data warehouses. It allows for more flexible and efficient data transformations.
For example, instead of transforming sales data before loading it into your warehouse, you'd load the raw data first. Then, using dbt, you'd create models to clean the data, calculate metrics, and prepare it for analysis. This approach allows you to easily reprocess historical data if your transformation logic changes.
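To make that concrete, here's what such a model could look like. This is a minimal sketch; the model and column names (stg_orders, stg_products, and so on) are made up for illustration:

```sql
-- models/monthly_revenue_by_category.sql
-- A dbt model is just a SELECT statement; dbt decides how to materialise it.
select
    date_trunc('month', o.order_date) as revenue_month,
    p.category                        as product_category,
    sum(o.amount)                     as monthly_revenue
from {{ ref('stg_orders') }} as o
join {{ ref('stg_products') }} as p
    on o.product_id = p.product_id
group by 1, 2
```

If the transformation logic changes, you edit this one file and rerun dbt, and the historical data is reprocessed with the new rules.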
The Economic and Practical Impact of dbt
Rethinking Resource Allocation in Data Engineering
Traditional data engineering often requires substantial resources:
High-cost experts capable of writing code and managing complex ETL clusters
Significant server costs for maintaining these clusters
Extended development timelines due to the complexity of the systems
dbt offers a paradigm shift in this resource allocation. It streamlines the data transformation process by eliminating the need for separate Spark clusters.
Now, dbt doesn't eliminate these costs entirely; it shifts them to your data warehouse. But you have one in place anyway.
It's worth noting that dbt still requires orchestration. Yet, this is often simpler and less resource-intensive than managing a Spark cluster.
Looking to take your networking and knowledge-sharing to the next level?
My buddies and I felt the same way. That's why we've created an exclusive new Discord community, a space where you can engage in lively discussions about software and data engineering, leadership, and the creator economy.
Join us for Q&A sessions, virtual meetups, and special events, or take charge and host your own activities. This is the perfect opportunity to connect with like-minded individuals and learn from each other.
We are just opening the gates. Be among the first members today, and let's elevate our professional journeys together!
The Skills Factor: Democratizing Data Transformation
One of the most significant advantages of dbt lies in its accessibility. Finding professionals who can manage complex ETL pipelines is challenging and often expensive.
From personal experience, I can tell you that it's far easier to find individuals proficient in SQL than those who can handle ETL systems. Not to mention the difference in wages.
dbt effectively lowers the barrier to entry for data transformation work. It allows organisations to leverage a wider pool of talent, potentially reducing hiring challenges and associated costs.
Beyond Simple SQL: The dbt Advantage
I know what you are thinking: "I could write SQL without a fancy tool. How is dbt better than anything else?" This is a valid question, but it overlooks several critical factors.
First, let's not pretend stored procedures are good. They are evil, and they should be forbidden by law. Stored procedures have all the bad traits a developer can think of:
They are difficult to debug
Deployment can be challenging and error-prone
They lack version control, making it hard to track changes and collaborate effectively
Yes, you could store a set of SQL files in a directory and run them with Python, but dbt offers much more. Let me tell you more about that in the next section.
Core Principles and Mechanisms
Jinja and Macros
dbt allows you to mix SQL with a templating language called Jinja. This turns your plain SQL into UberSQL. With Jinja, you can:
Define variables
Write for loops and if statements
Work with the output from other queries
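Here's a small sketch of what that looks like in practice. The payment_methods list and the stg_payments model are made up for illustration:

```sql
-- Jinja runs before the SQL is sent to the warehouse,
-- so this loop expands into one aggregated column per payment method.
{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{{ "," if not loop.last }}
    {% endfor %}
from {{ ref('stg_payments') }}
group by order_id
```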
But that's not all. You can package your own reusable SQL as dbt macros. So, instead of copying and pasting code, you simply invoke your macro.
I've worked on data engineering projects where all the logic was condensed into a handful of "user-facing" macros. When a data engineer ingests a new source, all they need to do is invoke one of these macros with the name of the source table as an argument. The macro handles all the annoying stuff for them, like type casts and cleaning dirty values.
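As a sketch of the idea, a hypothetical "user-facing" macro could look like this. The macro name, columns, and source are all illustrative, and it assumes the raw source is declared in a sources YAML file:

```sql
-- macros/clean_source.sql (hypothetical)
{% macro clean_source(source_table) %}
    select
        cast(id as integer)             as id,
        nullif(trim(customer_name), '') as customer_name,
        cast(created_at as timestamp)   as created_at
    from {{ source('raw', source_table) }}
    where id is not null
{% endmacro %}
```

A model that ingests a new source then boils down to a single line: {{ clean_source('orders') }}.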
Learn more here.
Models
In dbt, models are the basic building blocks. They are essentially SQL SELECT statements that transform data.
Models can reference other models via the built-in ref macro. That way, you create a dependency graph that dbt uses to determine the execution order.
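For example, a downstream model might build on a staging model like this (model names are illustrative):

```sql
-- models/fct_customer_orders.sql
-- ref('stg_orders') tells dbt this model depends on stg_orders,
-- so dbt builds stg_orders first and points this query at the right schema.
select
    customer_id,
    count(*)    as order_count,
    sum(amount) as total_spent
from {{ ref('stg_orders') }}
group by customer_id
```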
Learn more here.
Materialisation Strategies
dbt offers four main materialisation strategies:
Table: Creates a new table with the transformed data.
View: Creates a view of the transformed data.
Incremental: Updates existing tables with new or changed data.
Ephemeral: Not materialised at all; the model's SQL is inlined as a CTE into the models that reference it.
You can also define your own materialisations if you need something fancier.
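Choosing a strategy is typically a one-line config at the top of the model (or a default in dbt_project.yml). A minimal sketch, with illustrative names:

```sql
-- models/dim_customers.sql
-- Swap 'table' for 'view', 'incremental', or 'ephemeral' to change
-- how dbt materialises this model.
{{ config(materialized='table') }}

select
    customer_id,
    min(order_date) as first_order_date,
    sum(amount)     as lifetime_value
from {{ ref('stg_orders') }}
group by customer_id
```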
Learn more here.
Testing
Testing ensures the integrity of your data transformations and catches issues early in the development process.
dbt's tests can range from simple null checks to complex data quality validations. You can also integrate frameworks like Great Expectations or one of the many data quality services.
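The most common tests live in a YAML file next to your models. A minimal sketch, with illustrative model and column names:

```yaml
# models/schema.yml
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: lifetime_value
        tests:
          - not_null
```

Running dbt test executes each of these checks as a query against your warehouse and fails loudly if any of them returns offending rows.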
Documentation
Built-in documentation features allow you to describe your models and columns. This creates a centralised knowledge repository, crucial for team collaboration and knowledge transfer.
dbt can then generate a static documentation website, making it easy for team members to understand the data models.
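Descriptions live in the same YAML files as your tests, so the documentation stays next to the code it describes. A brief sketch with illustrative names:

```yaml
# models/schema.yml
version: 2

models:
  - name: dim_customers
    description: "One row per customer, with lifetime order metrics."
    columns:
      - name: customer_id
        description: "Unique identifier for the customer."
      - name: lifetime_value
        description: "Sum of all order amounts for this customer."
```

dbt docs generate builds the static site from these descriptions, and dbt docs serve lets you browse it locally.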
Learn more here.
Technical Implementation
dbt is a Python application that reads your project files, compiles them into SQL, and executes this SQL against your data warehouse. The compilation process involves parsing model files, resolving dependencies, and generating optimised SQL for your specific data warehouse.
Note: dbt doesn't really compile your files; it transpiles them. But "compile" is the word dbt Labs decided to use.
Here's a simplified view of what happens when you run dbt:
dbt reads your project files and configurations
It parses your models, macros, and other SQL files
It resolves the dependencies between models, creating a directed acyclic graph (DAG)
It compiles each model, replacing Jinja templating with actual SQL
It executes the compiled SQL against your data warehouse in the correct order
It runs any specified tests
It generates documentation
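To make steps 4 and 5 concrete, here is roughly what the compilation looks like. The database and schema names are illustrative; the compiled SQL ends up in the target/ directory:

```sql
-- What you write (models/fct_customer_orders.sql):
select customer_id, count(*) as order_count
from {{ ref('stg_orders') }}
group by customer_id

-- Roughly what dbt executes after compilation:
select customer_id, count(*) as order_count
from analytics.dbt_prod.stg_orders
group by customer_id
```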
The dbt CLI is the primary interface for running models, testing data, and generating documentation. Common commands include:
dbt run: Executes all models in the project
dbt test: Runs all tests in the project
dbt docs generate: Generates documentation for the project
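In day-to-day work, you rarely run the whole project at once. Node selection lets you narrow things down; a few examples (model names are illustrative):

```bash
dbt run --select dim_customers      # build a single model
dbt run --select stg_orders+        # build a model and everything downstream of it
dbt test --select dim_customers     # run only the tests attached to that model
dbt build                           # run, test, seed, and snapshot in DAG order
```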
Best Practices
Organise your models logically. Use a consistent naming convention and group related models in subdirectories.
For example, you might use the screaming architecture I wrote about a while ago.
This structure makes it easy to understand the purpose of each model and how data flows through your transformations.
Write tests. Start with simple uniqueness and not-null checks, then add more complex data quality tests. A good example is testing that the total revenue in the orders table matches the sum of item prices in the order items table: https://datagibberish.com/p/intro-to-unit-testing-for-dbt
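That revenue reconciliation check can be written as a singular test: a SELECT statement in the tests/ directory that returns the rows violating the rule, so the test passes only when the query comes back empty. A sketch with illustrative table and column names:

```sql
-- tests/assert_order_revenue_matches_items.sql
-- dbt fails this test if the query returns any rows.
select
    o.order_id,
    o.total_revenue,
    sum(i.item_price * i.quantity) as items_total
from {{ ref('orders') }} as o
join {{ ref('order_items') }} as i
    on o.order_id = i.order_id
group by o.order_id, o.total_revenue
having o.total_revenue != sum(i.item_price * i.quantity)
```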
Document your models thoroughly. Include descriptions of what each model does and explanations for important columns.
Use version control effectively. Commit changes regularly and write clear commit messages.
Choose materialisation strategies wisely based on your specific use case and performance requirements.
For example:
Use tables for final fact and dimension tables that you query frequently
Use views for simple transformations or when you always need real-time data
Use incremental models for large fact tables that you append to daily
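To make the incremental case concrete, here's a minimal sketch. The model, columns, and unique key are illustrative:

```sql
-- models/fct_events.sql
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    event_timestamp
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- On incremental runs, only pick up rows newer than what's already in the table.
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```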
Common Pitfalls
Overusing Jinja can make models hard to read and maintain. While Jinja is powerful, complex logic can be difficult to debug. Consider moving complex logic to Python models or macros if it becomes too convoluted.
Ignoring performance considerations. For instance, avoid referencing the same CTE multiple times over large datasets in BigQuery. CTEs are not materialised there, which can lead to repeated computation. Instead, consider materialising that logic as its own model.
Hardcoding values in models instead of using variables or macros.
For example, instead of hardcoding a cutoff date in every model, define it once as a project variable and reference it with the var function. A minimal sketch; the date and variable name are purely illustrative:
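```sql
-- Instead of: a literal cutoff date copied into every model...
select * from {{ ref('stg_orders') }}
where order_date >= '2023-01-01'

-- Use: a single project variable (defined under vars: in dbt_project.yml)
select * from {{ ref('stg_orders') }}
where order_date >= '{{ var("reporting_start_date") }}'
```

Change the variable once, and every model that references it picks up the new value on the next run.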
Real-World Applications
You can use dbt for nearly any data product that can be built with SQL. Here are a few examples I have worked on in the last few years:
Data Cleaning and Source Combination
This is a pretty standard data engineering ELT project. But instead of using Python to do all the annoying jobs, my team and I leveraged SQL.
Think about filtering out invalid input, proper typecasting, and deriving columns from other columns. You may also have a few very similar sources that you need to combine so they look like one.
All of this is a breeze with dbt.
Business Intelligence Support
Nothing fancy here. We used dbt to define all dimensions and facts in SQL. No more Python or clicking in the BI tool.
So, we can track how objects like Account, Contract, and Lead evolve over time. All of that with SELECT and JOIN clauses.
But that's not all. You can also define key metrics like Monthly Recurring Revenue (MRR), Customer Acquisition Cost (CAC), and Customer Lifetime Value (LTV).
Operational Data Needs
Supporting the operational needs of a production team was one of the most challenging dbt projects I've worked on.
The team worked with loads of time series data. We needed to transform this data every 15 minutes and display the dashboards on some big screen in their office.
Unlike most other projects, we couldn't afford to refresh the data less often. We needed to find a way to run the transformations quickly without spending too much on cloud resources.
ML Feature Store
The last example I want to tell you about is something I did for a machine learning team. This team used Spark to build their data sets from S3. The process was slow and painful.
To speed things up, we developed a process to store the features in a PostgreSQL database. The biggest benefit was that instead of rebuilding the features on every run, we could do it only when we wanted to.
Many Other Possibilities
As you can see, dbt is not just for analytical workloads. You can use it for all sorts of use cases, from data cleaning and business intelligence to operational data needs and ML feature stores. Its versatility shines through in various scenarios.
dbt can handle complex transformations while maintaining code quality and documentation. All of this makes it a powerful tool for data teams across industries.
Comparison with Alternatives
At this point, dbt is the standard transformation layer for ELT. Yet, it's important to consider alternatives. Here are the two I have the most experience with:
Stored procedures:
Pros: Faster execution, closer to the data.
Cons: Lack of version control and testing capabilities.
Use case: No actual use case. Avoid at all costs.
Apache Spark:
Pros: Powerful for large-scale data processing, supports multiple languages.
Cons: More complex to set up and maintain. Can be overkill for simpler transformations.
Use case: Better suited for big data processing, especially when dealing with unstructured or semi-structured data.
SQLMesh and SDF are two other modern alternatives to dbt. Although I don't have real-life experience with either, SDF seems more promising to me.
dbt's strengths lie in its large, active community, extensive ecosystem of packages and integrations, and its balance of power and simplicity. The dbt package hub, for instance, offers pre-built models for common data sources, speeding up development time.
However, dbt may not be the best fit for:
Extremely large-scale data processing where Spark might be more appropriate
Teams that strongly prefer visual tools over code-based solutions
Real-time or streaming data scenarios, where tools like Apache Flink might be more suitable
When choosing between dbt and alternatives, consider factors like:
The scale and complexity of your data
Your team's technical skills and preferences
Integration requirements with your existing data stack
The level of flexibility and customisation you need in your transformations
Did you find something helpful here? Be an ambassador. Like and share my latest LinkedIn post. Help your network learn more about the topic.
Final Thoughts
dbt is one of my favourite tools in the data engineering space. It allows you to quickly onboard new team members and move fast with your data products.
dbt has already won over startups and growth-stage companies. It needs a few more features to attract enterprise customers, but I can see it getting there.
Just a few years ago, I had a chat with a data consultancy owner. They were pretty sceptical about dbt.
"It might work in the US, but companies in Europe are more conservative. They will never adopt it."
Nowadays, they are one of the biggest dbt proponents in Europe.
I am not trying to say they were wrong, and I was right. My point is that things are changing fast, and even conservative businesses are seeing the value in fast and easy tools.
As data needs continue to grow in complexity, tools like dbt will play an increasingly crucial role. The ability to manage complex data transformations with version control, testing, and documentation is becoming not just a nice to have but a necessity for many data teams.
Summary
dbt has transformed how organisations approach data transformation, bringing software engineering best practices to the world of analytics. Its emphasis on modularity, testing, and documentation has enabled teams to work more efficiently and produce quality data products.
Understanding and leveraging dbt can significantly enhance your data workflows. It can help you move from ad-hoc, difficult-to-maintain SQL scripts to a robust data transformation pipeline.
The key is to evaluate your specific needs and use cases. dbt might be the right tool for you, or you might find that an alternative better suits your requirements. Either way, the principles of modular, testable, and well-documented data transformations that dbt embodies are likely to shape the future of data engineering and analytics.
But remember, the tool itself is just one part of the equation. Successful implementation also requires:
A clear data strategy aligned with business objectives
Investment in team training and skills development
A culture that values data quality and documentation
Continuous evaluation and optimisation of your data processes
By embracing these principles and leveraging tools like dbt, you can build a more robust, efficient, and valuable data infrastructure for your organisation.
Until next time,
Yordan
Picks of the Week
Pinterest moved from Apache Druid to StarRocks. Learn why and how. (link)
I am trying to up my Kafka game, and this latest article came out just in time. (link)
Marketing attribution is among the most significant challenges in data. I've even heard people say you shouldn't do it as it will be wrong anyway. The author of this next piece doesn't share that opinion and has a great approach to solving the problem. (link)
How Am I Doing?
I love hearing from you. How am I doing with Data Gibberish? Is there anything you'd like to see more or less of? Which aspects of the newsletter do you enjoy the most?
Use the links below, or even better, hit reply and say “hello”. Be honest!
"Stored Procedures - Use case: No actual use case. Avoid at all costs." was my favorite part 🤣 amazing summary Yordan!
A quality article, as always. Thank you for the shoutout!