Testing Facepalms: 5 Common Mistakes We've All Made (and How to Avoid Them)
Testing doesn't have to be trial and error. Get ahead of the most frequent pitfalls with this step-by-step guide to avoiding mistakes and improving your workflow.
Greetings, curious reader,
Testing is a vital part of building reliable data platforms. But for many beginner data engineers, it can feel overwhelming. I get it—testing is not exactly the most exciting part of the job. And if you don't approach it correctly, it can become a roadblock, slowing development or adding to cloud costs. But here's the thing—testing doesn't have to be painful, and skipping it can lead to bigger headaches down the line.
Testing mistakes can range from writing too many unnecessary tests to not having a clear strategy for testing data pipelines. You and I will cover five common mistakes I've seen—and made myself. I'll also share some practical tips on how you can avoid those mistakes.
We'll dive into topics like test coverage, business impact, testing strategies for different pipeline stages, and the importance of keeping your testing environment simple.
Reading time: 10 minutes
Mistake #1: Testing Too Much
Let's start with one of the most common beginner mistakes: trying to test everything. It's easy to think that more tests mean better coverage. But here's the problem—adding too many tests can quickly become counterproductive.
When you test the same code over and over, you create unnecessary friction in your workflow. Each test adds time to your build process, and before long, you're dealing with painfully slow testing cycles. You'll find yourself waiting 20 minutes to push a small change, and the frustration can lead to skipping testing altogether.
Here's where things get even worse: cloud costs. Running excessive tests can add significantly to your cloud bill if you're using tools like AWS Glue or Snowflake for your data pipelines. These tools charge based on compute resources, and testing complex workflows will eat up your budget fast.
I've seen teams incur thousands of dollars in additional cloud costs simply because they didn't optimise their testing strategy. When you factor in the number of pull requests and growing teams, testing can become a hidden cost that grows over time.
So, how do you avoid this mistake? Focus your testing on areas that matter. Instead of testing every function or transformation, identify critical points in your pipeline that have the biggest impact on the business or are most likely to break.
For instance, focus on testing edge cases, business-critical transformations, and data quality checks that align with SLAs. Prioritise high-impact tests rather than spreading yourself thin with too many low-value tests. This keeps testing cycles shorter and ensures you're not throwing money into the cloud for redundant checks.
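One practical way to do this in dbt is to tag your highest-value tests and run only those on every pull request, saving the full suite for a scheduled run. A minimal sketch—the model, column, and tag names here are made up for illustration:

```yaml
# models/schema.yml (illustrative names)
models:
  - name: fct_payments
    columns:
      - name: payment_amount
        tests:
          - not_null:
              tags: ["critical"]   # business-critical check, run on every PR
```

With tests tagged like this, `dbt test --select tag:critical` runs only the high-impact checks in CI, which keeps feedback fast and compute costs down.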
I shared a while ago how we lowered our Snowflake costs by more than 50%. A big part of this was optimising our testing processes.
Mistake #2: Mixing Low-Level and High-Level Logic
Another common mistake beginners make is confusing low-level and high-level logic when writing tests. In the world of data engineering, you need to understand what you're testing at each level of your pipeline. For example, if you're using a tool like dbt, you should know the difference between testing your low-level macros and your higher-level models.
Here's why this is important. Macros in dbt handle low-level reusable transformations like filtering data, applying conditions, or running calculations. You want to write unit tests to ensure these transformations work as expected.
On the other hand, your dbt models are often high-level business logic that transforms raw data into something meaningful. For these, you want to focus on integration tests that check how different parts of your data pipeline come together to deliver business insights.
This ties directly into the Testing Pyramid. The pyramid illustrates that you should have more unit tests at the base (low-level code) and fewer integration or system tests at the top (high-level workflows).
Unit tests should verify that each component works correctly in isolation, while integration tests should validate how everything works together. Beginners often mix these up, writing overly complicated tests that try to do everything at once, or they skip low-level testing, leaving potential issues undetected until later stages.
To fix this:
Follow the pyramid.
Use dbt's built-in unit tests for your macros and basic transformations.
Use integration tests to ensure your models produce the correct outputs and align with your business rules.
This way, you keep your tests simple, focused, and effective at catching errors before they make it into production.
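To make this concrete: dbt's built-in unit tests (available since dbt 1.8) are defined in YAML against a model, so the usual way to cover a macro is to unit-test a model that calls it. Here's a sketch with hypothetical model, macro, and column names:

```yaml
# models/schema.yml — unit test for a model that applies a
# hypothetical filter_active() macro (dbt >= 1.8 syntax)
unit_tests:
  - name: keeps_only_active_users
    model: stg_users
    given:
      - input: ref('raw_users')
        rows:
          - {user_id: 1, status: "active"}
          - {user_id: 2, status: "deleted"}
    expect:
      rows:
          - {user_id: 1, status: "active"}
```

The `given` block provides static input rows, so the test runs quickly and deterministically without touching real source data—exactly the low-level, isolated check the base of the pyramid calls for.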
Mistake #3: Ignoring Data Relevance
A common issue I see with beginner data engineers is not understanding the business impact of their data pipelines. You might be tempted to test all data equally, but here's the truth—not all data is equally important.
Some data pipelines handle mission-critical information like financial transactions, while others process less critical datasets, like internal reports or temporary logs. Failing to differentiate between these can lead to wasted effort, excessive alerts, or, even worse, missed errors in vital areas.
Why is this a mistake? If you test every pipeline with the same severity level, you'll likely end up with bloated test suites that slow you down and cause test fatigue. Not to mention the noise these tests can cause.
Instead, you need to define different levels of severity for your tests. For example, data pipelines that handle critical financial information should trigger a critical error and stop the process if they fail. On the other hand, less important pipelines—like those dealing with internal reports—can just issue warnings if something goes wrong, giving you time to fix it without impacting the business.
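dbt supports this idea directly through the `severity` config on tests. A sketch with illustrative model names:

```yaml
# models/schema.yml (illustrative names)
models:
  - name: fct_financial_transactions
    columns:
      - name: transaction_id
        tests:
          - unique:
              config:
                severity: error   # fail the run and stop the pipeline
  - name: rpt_internal_usage
    columns:
      - name: page_views
        tests:
          - not_null:
              config:
                severity: warn    # log a warning, let the run continue
```

The financial model hard-fails on a duplicate ID, while the internal report only warns—so you hear about both, but only one blocks the business.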
One reason this happens is that data engineers often don't fully understand the downstream impact of the data they're working with. Collaborating with your business stakeholders to learn what's most important is essential.
You should know why each data pipeline exists and how its outputs are used. Armed with this knowledge, you can set the right priorities for your tests and ensure that mission-critical data is always protected.
Mistake #4: Overcomplicating the Test Environment
When it comes to testing, simpler is often better. One mistake I see all the time is teams over-engineering their test environments. They set up multiple types of tests for each stage of their pipeline—unit tests, integration tests, data quality checks, end-to-end tests, and more.
While thorough coverage is important, having too many test types can slow things down and create unnecessary complexity. Engineers often spend more time maintaining tests than actually delivering value.
A complex testing environment can lead to confusion and frustration. Engineers can get overwhelmed by the number of tests they need to write and maintain, leading them to either skip tests altogether or focus on tests that don't actually catch important issues. Also, a complex environment can make it harder to reproduce issues, leading to long debugging sessions and wasted effort.
Instead of trying to cover every possible angle with multiple test types at every stage, focus on creating a lean, effective test suite. Focus on a few high-value tests that cover the most critical areas, like data quality checks at the source, integration tests for your connectors, and end-to-end tests for your business metrics.
By keeping your testing environment simple and focused, you can spend less time writing and maintaining tests and more time delivering real value to your organisation.
Mistake #5: Not Having a Clear Testing Strategy
A common beginner mistake is testing for the sake of it. Data engineers often set up a few tests to check the box without thinking about what they want to achieve. This leads to tests that don't provide real value and often miss critical issues.
If you don't have a clear strategy for your testing efforts, you'll end up with scattered, hard-to-maintain tests that are ultimately ineffective.
Here's how you can avoid this. First, take a step back and think about your entire data pipeline. Break it down into stages and identify what's important to test at each one. For example:
At the source, focus on testing data quality. Ensure your sources produce quality and accurate data. Use tools like data contracts to enforce rules about the data you expect to receive.
At the integration layer, test whether your connectors can reliably pull data from sources and push it to your destinations. These tests should focus on connectivity and data flow, making sure the right data is getting to the right place.
At the transformation layer, focus on business rules. Test that your models and metrics align with the definitions you've agreed upon with business stakeholders. Make sure that data outputs meet expectations and deliver correct insights.
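The source-level checks above can be sketched with dbt model contracts, which enforce an agreed schema before a model builds. The model and column names here are assumptions for illustration:

```yaml
# models/schema.yml — enforce a contract on a staging model
models:
  - name: stg_orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: int
        constraints:
          - type: not_null
      - name: order_total
        data_type: numeric
```

With `enforced: true`, dbt fails the build if the model's output doesn't match the declared column names and types—catching schema drift at the boundary instead of downstream in a dashboard.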
It's also a good idea to set a target for test coverage and incorporate testing into your development process. Don't treat it as an afterthought. By doing this, you ensure that your testing is focused and meaningful rather than a collection of random tests that don't actually help you prevent errors.
If you enjoyed the newsletter so far, please show some love on LinkedIn or forward to your friends. It really does help!
Final Thoughts
Testing has come a long way in the past few years, especially in data engineering. With tools like dbt now supporting unit tests and the rise of data contracts, we're seeing an evolution in how teams approach testing in data pipelines. Testing tools are becoming more integrated, and there are more options than ever to test early in your pipeline before issues reach production.
However, the data space remains challenging. Unlike traditional software engineering, where testing tools are mature, data engineering still lacks robust end-to-end testing capabilities. Many teams still rely on testing after the data has already landed in its destination, which increases the chance of errors slipping through. What we need are testing frameworks that cover the entire data pipeline, giving engineers visibility into potential issues before they happen.
Looking forward, AI has the potential to revolutionise testing strategies. AI can help generate synthetic data for tests, reducing the need to rely on production data. It can also assist in building smart tests that adapt as your pipeline grows, offering better test coverage without the need for manually writing and maintaining every test case.
As testing evolves, I'm confident that data teams will have better tools at their disposal to catch errors early, reduce cloud costs, and deliver more reliable pipelines.
Summary
Testing is crucial to any data engineering workflow, but beginners often fall into common traps that slow them down. Let's quickly recap the critical mistakes to avoid:
Testing too much: Excessive tests increase cloud costs and slow down your process.
Mixing low- and high-level logic: Focus on testing low-level code with unit tests and high-level workflows with integration tests.
Ignoring data relevance: Not all data needs the same level of testing. Prioritise critical datasets and business-impacting data.
Overcomplicating the test environment: Keep your tests lean and focused on key areas to avoid spending more time writing tests than delivering value.
Lack of a testing strategy: Plan your testing approach at each stage of the pipeline and set clear goals for test coverage.
Now it's time for action. Review your current testing practices. Are you making any of these mistakes? Start by simplifying your test environment and focusing on high-impact areas. Make sure your tests align with business goals, and don't be afraid to trim down unnecessary tests.
Until next time,
Yordan
Be an Ambassador
Did you know? I wrote an extensive Snowflake learning guide. And you can have this for free!
You only need to share Data Gibberish with 5 friends or coworkers and ask them to subscribe for free. As a bonus, you will also get 3 months of Data Gibberish Pro subscription.
Join the Community
I believe a weekly newsletter alone isn't enough. A group of leaders and I launched an exclusive Discord community where you can dive into vibrant discussions on software and data engineering, leadership, and the creator economy.
Join today, and let's supercharge our professional journeys together!
How Am I Doing?
I love hearing from you. How am I doing with Data Gibberish? Is there anything you'd like to see more or less of? Which aspects of the newsletter do you enjoy the most?
Use the links below, or even better, hit reply and say “Hello”. Be honest!