Graph Data Structures for Data Engineers Who Never Took CS101

Your gentle intro to graphs through real examples from data pipelines, DAGs, and lineage — designed for engineers who never learned this in school.

Apr 23, 2025

Greetings, Data Engineer,

You and I build data pipelines daily. They power models, dashboards, and forecasts. But when someone asks, "What even is a DAG?" most data engineers freeze.

The web’s full of overcomplicated Computer Science (CS) explanations that never connect to our tools — dbt, Airflow, lineage tracking. You either get a deep dive into graph theory or simplified diagrams with zero practical meaning.

This piece changes that.

You'll finally understand graphs — and more importantly, how your tools already use them. You’ll get comfortable with key computer science terms like nodes, edges, topological sorting, and directed acyclic graphs — all through the lens of real data pipelines.

You’ll walk away able to explain how data flows through DAGs, how transformations are ordered, and why it all matters.

Let’s unpack the graph that runs your stack.

But before that, I have a massive announcement.

📣 Announcing: The Data Leader’s Influence System

Tired of being looped in after decisions are made? Want to shape direction?

My course helps data professionals like you:

🎯 Lead with clarity and confidence

🧠 Say "no" (without friction)

🎟️ Get invited early to strategic conversations

Pre-order now and save 40%. Grab it for only $58 until Saturday.

You’ll also get two massive bonuses as a thank-you for being a founding member.

👉 Click here to pre-order and join early

PS: If you are already a subscriber, you will receive 3 emails in the next 3 days to tell you more about the course.

If this is not for you, unsubscribe. This won’t affect your Data Gibberish subscription.

🔍 What Is a Graph (and Why Should You Care)?

In computer science, a graph is a mathematical structure made up of two sets:

A set of vertices (also called nodes)
A set of edges (connections between vertices)

Graphs can be either directed (where edges have a direction, like a one-way street) or undirected (where connections go both ways).
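
Here is a minimal sketch of those two sets in Python. The model names are made up for illustration. A directed edge is an ordered pair, while an undirected edge is an unordered pair, so the order you write it in stops mattering.

```python
# Vertices: a plain set of names (hypothetical model names).
vertices = {"raw_orders", "stg_orders", "fct_orders"}

# Directed edges: ordered pairs (upstream, downstream).
directed_edges = {
    ("raw_orders", "stg_orders"),
    ("stg_orders", "fct_orders"),
}

# Undirected edges: unordered pairs, so frozenset({a, b}) == frozenset({b, a}).
undirected_edges = {frozenset(edge) for edge in directed_edges}

# Direction matters in a directed graph...
has_forward = ("raw_orders", "stg_orders") in directed_edges
has_backward = ("stg_orders", "raw_orders") in directed_edges

# ...but not in an undirected one.
has_either_way = frozenset(("stg_orders", "raw_orders")) in undirected_edges

print(has_forward, has_backward, has_either_way)  # True False True
```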

Now, in data engineering, directed graphs are the default.

And when we say DAG — a Directed Acyclic Graph — we mean:

Directed: Each edge has a clear direction
Acyclic: There are no cycles; you can’t go in circles
Graph: A set of nodes and edges

This structure allows us to model execution order.

dbt uses DAGs to decide in what order to run models. Airflow uses DAGs to determine task execution paths. Even lineage tools use DAGs to track how data flows.

This is not abstract CS anymore. It’s your pipeline. Visualised. Controlled.

🧩 Key Concepts You Need to Understand

Let’s map core graph theory ideas to your daily data work.

🧱 Nodes (Vertices)

In Computer Science terms, a vertex is a single point in a graph.

In data engineering, it's an operational unit — a model, a task, a table. In dbt, for example:

my_model.sql is a vertex
A source like raw_customers is also a vertex
So is a test, snapshot, or seed

Each vertex holds attributes — metadata that describes what it does:

name, alias, materialisation
description, tags
Most importantly: its dependencies

These define how this node connects to others.
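
To make that concrete, here is a rough sketch of a vertex with attributes. The Node class and its field names are my own illustration, not dbt's internal representation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One vertex in a pipeline graph (illustrative, not dbt's own class)."""
    name: str
    resource_type: str = "model"  # e.g. model, source, seed, test
    tags: list[str] = field(default_factory=list)
    depends_on: list[str] = field(default_factory=list)  # names of upstream nodes

raw_customers = Node(name="raw_customers", resource_type="source")
my_model = Node(name="my_model", tags=["daily"], depends_on=["raw_customers"])

# Each dependency is one incoming edge: (upstream, downstream).
edges = [(upstream, my_model.name) for upstream in my_model.depends_on]
print(edges)  # [('raw_customers', 'my_model')]
```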

🔗 Edges

An edge is a link between two nodes.

In a directed graph, an edge goes from one node to another. It’s written as an ordered pair (A, B), usually drawn as A → B, and it means "A must come before B".

In dbt, this edge exists when:

model_b references model_a using ref('model_a')
Therefore, model_a must run before model_b

These edges define the partial order — a CS term for "some things have to happen before others, but not everything has a strict order."
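
A small sketch of what "partial order" means in practice. The model names and the respects_edges helper are hypothetical. Two different run orders can both be valid, as long as every edge points forward.

```python
from collections import defaultdict

# Edges as (upstream, downstream) pairs; model names are hypothetical.
edges = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_d"),
    ("model_c", "model_d"),
]

# Adjacency list: each node maps to the nodes that depend on it.
downstream = defaultdict(list)
for upstream, dependent in edges:
    downstream[upstream].append(dependent)

def respects_edges(order, edges):
    """Return True if every upstream node appears before its dependent."""
    position = {node: i for i, node in enumerate(order)}
    return all(position[u] < position[v] for u, v in edges)

# model_b and model_c are not ordered relative to each other,
# so both of these run orders are valid. That is the "partial" part.
order_1 = ["model_a", "model_b", "model_c", "model_d"]
order_2 = ["model_a", "model_c", "model_b", "model_d"]
invalid = ["model_b", "model_a", "model_c", "model_d"]

print(respects_edges(order_1, edges))  # True
print(respects_edges(order_2, edges))  # True
print(respects_edges(invalid, edges))  # False
```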

🚫 Acyclic = No Loops

A cycle in graph theory is a path that starts and ends at the same node.

Imagine model_a → model_b → model_a. That’s a cycle. It creates a loop where nothing knows where to begin.

DAGs don’t allow that. They're acyclic: no path leads from a node back to itself.

This guarantees that a valid execution order exists. That’s why tools like dbt and Airflow require acyclic structures: only then can the engine calculate a topological sort, an ordering of the tasks in which every task comes after everything it depends on.
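
If you are curious what a topological sort looks like in code, here is a minimal sketch of Kahn's algorithm, one textbook way to compute it. I am not claiming this is what dbt or Airflow run internally, and the model names are invented.

```python
from collections import deque

def topological_sort(edges):
    """Kahn's algorithm. edges is a list of (upstream, downstream) pairs."""
    nodes = {n for edge in edges for n in edge}
    in_degree = {n: 0 for n in nodes}
    downstream = {n: [] for n in nodes}
    for u, v in edges:
        downstream[u].append(v)
        in_degree[v] += 1

    # Start with nodes that have no incoming edges (sorted for a stable result).
    ready = deque(sorted(n for n in nodes if in_degree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in downstream[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:  # all of its upstream nodes are done
                ready.append(nxt)

    # Nodes stuck on a cycle never reach in-degree 0.
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle, no valid run order exists")
    return order

edges = [
    ("raw_orders", "stg_orders"),
    ("raw_customers", "stg_customers"),
    ("stg_orders", "fct_orders"),
    ("stg_customers", "fct_orders"),
]
run_order = topological_sort(edges)
print(run_order)
```

Feed it model_a → model_b → model_a and it raises an error instead of returning an order, which is the whole point of the acyclic rule.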

🧬 Lineage = Traversing the Graph

Lineage tools walk through the graph using graph traversal algorithms.

They might do a depth-first search (DFS) to follow each path downstream from a node as far as it goes, or a breadth-first search (BFS) to visit the immediate neighbours first and then move outward one level at a time.

But from your perspective, lineage just means: "Where did this data come from, and where does it go?"
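
Here is a rough sketch of both traversals over a tiny, made-up lineage graph. The function names are mine. Real lineage tools add a lot on top (column-level detail, metadata, storage), but the walk itself looks roughly like this.

```python
from collections import deque

# Adjacency list: node -> direct downstream nodes (hypothetical lineage).
downstream = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_orders", "orders_audit"],
    "fct_orders": ["revenue_dashboard"],
    "orders_audit": [],
    "revenue_dashboard": [],
}

def dfs(graph, start):
    """Depth-first: follow one path to its end before backtracking."""
    visited, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.append(node)
        # Reverse so children are visited in their listed order.
        stack.extend(reversed(graph.get(node, [])))
    return visited

def bfs(graph, start):
    """Breadth-first: visit the nearest nodes first, one level at a time."""
    visited, queue = [start], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.append(nxt)
                queue.append(nxt)
    return visited

def invert(graph):
    """Flip every edge so the same walk answers 'where did this come from?'."""
    upstream = {node: [] for node in graph}
    for node, children in graph.items():
        for child in children:
            upstream[child].append(node)
    return upstream

print(dfs(downstream, "raw_orders"))
print(bfs(downstream, "raw_orders"))
print(bfs(invert(downstream), "revenue_dashboard"))
```

The first two calls answer "where does this data go?". The last one flips the edges to answer "where did it come from?".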

Full disclosure: one of my team members worked on a lineage task last month. Explaining this to them made the difference between being stuck and succeeding.

The graph structure makes that possible — and beautifully clear.

🛠️ Step-by-Step: How to Read a DAG in dbt

Open up the dbt DAG view. Here's how to read it through your new CS lens.

🔍 Step 1: Spot the Nodes

Each box is a vertex.

You’ll see models (.sql files), sources (like raw Snowflake tables), seeds (CSV inputs), and tests.

All are part of the graph’s vertex set.

🔀 Step 2: Follow the Arrows

Each arrow is a directed edge.

It connects two nodes and shows the dependency direction. This means one model must run before the next.

From a graph theory view, the arrow is a directed edge (u, v), where u is upstream and v is downstream.

🧭 Step 3: Trace the Flow

Start at source nodes — they have no incoming edges.

Follow arrows outward. You’re performing a graph traversal, often from left to right.

This shows how data flows and where transformations happen.

🧵 Step 4: Think in Layers

Mentally group the DAG into layers:

Root nodes: nodes with no upstream dependencies (no incoming edges), the starting points
Intermediate layers: cleaning, enrichment, aggregation
Terminal nodes: final outputs or dashboards — no downstream edges

This structure aligns with how we architect pipelines — from raw input to business-ready outputs.
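
You can compute these layers too. The sketch below assumes the graph is acyclic and uses invented model names. It gives each node a depth: 0 for starting points, otherwise one more than its deepest upstream node.

```python
def layers(depends_on):
    """Group nodes by depth. Assumes the graph has no cycles."""
    depth = {}

    def depth_of(node):
        if node not in depth:
            parents = depends_on.get(node, [])
            depth[node] = 0 if not parents else 1 + max(depth_of(p) for p in parents)
        return depth[node]

    grouped = {}
    for node in depends_on:
        grouped.setdefault(depth_of(node), []).append(node)
    return [sorted(grouped[d]) for d in sorted(grouped)]

# node -> nodes it depends on (hypothetical project).
depends_on = {
    "raw_orders": [],
    "raw_customers": [],
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
    "fct_orders": ["stg_orders", "stg_customers"],
}

print(layers(depends_on))

# Terminal nodes: nothing else depends on them.
has_dependents = {p for parents in depends_on.values() for p in parents}
terminal = sorted(set(depends_on) - has_dependents)
print(terminal)  # ['fct_orders']
```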

⚠️ Common Mistakes to Avoid

Let’s bust a few CS-related myths that sneak into data projects.

❌ Misunderstanding Execution Order

The DAG is not just a visual. It determines the topological order the engine follows when it executes your models: nothing runs until everything upstream of it has finished.

Break the graph? You break the build.

❌ Confusing Ref with SQL

Hardcoding a select * from my_model doesn’t create an edge in the DAG.

Only ref('my_model') does — it creates an actual edge in the graph. Without it, the tool can’t know the dependency.
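
A toy illustration of why that is. The regex below is my own simplification and only catches the basic {{ ref('name') }} form. dbt's real parser renders the Jinja and handles more cases, so treat this as a sketch of the idea, not of dbt's implementation.

```python
import re

# Simplified pattern for {{ ref('model_name') }} (an assumption, not dbt's parser).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

def find_edges(model_name, sql):
    """Return (upstream, downstream) edges implied by ref() calls in the SQL."""
    return [(upstream, model_name) for upstream in REF_PATTERN.findall(sql)]

with_ref = "select * from {{ ref('my_model') }}"
hardcoded = "select * from analytics.my_model"

print(find_edges("report", with_ref))   # [('my_model', 'report')]
print(find_edges("report", hardcoded))  # []
```

The hardcoded query reads the same table, but it leaves no trace the tool can turn into an edge.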

❌ Accidentally Creating Cycles

This one’s more common than you’d think.

Model A joins B, and someone edits B to now join A. Suddenly — a cycle.

Your build fails because a topological sort becomes impossible. The graph is no longer acyclic.
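
You can reproduce the failure with Python's standard library. This sketch uses graphlib (Python 3.9+) rather than anything from dbt, and the two model names are placeholders.

```python
from graphlib import CycleError, TopologicalSorter

# node -> nodes it depends on. model_a and model_b now depend on each other.
depends_on = {
    "model_a": {"model_b"},
    "model_b": {"model_a"},
}

try:
    order = list(TopologicalSorter(depends_on).static_order())
    message = f"run order: {order}"
except CycleError as err:
    # The second element of err.args is the list of nodes forming the cycle.
    cycle = err.args[1]
    message = "cycle detected: " + " -> ".join(cycle)

print(message)
```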

Circular dependency in a directed acyclic graph

🧠 Advanced Tips (Optional)

If you want to lean a little further into graph theory and apply it to data workflows:

🪜 Build DAGs Programmatically

Use adjacency lists to define custom DAGs in Python: map each node to the list of nodes it points to. Airflow exposes the same idea through an operator when you define task dependencies: task1 >> task2 creates a directed edge from task1 to task2.

You’re literally building the edge set manually.
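
To show there is no magic involved, here is a toy imitation of the >> syntax. The Task class below is a stand-in I wrote for illustration. It is not Airflow's operator class and it does not need Airflow installed.

```python
edges = []  # the edge set: (upstream_id, downstream_id)

class Task:
    """Toy stand-in for an orchestrator task (not Airflow's class)."""

    def __init__(self, task_id):
        self.task_id = task_id

    def __rshift__(self, other):
        # task1 >> task2 records a directed edge and returns task2 for chaining.
        edges.append((self.task_id, other.task_id))
        return other

extract, transform, load = Task("extract"), Task("transform"), Task("load")
extract >> transform >> load

# Turn the edge list into an adjacency list.
adjacency = {}
for upstream, downstream in edges:
    adjacency.setdefault(upstream, []).append(downstream)
    adjacency.setdefault(downstream, [])

print(adjacency)  # {'extract': ['transform'], 'transform': ['load'], 'load': []}
```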

🧪 Test with DAG Subsets

Sometimes you don’t need to run the whole graph.

Graph theory lets you isolate a subgraph: just the models affected by a change. Commands like dbt build --select model+ use this principle: the trailing + selects the model and everything downstream of it.
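
A sketch of the idea behind that kind of selection: collect a node plus everything downstream of it, then order only that smaller graph. The helper name and models are hypothetical, and dbt's own selector does far more than this.

```python
from graphlib import TopologicalSorter

# node -> nodes it depends on (hypothetical project).
depends_on = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "fct_orders": {"stg_orders", "stg_customers"},
    "revenue_report": {"fct_orders"},
}

def select_with_descendants(depends_on, start):
    """Return start plus every node reachable downstream of it."""
    selected, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for node, parents in depends_on.items():
            if current in parents and node not in selected:
                selected.add(node)
                frontier.append(node)
    return selected

selected = select_with_descendants(depends_on, "stg_orders")

# Keep only the edges inside the selection, then order that smaller graph.
subgraph = {n: depends_on.get(n, set()) & selected for n in selected}
run_order = list(TopologicalSorter(subgraph).static_order())
print(run_order)  # ['stg_orders', 'fct_orders', 'revenue_report']
```

Note that stg_customers is left out: it feeds fct_orders, but it is not downstream of the model that changed.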

📐 Think in Graph Terms

When designing a pipeline, sketch a graph:

Nodes = tasks or models
Edges = “depends-on” relationships
Ask: Is it acyclic? Is it connected? Are there isolated components?
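
The "is it connected?" question can be checked mechanically. Here is a minimal sketch that ignores edge direction and groups nodes into components. The node names are invented, and legacy_export is there to play the forgotten, isolated model.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes that are linked when edge direction is ignored."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

nodes = {"raw_orders", "stg_orders", "fct_orders", "legacy_export"}
edges = [("raw_orders", "stg_orders"), ("stg_orders", "fct_orders")]

components = weakly_connected_components(nodes, edges)
is_connected = len(components) == 1
isolated = [c[0] for c in components if len(c) == 1]

print(components)
print(is_connected, isolated)  # False ['legacy_export']
```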

This mindset leads to cleaner architectures — and fewer headaches.

🧠 A Cheat Sheet for Data Engineers

This was a lot of info. Here’s a short summary for you.

Keep this in your mental clipboard:

Node = Task or Model: If it runs, builds, or transforms — it’s a node.
Edge = Dependency: ref('other_model') draws an arrow in your DAG.
Direction = Run Order: Edges aren’t just lines — they define what runs when.
Acyclic = Safe Execution: No loops. No infinite recursion. No weird bugs.
Topological Sort = Build Plan: This is how dbt figures out “what runs next.”
Lineage = Traversal: Graphs make it easy to trace how raw data becomes insights.

💭 Final Thoughts

Graphs are not abstract math. They’re how your pipelines run.

You don’t need to memorise algorithms or study CS to understand DAGs. But knowing the core ideas — nodes, edges, direction, cycles, order — gives you power.

You see errors before they happen. You design cleaner models. You debug faster.

And now, you speak the language your tools were built on.

From now on, when someone pulls up a DAG, you won’t just see boxes and lines. You’ll see structure. Logic. Flow.

That’s what graph thinking gives you.

Cheers,

Yordan

Data Gibberish