Graph Data Structures for Data Engineers Who Never Took CS101
Your gentle intro to graphs through real examples from data pipelines, DAGs, and lineage — designed for engineers who never learned this in school.
Greetings, Data Engineer,
You and I build data pipelines daily. They power models, dashboards, and forecasts. But when someone asks, "What even is a DAG?" most data engineers freeze.
The web’s full of overcomplicated Computer Science (CS) explanations that never connect to our tools — dbt, Airflow, lineage tracking. You either get a deep dive into graph theory or simplified diagrams with zero practical meaning.
This piece changes that.
You'll finally understand graphs — and more importantly, how your tools already use them. You’ll get comfortable with key computer science terms like nodes, edges, topological sorting, and directed acyclic graphs — all through the lens of real data pipelines.
You’ll walk away able to explain how data flows through DAGs, how transformations are ordered, and why it all matters.
Let’s unpack the graph that runs your stack.
🔍 What Is a Graph (and Why Should You Care)?
In computer science, a graph is a mathematical structure made up of two sets:
A set of vertices (also called nodes)
A set of edges (connections between vertices)
Graphs can be either directed (where edges have a direction, like a one-way street) or undirected (where connections go both ways).
Now, in data engineering, directed graphs are the default.
And when we say DAG — a Directed Acyclic Graph — we mean:
Directed: Each edge has a clear direction
Acyclic: There are no cycles; you can’t go in circles
Graph: A set of nodes and edges
This structure allows us to model execution order.
dbt uses DAGs to decide in what order to run models. Airflow uses DAGs to determine task execution paths. Even lineage tools use DAGs to track how data flows.
This is not abstract CS anymore. It’s your pipeline. Visualised. Controlled.
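To make the two-set definition concrete, here is a minimal Python sketch of a directed graph stored as a vertex set plus an edge set. The model names (raw_customers, stg_orders, and so on) are made up for illustration, and this is not how dbt or Airflow store their graphs internally.

```python
# A directed graph as two sets: vertices and edges.
# All model names below are hypothetical examples.
vertices = {
    "raw_customers",
    "raw_orders",
    "stg_customers",
    "stg_orders",
    "customer_orders",
}

# Each edge is an ordered pair (upstream, downstream).
edges = {
    ("raw_customers", "stg_customers"),
    ("raw_orders", "stg_orders"),
    ("stg_customers", "customer_orders"),
    ("stg_orders", "customer_orders"),
}


def to_adjacency(vertices, edges):
    """Turn the edge set into an adjacency list: node -> set of downstream nodes."""
    adjacency = {v: set() for v in vertices}
    for upstream, downstream in edges:
        adjacency[upstream].add(downstream)
    return adjacency


adjacency = to_adjacency(vertices, edges)
for node in sorted(adjacency):
    print(node, "->", sorted(adjacency[node]))
```

The adjacency list holds exactly the same information as the edge set. It is just arranged so that "what comes after this node?" is a single lookup.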
🧩 Key Concepts You Need to Understand
Let’s map core graph theory ideas to your daily data work.
🧱 Nodes (Vertices)
In Computer Science terms, a vertex is a single point in a graph.
In data engineering, it's an operational unit — a model, a task, a table. In dbt, for example:
my_model.sql is a vertex
A source like raw_customers is also a vertex
So is a test, snapshot, or seed
Each vertex holds attributes — metadata that describes what it does:
name, alias, materialisation
description, tags
Most importantly: its dependencies
These define how this node connects to others.
🔗 Edges
An edge is a link between two nodes.
In a directed graph, an edge goes from one node to another. It’s written as an ordered pair: (A → B) means "A must come before B".
In dbt, this edge exists when:
model_b references model_a using ref('model_a')
Therefore, model_a must run before model_b
These edges define the partial order — a CS term for "some things have to happen before others, but not everything has a strict order."
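As a rough sketch of how dependencies turn into ordered pairs, the snippet below uses a hypothetical, stripped-down Node class (not dbt's real node object) and derives the edge set from each node's depends_on list. The attribute names are assumptions chosen to echo the ones listed earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical, simplified stand-in for a dbt node."""
    name: str
    materialisation: str = "view"
    tags: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)  # names passed to ref()


nodes = [
    Node("model_a", materialisation="table", tags=["staging"]),
    Node("model_b", depends_on=["model_a"]),
    Node("model_c", depends_on=["model_a"]),
]


def edges_from(nodes):
    """Each dependency becomes an ordered pair (upstream, downstream)."""
    return {(dep, node.name) for node in nodes for dep in node.depends_on}


edges = edges_from(nodes)
print(sorted(edges))
```

Note that no pair links model_b and model_c. That is the partial order at work: both must wait for model_a, but neither has to wait for the other.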
🚫 Acyclic = No Loops
A cycle in graph theory is a path that starts and ends at the same node.
Imagine model_a → model_b → model_a. That’s a cycle. It creates a loop where nothing knows where to begin.
DAGs don’t allow that. They're acyclic — there is no way to return to the same node.
This guarantees that execution order is well-defined. That’s why tools like dbt and Airflow require acyclic structures. They let the engine calculate a topological sort — the ordered list in which to execute tasks.
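One standard way to compute such an order is Kahn's algorithm, sketched below. This is an illustration under simple assumptions (hypothetical model names, every dependency also present as a key), not a claim about how dbt or Airflow implement it. Python 3.9+ also ships graphlib.TopologicalSorter in the standard library if you would rather not write your own.

```python
from collections import deque


def topological_sort(upstream):
    """Kahn's algorithm.

    `upstream` maps each node to the set of nodes it depends on.
    Every dependency must also appear as a key.
    """
    # Count unmet dependencies for every node.
    remaining = {node: len(deps) for node, deps in upstream.items()}

    # Invert the mapping so we can find a node's dependents quickly.
    downstream = {node: [] for node in upstream}
    for node, deps in upstream.items():
        for dep in deps:
            downstream[dep].append(node)

    # Start with the nodes that depend on nothing.
    ready = deque(sorted(n for n, count in remaining.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for dependent in sorted(downstream[node]):
            remaining[dependent] -= 1
            if remaining[dependent] == 0:
                ready.append(dependent)

    # Nodes stuck in a cycle never reach zero unmet dependencies.
    if len(order) != len(upstream):
        raise ValueError("cycle detected: no valid run order exists")
    return order


# Hypothetical models: node -> what it depends on.
upstream = {
    "raw_orders": set(),
    "stg_orders": {"raw_orders"},
    "orders_daily": {"stg_orders"},
    "orders_report": {"orders_daily", "stg_orders"},
}
run_order = topological_sort(upstream)
print(run_order)
```

In general a DAG can have several valid topological orders. Any of them respects every edge, which is all an execution engine needs.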
🧬 Lineage = Traversing the Graph
Lineage tools walk through the graph using graph traversal algorithms.
They might do a depth-first search (DFS) to follow each path downstream from a node as far as it goes, or a breadth-first search (BFS) to visit a node's direct neighbours first and then move outward level by level.
But from your perspective, lineage just means: "Where did this data come from, and where does it go?"
Full disclosure: One of my team members worked on a lineage task last month. Explaining this made the difference between being stuck and succeeding.
The graph structure makes that possible — and beautifully clear.
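The two traversals can be sketched in a few lines of Python. The lineage graph below is invented for illustration, and real lineage tools will differ in how they store and walk it. Flipping the edges lets the same functions answer the upstream question too.

```python
from collections import deque

# Hypothetical lineage graph: node -> direct downstream nodes.
downstream = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["orders_daily", "orders_by_customer"],
    "orders_daily": ["revenue_dashboard"],
    "orders_by_customer": [],
    "revenue_dashboard": [],
}


def dfs(graph, start):
    """Depth-first: follow one path to its end before backtracking."""
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Reverse so the first-listed neighbour is visited first.
        stack.extend(reversed(graph[node]))
    return order


def bfs(graph, start):
    """Breadth-first: visit direct neighbours first, then their neighbours."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


def invert(graph):
    """Flip every edge so traversals answer 'where did this come from?'"""
    upstream = {node: [] for node in graph}
    for node, children in graph.items():
        for child in children:
            upstream[child].append(node)
    return upstream


dfs_order = dfs(downstream, "raw_orders")
bfs_order = bfs(downstream, "raw_orders")
ancestry = bfs(invert(downstream), "revenue_dashboard")
print(dfs_order)
print(bfs_order)
print(ancestry)
```

Both traversals reach the same set of nodes. They differ only in the order of the visit, which matters when you want "nearest impact first" (BFS) versus "follow this branch to the end" (DFS).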
🛠️ Step-by-Step: How to Read a DAG in dbt
Open up the dbt DAG view. Here's how to read it through your new CS lens.
🔍 Step 1: Spot the Nodes
Each box is a vertex.
You’ll see models (.sql files), sources (like raw Snowflake tables), seeds (CSV inputs), and tests.
All are part of the graph’s vertex set.
🔀 Step 2: Follow the Arrows
Each arrow is a directed edge.
It connects two nodes and shows the dependency direction. This means one model must run before the next.
From a graph theory view, the arrow is a directed edge (u, v), where u is upstream and v is downstream.
🧭 Step 3: Trace the Flow
Start at source nodes — they have no incoming edges.
Follow arrows outward. You’re performing a graph traversal, often from left to right.
This shows how data flows and where transformations happen.
🧵 Step 4: Think in Layers
Mentally group the DAG into layers:
Source nodes (roots): models or sources with no upstream dependencies; the starting points
Intermediate layers: cleaning, enrichment, aggregation
Terminal nodes (leaves): final outputs or dashboards; no downstream edges
This structure aligns with how we architect pipelines — from raw input to business-ready outputs.
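The layering can also be computed rather than eyeballed. The sketch below assumes an acyclic graph and uses invented model names. It finds the nodes with no incoming edges, the nodes with no outgoing edges, and places every other node one layer below its deepest dependency.

```python
# Hypothetical pipeline: node -> set of direct upstream nodes.
upstream = {
    "raw_customers": set(),
    "raw_orders": set(),
    "stg_customers": {"raw_customers"},
    "stg_orders": {"raw_orders"},
    "customer_orders": {"stg_customers", "stg_orders"},
    "revenue_report": {"customer_orders"},
}


def sources(upstream):
    """Nodes with no incoming edges."""
    return {node for node, deps in upstream.items() if not deps}


def terminals(upstream):
    """Nodes with no outgoing edges: nothing lists them as a dependency."""
    used = set().union(*upstream.values())
    return set(upstream) - used


def layers(upstream):
    """Layer 0 holds the sources; each other node sits one layer below
    its deepest dependency. Assumes the graph is acyclic."""
    depth = {}

    def depth_of(node):
        if node not in depth:
            depth[node] = 1 + max(
                (depth_of(dep) for dep in upstream[node]), default=-1
            )
        return depth[node]

    grouped = {}
    for node in upstream:
        grouped.setdefault(depth_of(node), []).append(node)
    return [sorted(grouped[level]) for level in sorted(grouped)]


source_nodes = sources(upstream)
terminal_nodes = terminals(upstream)
pipeline_layers = layers(upstream)
for level, names in enumerate(pipeline_layers):
    print(level, names)
```

Nodes that share a layer have no dependency on each other, so in principle they could run in parallel.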
⚠️ Common Mistakes to Avoid
Let’s bust a few CS-related myths that sneak into data projects.
❌ Misunderstanding Execution Order
The DAG is not just a visual. It encodes a topological order. That’s the order in which the engine will execute your models.
Break the graph? You break the build.
❌ Confusing Ref with SQL
Hardcoding select * from my_model doesn’t create an edge in the DAG.
Only ref('my_model') does: it creates an actual edge in the graph (source() does the same for sources). Without it, the tool can’t know the dependency.
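To see why a hard-coded table name is invisible to the graph, consider this deliberately simplified sketch. It scans SQL text for ref() calls with a regular expression. dbt resolves ref() through its own Jinja handling rather than a regex like this one, so treat the pattern and the find_edges helper as illustrative assumptions only.

```python
import re

# Matches {{ ref('name') }} or {{ ref("name") }}. A deliberate simplification.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def find_edges(model_name, sql):
    """Return (upstream, downstream) pairs for every ref() found in the SQL."""
    return [(upstream, model_name) for upstream in REF_PATTERN.findall(sql)]


with_ref = "select * from {{ ref('my_model') }}"
hardcoded = "select * from analytics.my_model"

ref_edges = find_edges("report", with_ref)
hardcoded_edges = find_edges("report", hardcoded)
print(ref_edges)
print(hardcoded_edges)
```

The hard-coded query may well run against the warehouse, but it contributes no edge, so nothing guarantees that my_model is built first.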
❌ Accidentally Creating Cycles
This one’s more common than you’d think.
Model A joins B, and someone edits B to now join A. Suddenly — a cycle.
Your build fails because a topological sort becomes impossible. The graph is no longer acyclic.
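A cycle check is easy to sketch with a depth-first search that tracks which nodes are on the current path. The version below is one common textbook approach, written with hypothetical model names. It makes no claim about how any particular tool reports cycles.

```python
def find_cycle(downstream):
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in downstream}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for neighbour in downstream[node]:
            if colour[neighbour] == GREY:
                # Back edge: the neighbour is already on the current path.
                return path[path.index(neighbour):] + [neighbour]
            if colour[neighbour] == WHITE:
                cycle = visit(neighbour)
                if cycle:
                    return cycle
        colour[node] = BLACK
        path.pop()
        return None

    for node in downstream:
        if colour[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


healthy = {"model_a": ["model_b"], "model_b": ["model_c"], "model_c": []}
broken = {"model_a": ["model_b"], "model_b": ["model_a"]}

print(find_cycle(healthy))
print(find_cycle(broken))
```

Returning the offending path, not just a yes or no, is what makes the error actionable: it tells you which edge to remove.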
🧠 Advanced Tips (Optional)
If you want to lean a little further into graph theory and apply it to data workflows:
🪜 Build DAGs Programmatically
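An adjacency list (a mapping from each node to the nodes it points to) is a simple way to define custom DAGs in Python. Airflow gives you a shorthand for the same idea when you declare task dependencies: task1 >> task2 creates a directed edge from task1 to task2.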
You’re literally building the edge set manually.
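To show how an arrow-style operator can build an edge set, here is a toy Task class that overloads >> and writes into a plain adjacency list. It is a hypothetical stand-in written for this article, not Airflow's real operator classes, which do considerably more.

```python
class Task:
    """Toy stand-in for a pipeline task. Not Airflow's real class."""

    def __init__(self, task_id, dag):
        self.task_id = task_id
        self.dag = dag
        dag.setdefault(task_id, [])  # register the node with no edges yet

    def __rshift__(self, other):
        """`self >> other` records the directed edge self -> other."""
        self.dag[self.task_id].append(other.task_id)
        return other  # returning `other` lets a >> b >> c chain


dag = {}  # adjacency list: task_id -> list of downstream task_ids
extract = Task("extract", dag)
transform = Task("transform", dag)
load = Task("load", dag)

extract >> transform >> load
print(dag)
```

Each use of >> appends one entry to the adjacency list, so the chained line above adds exactly two edges.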
🧪 Test with DAG Subsets
Sometimes you don’t need to run the whole graph.
Graph theory lets you isolate a subgraph: for example, a changed model plus everything downstream of it. Selectors like dbt build --select model+ use this principle.
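The idea behind a "model plus its descendants" selection can be sketched as a reachability search followed by trimming the graph to the selected nodes. The function names and models here are hypothetical, and dbt's real selector syntax supports far more than this.

```python
def select_with_descendants(downstream, model):
    """Rough analogue of selecting `model+`:
    the model plus everything reachable from it."""
    selected, stack = set(), [model]
    while stack:
        node = stack.pop()
        if node not in selected:
            selected.add(node)
            stack.extend(downstream[node])
    return selected


def induced_subgraph(downstream, selected):
    """Keep only the selected nodes and the edges that run between them."""
    return {
        node: [child for child in downstream[node] if child in selected]
        for node in downstream
        if node in selected
    }


# Hypothetical project: node -> direct downstream nodes.
downstream = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["orders_daily", "customer_orders"],
    "stg_customers": ["customer_orders"],
    "orders_daily": [],
    "customer_orders": [],
}

selected = select_with_descendants(downstream, "stg_customers")
subgraph = induced_subgraph(downstream, selected)
print(sorted(selected))
print(subgraph)
```

Everything on the orders branch that does not feed from stg_customers is left out, which is why a targeted run can be so much cheaper than a full build.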
📐 Think in Graph Terms
When designing a pipeline, sketch a graph:
Nodes = tasks or models
Edges = “depends-on” relationships
Ask: Is it acyclic? Is it connected? Are there isolated components?
This mindset leads to cleaner architectures — and fewer headaches.
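The "is it connected, are there isolated components?" question can be answered by grouping nodes that are linked once edge direction is ignored (weakly connected components). The sketch below uses invented model names, including a deliberately orphaned old_experiment node.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes that are linked when edge direction is ignored."""
    neighbours = {node: set() for node in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical project with one orphaned model.
nodes = {"raw_orders", "stg_orders", "orders_report", "old_experiment"}
edges = [("raw_orders", "stg_orders"), ("stg_orders", "orders_report")]

components = weakly_connected_components(nodes, edges)
isolated = [group[0] for group in components if len(group) == 1]
print(components)
print(isolated)
```

A single-node component is often a model nobody depends on and that depends on nothing, which makes it a good candidate for a cleanup review.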
🧠 A Cheat Sheet for Data Engineers
This was a lot of info. Here’s a short summary for you.
Keep this in your mental clipboard:
Node = Task or Model: If it runs, builds, or transforms — it’s a node.
Edge = Dependency: ref('other_model') draws an arrow in your DAG.
Direction = Run Order: Edges aren’t just lines; they define what runs when.
Acyclic = Safe Execution: No loops. No infinite recursion. No weird bugs.
Topological Sort = Build Plan: This is how dbt figures out “what runs next.”
Lineage = Traversal: Graphs make it easy to trace how raw data becomes insights.
💭 Final Thoughts
Graphs are not abstract math. They’re how your pipelines run.
You don’t need to memorise algorithms or study CS to understand DAGs. But knowing the core ideas — nodes, edges, direction, cycles, order — gives you power.
You see errors before they happen. You design cleaner models. You debug faster.
And now, you speak the language your tools were built on.
From now on, when someone pulls up a DAG, you won’t just see boxes and lines. You’ll see structure. Logic. Flow.
That’s what graph thinking gives you.
Cheers,