Zero to Hero: Mastering Change Data Capture For Remarkable Database Integrations
All you need to know to talk about CDC confidently
Are you struggling to capture your customers' journey history?
You're not alone. Many businesses grapple with the challenge of piecing together customer journeys, often resulting in lost events and costly, error-prone processes.
Change Data Capture streamlines data tracking, bypassing traditional eventing system headaches. You gain a detailed data record by leveraging CDC, enabling informed decisions and clear business insights.
In this article, we'll explore the ins and outs of CDC, from the foundational Write-Ahead Logs to practical implementation tips. You'll learn when CDC shines brightest and when a more straightforward approach may suffice.
Read time: 7 minutes
Picks Of The Week
There is a lot of content about breaking into data, which is great! This week
filled a gap by writing about how to go from senior to staff data engineer. (link)I constantly tell people around me how engineering is more about people than tech. In his last piece,
expands on that idea. (link)The job market is a mess right now, and I believe it’s time for more people to become solopreneurs. Learn how
left Facebook for data consulting. (link)
Are you struggling to capture your customers' journey history?
You're not alone. Many businesses grapple with the challenge of piecing together customer journeys, often resulting in lost events and costly, error-prone processes.
Change Data Capture streamlines data tracking, bypassing traditional eventing system headaches. You gain a detailed data record by leveraging CDC, enabling informed decisions and clear business insights.
In this article, we'll explore the ins and outs of CDC, from the foundational Write-Ahead Logs to practical implementation tips. You'll learn when CDC shines brightest and when a more straightforward approach may suffice.
Your Task
Say your company needs a complete view of customers’ journey. They want to track trial starts, plan subscriptions, and churn reasons.
How do you collect this information?
You might first think of implementing event tracking, resulting in a table like this:
This table shows subscription details. For other data, simply create a new event.
However, this method has drawbacks:
No information about your existing users;
Need to implement an eventing service;
Can lose events for many different reasons.
Moreover, it's slow, costly, error-prone, and demands significant engineering time.
Let's explore a better history-tracking method.
Understanding Change Data Capture
My favourite way to track history is through the application’s database. It's ready-to-use, fully equipped, and requires no extra coding.
The approach I am talking about is called Change Data Capture, or CDC. As the name suggests, this is a process where you capture every change in the source data. Think of the CDC as a meticulous historian in your database, making sure every change is noticed.
But why does this matter?
With Change Data Capture, you get a complete record snapshot each time it changes.
The best part about Change Data Capture is that you get a complete snapshot of a record when you change it. This lets you see not only the record's current state but also precisely how it looked at any point in time.
Lacking data confidence is like driving with a foggy windshield - you can't navigate effectively.
CDC clears the fog, providing a reflection of your data landscape. This means you can make informed decisions swiftly, keeping you one step ahead.
But how do you keep track of changes?
You may first think to run SELECT queries periodically. I've been there. Querying the database hourly and writing results to Redshift. However, the snapshots were flawed, and I overloaded the source database needlessly.
Convinced about CDC and eager to start?
Here's where to begin. There are a few ways to make CDC happen, but today, I want to focus on my favourite one. Let’s start with some background.
Under The Database Hood With Write-Ahead Logs
Imagine writing an important letter and jotting down notes on a sticky pad first to avoid losing your thoughts if your pen quits. This is what databases like PostgreSQL and MySQL do each time you run an INSERT, UPDATE, or DELETE query.
Before changing the database, these changes are recorded in a special log: the Write-Ahead Log or the WAL.
The WAL is a sequential record of all database transactions. Each entry in the WAL describes a change to the database. The fundamental principle here is that these entries must be saved or flushed to a permanent storage system before the changes are applied to the database files.
The Write-Ahead Log ensures safety and efficiency. After a crash or power failure, the database uses the WAL to redo transactions, preserving data and consistency. This process is known as roll-forward recovery or REDO.
Okay, enough database detail for now. Let's see how Write-Ahead Logs benefit your Change Data Capture strategy.
Implementing CDC
Collecting the history is relatively straightforward. Believe it or not, databases give you access to internals like the Write-Ahead logs. Connecting to the database and fetching the logs is quick and light.
For PostgreSQL, you may need to install a plugin like wal2json to simplify the reading process, while for MySQL, reading the Binary Log directly is easy enough. I won't detail the process now but reach out for future implementation tips.
To fully track your data's history and boost performance, capture all changes without loss. Keep transactions in order, processing and storing changes in your data warehouse.
Do not update existing records! Treat CDC tables as append-only.
This means you will end up with many records for the same object:
A record when you create a new user
A record for each updated
A record when you delete the user
This is a powerful concept, but be careful. As you might have already guessed, the size of your tables will proliferate. So, don’t waste money capturing history on tables you don’t need.
Now, you might be thinking, "CDC sounds great, but is it always the right choice?" The answer is not always.
When to Use Change Data Capture (And When Not To)
Change Data Capture shines in scenarios where maintaining a comprehensive historical record is crucial or when you need to support incremental loading to improve performance. However, CDC might introduce complexity in setup and maintenance. It's all about finding the right tool for the job.
Let’s expand on these ideas.
When to Use CDC
Real-Time Data Synchronisation: Use CDC to keep data synchronised across different systems in real-time or near real-time. It's ideal for ensuring that changes in one database are immediately reflected in another system.
Event-Driven Architectures: CDC is perfect for event-driven architectures where actions are triggered by data changes. It can capture events as they happen, enabling responsive and dynamic systems.
Data Warehousing and Analytics: If you're aggregating data from various sources into a data warehouse for analytics, CDC can efficiently capture incremental changes. This reduces the need for full data loads, saving time and resources.
Audit and Compliance: When you must maintain a detailed audit trail of changes for compliance purposes, CDC can automatically track and record all changes, providing a comprehensive history of data modifications.
When Not to Use CDC
Simple Data Replication Needs: If your data replication needs are straightforward and do not require real-time syncing, simpler replication methods might suffice and be more accessible to implement.
Static Data Sets: CDC is overkill for data that rarely changes or is static. In such scenarios, periodic data refreshes might be more appropriate and resource-efficient.
Limited Resources: Implementing CDC can require additional resources and expertise. If your organisation has limited technical capacity or the cost outweighs the benefits, it might be better to explore alternative solutions.
Okay, this week’s piece became more protracted than expected. Let’s wrap it up.
Conclusion: Embracing CDC for Data Integration Excellence
You learned a lot today. We discussed the problems of periodic SELECT statements and understood the concept of Change Data Capture. Then, I told you how databases protect themselves from data loss and how you can benefit from the Write-Ahead mechanism. Finally, I told you when you should use CDC and when going with a more straightforward approach is a better idea.
In conclusion, mastering CDC can significantly elevate your data warehousing strategy. It ensures a complete historical record of your data and brings substantial performance benefits. It's not just about syncing databases; it's about empowering your business to see and understand its progress.
Start exploring CDC tools and platforms that align with your data warehousing needs. By embracing CDC, you're not just capturing data. You're capturing the story of your business, one change at a time.
Ready to become a CDC hero?
How Did You Find This Post?
Did you enjoy that piece? Follow me on LinkedIn for daily updates.
Very detailed post.
Yhis might be worth to share CDC practical pipeline: https://www.junaideffendi.com/p/real-time-cdc-with-aws-dms-snowflake