From Brews to Bytes: Demystifying Snowflake's Storage
Learn how Snowflake stores data and what it has in common with beer warehouses
Snowflake is like brewing beer at home — easy to start with but hard to master. Like in brewing, knowing what happens behind the scenes is the difference between mediocrity and expertise. To optimise your queries and control costs effectively, it's essential to grasp the fundamentals of Snowflake’s storage system.
In this issue of Data Gibberish, you will learn:
How Snowflake knows where to look when you run queries
What the most important metric is when you organise your warehouse
When leaving your data cluttered is fine
Read time: 7 minutes
Picks Of The Week
Every time a new tech emerges, people find a way to run Doom on it. I almost lost hope in the Snowflake community. I thought we were not nerdy enough. Last week, Daniel Palma proved me wrong. (link)
- published a banger on Unit Testing For Data Engineers. This one is a great continuation of my dbt unit testing article. (link)
Probably the biggest star in the data community, , talks about the evolution of the Modern Data Stack from hype to a standard. (link)
The Warehouse
Imagine you work in a large warehouse, managing an extensive beer collection. You occasionally get new deliveries with cases of random beer bottles. Once in a while, your manager, Stacy, asks some wild questions about the inventory.
Your task today is to count the bottles containing exactly 5.2% alcohol. You ask a dozen of your peers for help and start counting the bottles. This will be a long and expensive question to answer.
Snowflake is like a beer warehouse but digital. And instead of beer, it holds data. Every bottle is a record with columns: brand, alcohol content, bitterness and many more. Also, when you get random data, you must stash it and serve it when needed.
Micro-Partitions: The Building Blocks of Data Storage
Instead of boxes, you store your records in files called micro-partitions. And instead of 24 bottles of beer, each micro-partition holds between 50 and 500 MB of uncompressed data.
Now, Snowflake knows managers have funny requests. When you load data, it collects metadata and labels each micro-partition. Here’s what you can see on the label:
The number of distinct values for each column
The smallest and largest values for each column
Some other (undisclosed) optimisation metadata
You like Snowflake’s idea and put a label on every beer case.
Brand: 2; Hills,White Claw; ...
Alcohol: 4; 4.7,5.8; ...
Bitterness: 5; 18,24; ...
This process is extremely helpful. As you walk through the warehouse, you can check the labels on the beer boxes and skip those where 5.2% falls outside the range. Unfortunately, this is not fast enough.
You still need to walk past every box to read its label. And for the boxes you can’t skip, you still have to open them and count how many bottles contain exactly 5.2% alcohol.
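If you prefer code to crates, here is a tiny Python sketch of the same idea. Everything in it is made up for illustration: real micro-partitions live inside Snowflake, and you never touch them directly. The point is only how min/max metadata lets a query skip whole files.

```python
from dataclasses import dataclass

@dataclass
class MicroPartition:
    """A toy stand-in for a micro-partition: a bunch of records plus its 'label'."""
    rows: list

    def min_max(self, column):
        values = [row[column] for row in self.rows]
        return min(values), max(values)

# Hypothetical partitions, each holding a handful of "bottles" (records).
partitions = [
    MicroPartition([{"brand": "Hills", "alcohol": 4.7},
                    {"brand": "White Claw", "alcohol": 4.9}]),
    MicroPartition([{"brand": "Hills", "alcohol": 5.2},
                    {"brand": "White Claw", "alcohol": 5.8}]),
]

def count_bottles(partitions, column, target):
    """Count matching records, scanning only partitions whose range can contain the target."""
    total = 0
    for partition in partitions:
        low, high = partition.min_max(column)   # read the label
        if not (low <= target <= high):
            continue                            # pruned: skip the whole box
        total += sum(1 for row in partition.rows if row[column] == target)
    return total

print(count_bottles(partitions, "alcohol", 5.2))  # -> 1, and only the second box was opened
```

Snowflake does the real version of this pruning for you automatically. The only thing you influence is how well values are grouped across files, which is where clustering comes in.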
But to find a better way to organise your beer, you need a way to compare organisation methods. Let’s see what Snowflake does.
Clustering Depth: Simplifying Data Retrieval
Clustering depth is the average depth of the overlapping micro-partitions for a specified set of columns in a table.
In simpler terms, it's like looking at how many boxes you'd have to open to find all the bottles with 5.2% alcohol content. The lower the clustering depth, the fewer boxes you need to check. This indicates a well-organised warehouse (or a well-clustered table).
Imagine initially, the beer bottles are placed randomly throughout the warehouse. If your manager asks you to find all boxes with 5.2% alcohol, you’d likely have to search through many sections and boxes. This signifies a high clustering depth.
In the ideal case, you’d have a clustering depth of one, which means all the 5.2% bottles sit in a single box with a label similar to this one:
Brand: 2; Hills,White Claw; ...
Alcohol: 1; 5.2,5.2; ...
Bitterness: 5; 18,24; ...
This label means all bottles in the box have the same 5.2% alcohol volume. You only need to know how many bottles are in the box. You can answer your manager’s question immediately, and that is a massive win.
So you decide to group all bottles into boxes by alcohol level. You even sort the boxes, so you know exactly where to look. You’ve just finished the whole organisation overhaul when Stacy asks you to find out which brand you have the fewest bottles of.
All brands have a selection of beers with different alcohol levels. This organisation doesn’t help you answer this new question!
The same applies to Snowflake. Clustering depth is not an absolute value. It varies by column, so you can organise your tables well for some queries and horribly for others. On top of that, just as in real life, you can’t cluster your data by everything.
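You don’t have to guess how deep your own tables are, either. Snowflake exposes the number through the SYSTEM$CLUSTERING_DEPTH function. Below is a rough sketch using the snowflake-connector-python package; the connection parameters and the BEERS table with its ALCOHOL column are placeholders you would swap for your own.

```python
import snowflake.connector

# Placeholder credentials: replace with your own account details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
    database="my_database",
    schema="my_schema",
)
cur = conn.cursor()

# Average clustering depth of the hypothetical BEERS table for the ALCOHOL column.
# The lower the number, the fewer micro-partitions overlap for that column.
cur.execute("SELECT SYSTEM$CLUSTERING_DEPTH('BEERS', '(ALCOHOL)')")
print(cur.fetchone()[0])

cur.close()
conn.close()
```

Run it for the columns you actually filter on. As the warehouse story shows, a table can look tidy for one column and messy for another.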
Automatic Clustering: Managing Snowflake Costs Efficiently
It is important to note that new beer bottles arrive over time, and others are sold or moved. The warehouse organisation might degrade, leading to an increase in clustering depth. So every now and then, when you have some free time, you inspect the shelves and reorganise the beer.
In Snowflake, data changes can affect the clustering of a table and reduce its efficiency. Periodic re-clustering is needed to restore optimal performance. Luckily, Snowflake does that in the background for you. This automated re-clustering incurs some costs, but in my experience, the savings from answering questions far outweigh the re-clustering expenses.
The best thing about the automated re-clustering is the automation. You just define the criteria, and the warehouse worker follows the rules even when you are not around.
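In Snowflake, “defining the criteria” means setting a clustering key. Here is a minimal sketch, again with the hypothetical BEERS table and placeholder connection details: once the key is set, the background service keeps the table organised, and you can pause or resume it if you want a tighter grip on the credits.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_warehouse", database="my_database", schema="my_schema",
)
cur = conn.cursor()

# Define the clustering key: Snowflake's background service now keeps
# the table organised by alcohol content, even when you are not around.
cur.execute("ALTER TABLE BEERS CLUSTER BY (ALCOHOL)")

# If the re-clustering credits worry you, the service can be paused and resumed.
cur.execute("ALTER TABLE BEERS SUSPEND RECLUSTER")
cur.execute("ALTER TABLE BEERS RESUME RECLUSTER")

cur.close()
conn.close()
```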
Final Considerations: Strategy Over Structure
What questions do you need to answer the most?
As you already learned, you shouldn’t pick a random clustering key. Spend time with your warehouse, collect information, and decide how to organise your beer. You need to know what kinds of questions you get asked the most and make a decision. Let the business problem lead you.
Do you have enough beer?
While a system helps organise your large warehouse, you should consider if you need one for the three beers in your fridge. Clustering a Snowflake table is not worth it if it has only a few gigabytes of data—meaning a few micro-partitions. This is true even if you query the table thousands of times.
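Before committing to a clustering key, check how many micro-partitions the table actually has. SYSTEM$CLUSTERING_INFORMATION returns that count (among other things) as JSON; the table and column names below are, as before, placeholders.

```python
import json

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_warehouse", database="my_database", schema="my_schema",
)
cur = conn.cursor()

# How many micro-partitions does the hypothetical BEERS table have,
# and how well is it already clustered on ALCOHOL?
cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('BEERS', '(ALCOHOL)')")
info = json.loads(cur.fetchone()[0])
print(info["total_partition_count"], info["average_depth"])

cur.close()
conn.close()
```

If the partition count is in the single or low double digits, save yourself the trouble (and the credits).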
What if nobody cares?
Let’s be honest: Do you always query all the data in your Snowflake instance?
The truth is, not all data is born equal. In fact, about 70% of data lies unused (link). You don’t need to organise this data. The chances are nobody wants to drink this beer. And even if they do, satisfying the occasional ad-hoc request could be cheaper than keeping everything organised.
Summary
This week, we uncovered the basics behind Snowflake’s storage. But instead of using math or complex terms, we used an example from the real world.
Snowflake’s storage mechanism is easy to understand. Your data is split into small files. Whenever you run a query, Snowflake peeks at each file’s metadata to decide what it can skip. You can influence how the data is grouped in these files so that queries retrieve it much faster and cheaper.
Like in your beer warehouse, labelling and sorting your inventory makes it easy to find what you need. You don’t just save time and money but also increase customer satisfaction.
But wait — there is more!
Today, you learned just the fundamentals. In a few weeks, I’ll dive deeper and tell you how you can apply your new knowledge and start living in a much more organised world.
What else do you want to learn? Reply to this email or reach out.
How Did You Find This Post?
Did you enjoy that piece? Follow me on LinkedIn for daily updates.