Data Lakes For Complete Noobs: What They Are and Why The Hell You Need Them
Everybody around you is talking about warehouses and lakehouses. Here's why data lakes aren't dead yet
Brought to you by
Product for Engineers
Product for Engineers is PostHog’s newsletter dedicated to helping engineers and founders improve their product skills. Subscribe for free to get curated advice on building great products, lessons (and mistakes) they’ve learned building PostHog, and deep dives on the culture and strategies of top startups.
Greetings, curious reader,
You’ve probably heard the term “data lake” thrown around in tech conversations. It sounds fancy, but what does it actually mean?
A data lake is simpler than it sounds, and by the end of this article, you’ll understand why it’s still a big deal.
🤔 What Is a Data Lake?
🌊 The Natural Lake Analogy
Think of a natural lake. It holds water from rain, rivers, and snow without filtering or processing it first.
A data lake works the same way but for data. It stores raw information in its original format—structured data from databases, semi-structured data like JSON or XML, and unstructured data like images, videos, and text files.
Unlike a data warehouse, which requires data to be cleaned and structured before storage, a data lake allows you to store everything first and decide how to process it later. This flexibility makes data lakes useful for businesses handling diverse data sources and unpredictable future needs.
Or, to put it simply:
A data lake is a directory on a remote server where you store all sorts of files.
💾 How Data Lakes Store and Process Information
Data lakes function as central repositories for all types of data, whether transactional, operational, or analytical. They store raw data at scale without requiring transformation upfront.
When you need to process or analyse the data, you can use a variety of processing frameworks, including Apache Spark, Pandas, and modern database engines like DuckDB. These tools allow organisations to extract insights when needed rather than defining strict rules before data ingestion.
Data lakes support multiple formats, including Parquet, Avro, and CSV, which enable efficient querying and storage. Organisations can load data in real-time or in batch processes and decide later how to structure and process it.
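To make this concrete, here's a minimal sketch with DuckDB. The lake/events/ folder, file layout, and column names are hypothetical; with the httpfs extension, the same query works against an s3:// path.

```python
import duckdb

# Hypothetical lake layout: a folder of raw Parquet files dropped in by various jobs.
# No upfront schema or ETL; the query reads the files as they are.
con = duckdb.connect()

daily_events = con.execute("""
    SELECT event_type, COUNT(*) AS events
    FROM read_parquet('lake/events/*.parquet')
    GROUP BY event_type
    ORDER BY events DESC
""").fetchdf()

print(daily_events)
```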
💡 Why You Need a Data Lake
💰 Scalability and Cost Efficiency
In the past, we used tools like Hadoop to build data lakes. But these were expensive to build and maintain. I haven't seen anybody building a new data lake using Hadoop in at least 5 years.
Modern data lakes handle massive amounts of data without breaking the bank.
They use technologies like Amazon S3, Google Cloud Storage, or Azure Blob Storage, which are cheaper than traditional data warehouses.
With the click of a button, you get virtually unlimited storage. You can store customer logs, emails, images, and more without worrying about upfront processing costs.
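As a rough illustration, here is what "store everything first" looks like with boto3 and S3. The bucket name and file paths are made up; the point is that CSVs, logs, and images land in the lake unchanged.

```python
import boto3

# Hypothetical bucket and key names. Raw files land in object storage as-is,
# with no schema, cleaning, or transformation required up front.
s3 = boto3.client("s3")

s3.upload_file("exports/orders_2024-06-01.csv", "my-company-lake", "raw/orders/2024/06/01/orders.csv")
s3.upload_file("logs/app.log", "my-company-lake", "raw/logs/2024/06/01/app.log")
s3.upload_file("assets/product_123.jpg", "my-company-lake", "raw/images/product_123.jpg")
```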
🤖 Advanced Analytics and Machine Learning
Raw data is gold for data scientists. It lets them experiment and uncover insights that pre-processed data might hide.
For instance, a data lake can store IoT sensor data for future analysis, even if you don't yet know how you'll use it.
This flexibility is crucial for businesses that want to stay ahead.
🛒 Real-World Example
Imagine you run an e-commerce business. Your data lake could store:
Customer purchase histories
Website click-stream data
Customer service chat logs
Social media mentions
Product images
Today, you might only analyse purchase data. Tomorrow, you could combine it with chat logs to improve customer satisfaction. Businesses often integrate data lakes with machine learning models.
For example, recommendation engines can leverage raw user interactions, product reviews, and purchase behaviour stored in a data lake. By running models against these datasets, businesses gain better insights into customer preferences.
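Here's a hedged sketch of that idea with DuckDB, joining hypothetical raw purchase and chat-log files straight out of the lake. It assumes the httpfs extension is available and S3 credentials are configured; the bucket, paths, and columns are invented for illustration.

```python
import duckdb

# Join raw purchase history (Parquet) with raw support chats (JSON) in place,
# without loading either dataset into a warehouse first.
con = duckdb.connect()

per_customer = con.execute("""
    SELECT p.customer_id,
           COUNT(DISTINCT p.order_id) AS orders,
           COUNT(c.chat_id)           AS support_chats
    FROM read_parquet('s3://shop-lake/raw/purchases/*.parquet') AS p
    LEFT JOIN read_json_auto('s3://shop-lake/raw/chats/*.json') AS c
           ON p.customer_id = c.customer_id
    GROUP BY p.customer_id
""").fetchdf()

print(per_customer.head())
```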
I've seen cases like this more than a few times.
⚔️ Data Lakes vs. Data Warehouses
⚖️ Key Differences
Data warehouses require structured data. They’re great for predefined queries and reports.
Data lakes store raw data. They’re ideal for exploratory analysis and machine learning.
🤷‍♂️ When to Use Each
Use a data warehouse if you need consistent, structured data for regular reporting.
Use a data lake if you want the flexibility to analyse raw data in new ways.
Businesses often use both, leveraging a data warehouse for operational reporting while maintaining a data lake for exploratory research, AI, and machine learning.
🏠 Why Not Use a Lakehouse?
I love using data lakehouses. If you, like me, love open-source technologies and want a single solution for analytics and data science, lakehouses are excellent.
However, there are some valid reasons why you might not need one.
🧩 Complexity and Overhead
Lakehouses combine the best of data lakes and warehouses. But they add complexity.
You need tools like Delta Lake or Apache Iceberg, which require extra setup and maintenance.
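To see what that extra layer means in practice, here's a small sketch comparing a plain Parquet write with a Delta table write using the deltalake (delta-rs) package. The paths and data are made up; the point is the additional dependency and table format a lakehouse brings.

```python
import pandas as pd
from deltalake import write_deltalake  # extra dependency a plain data lake doesn't need

df = pd.DataFrame({"user_id": [1, 2], "plan": ["free", "pro"]})

# Plain data lake: drop a Parquet file into a folder and you're done.
df.to_parquet("lake/users/users.parquet")

# Lakehouse: the same data goes through a table format (Delta here), which adds
# a transaction log, versioning, and one more tool to set up and maintain.
write_deltalake("lake/users_delta", df)
```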
💸 Cost Considerations
Lakehouses are cost-effective compared to warehouses but pricier than data lakes.
A data lake might be enough if you’re storing raw data for occasional analysis.
🎯 Use Case Alignment
For some use cases, a data lake is more straightforward and practical.
For example, a research institution storing raw scientific data might not need a lakehouse.
⚠️ Common Pitfalls and How to Avoid Them
💀 Turning Your Lake into a Swamp
Without proper metadata and governance, data lakes can become unusable.
Solution: Implement metadata management and data cataloguing tools.
🚨 Security Risks
Storing raw data can expose sensitive information.
Solution: Use encryption and access controls to protect your data.
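As one concrete example for an S3-based lake, you can turn on default server-side encryption with boto3 so every new object is encrypted at rest; the bucket name is hypothetical.

```python
import boto3

# Hypothetical bucket name. With default encryption enabled, every object
# written to the lake is encrypted at rest without changing the upload code.
s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="my-company-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```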
📥 Overloading the Lake
Don’t dump everything into your data lake without a plan.
Solution: Define clear data ingestion and storage policies.
🏗️ Step-by-Step Guide to Building a Data Lake
🛠️ Tools and Resources
Start with cloud platforms like AWS, Azure, or Google Cloud.
Use modern processing tools like Polars, DuckDB, or Daft for analysis.
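For instance, a minimal Polars sketch, assuming a hypothetical lake/raw/orders/ folder of Parquet files with made-up column names, might look like this:

```python
import polars as pl

# Lazily scan raw Parquet files in the lake and only read the columns
# and rows the query actually needs.
revenue = (
    pl.scan_parquet("lake/raw/orders/*.parquet")
    .filter(pl.col("status") == "completed")
    .group_by("country")
    .agg(pl.col("amount").sum().alias("revenue"))
    .collect()
)

print(revenue)
```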
📝 Action Plan
1. Define Your Goals: Decide what data you'll store and why.
2. Choose a Storage Solution: Pick a platform that fits your budget and needs.
3. Set Up Metadata Management: Use tools like Apache Atlas or AWS Glue.
4. Implement Security Measures: Encrypt data and control access.
5. Start Small: Test with a small dataset before scaling up.
💭 Final Thoughts
Lakehouses are the cool kids in the hood now. But you shouldn't write off data lakes just yet.
Data lakes are here to stay. They offer unmatched flexibility and scalability for modern data needs.
They are extremely easy to set up. You can spin one up in minutes! Plus, sometimes, you only need a folder on someone else's computer.
As your data engineering journey grows, you’ll see how data lakes fit into larger architectures like lakehouses.
But for now, focus on mastering the basics.
🏁 Summary
Data lakes are centralised repositories for raw data. They’re flexible, scalable, and cost-effective.
You can store structured, semi-structured, and unstructured data without upfront processing.
This makes them ideal for advanced analytics, machine learning, and evolving business needs.
Ready to dive deeper? Comment or reply to this email with your biggest question about data lakes.
Until next time,
😍 How Am I Doing?
Your feedback shapes Data Gibberish. Which parts do you love? What would you like more of? Hit reply or use the links below—be honest.
✨ Sponsor This Newsletter
Help me grow and reach more curious minds. Your support fuels fresh insights every week. Together, we make it happen.