Data Lakes For Complete Noobs: What They Are and Why The Hell You Need Them
Everybody around you is talking about warehouses and lakehouses. Here's why data lakes aren't dead yet
Brought to you by
Product for Engineers
Product for Engineers is PostHog’s newsletter dedicated to helping engineers and founders improve their product skills. Subscribe for free to get curated advice on building great products, lessons (and mistakes) they’ve learned building PostHog, and deep dives on the culture and strategies of top startups.
Greetings, curious reader,
You’ve probably heard the term “data lake” thrown around in tech conversations. It sounds fancy, but what does it actually mean?
A data lake is simpler than it sounds, and by the end of this article, you’ll understand why it’s still a big deal.
🤔 What Is a Data Lake?
🌊 The Natural Lake Analogy
Think of a natural lake. It holds water from rain, rivers, and snow without filtering or processing it first.
A data lake works the same way but for data. It stores raw information in its original format—structured data from databases, semi-structured data like JSON or XML, and unstructured data like images, videos, and text files.
Unlike a data warehouse, which requires data to be cleaned and structured before storage, a data lake allows you to store everything first and decide how to process it later. This flexibility makes data lakes useful for businesses handling diverse data sources and unpredictable future needs.
Or, to put it simply:
A data lake is a directory on a remote server where you store all sorts of files.
💾 How Data Lakes Store and Process Information
Data lakes function as central repositories for all types of data, whether transactional, operational, or analytical. They store raw data at scale without requiring transformation upfront.
When you need to process or analyse the data, you can use a variety of processing frameworks, including Apache Spark, Pandas, and modern database engines like DuckDB. These tools allow organisations to extract insights when needed rather than defining strict rules before data ingestion.
Data lakes support multiple formats, including Parquet, Avro, and CSV, which enable efficient querying and storage. Organisations can load data in real-time or in batch processes and decide later how to structure and process it.
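To make this concrete, here's a minimal sketch with DuckDB. The lake/events/ folder, file layout, and column names are hypothetical; with the httpfs extension, the same query works against an s3:// path.

```python
import duckdb

# Hypothetical lake layout: a folder of raw Parquet files dropped in by various jobs.
# No upfront schema or ETL; the query reads the files as they are.
con = duckdb.connect()

daily_events = con.execute("""
    SELECT event_type, COUNT(*) AS events
    FROM read_parquet('lake/events/*.parquet')
    GROUP BY event_type
    ORDER BY events DESC
""").fetchdf()

print(daily_events)
```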
💡 Why You Need a Data Lake
💰 Scalability and Cost Efficiency
In the past, we used tools like Hadoop to build data lakes. But these were expensive to build and maintain. I haven't seen anybody building a new data lake using Hadoop in at least 5 years.
Modern data lakes handle massive amounts of data without breaking the bank.
They use technologies like Amazon S3, Google Cloud Storage, or Azure Blob Storage, which are cheaper than traditional data warehouses.
With the click of a button, you get virtually unlimited storage. You can store customer logs, emails, images, and more without worrying about upfront processing costs.
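As a rough illustration, here is what "store everything first" looks like with boto3 and S3. The bucket name and file paths are made up; the point is that CSVs, logs, and images land in the lake unchanged.

```python
import boto3

# Hypothetical bucket and key names. Raw files land in object storage as-is,
# with no schema, cleaning, or transformation required up front.
s3 = boto3.client("s3")

s3.upload_file("exports/orders_2024-06-01.csv", "my-company-lake", "raw/orders/2024/06/01/orders.csv")
s3.upload_file("logs/app.log", "my-company-lake", "raw/logs/2024/06/01/app.log")
s3.upload_file("assets/product_123.jpg", "my-company-lake", "raw/images/product_123.jpg")
```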
🤖 Advanced Analytics and Machine Learning
Raw data is gold for data scientists. It lets them experiment and uncover insights that pre-processed data might hide.
For instance, a data lake can store IoT sensor data for future analysis, even if you don't yet know how you'll use it.
This flexibility is crucial for businesses that want to stay ahead.
🛒 Real-World Example
Imagine you run an e-commerce business. Your data lake could store:
Customer purchase histories
Website click-stream data
Customer service chat logs
Social media mentions
Product images
Today, you might only analyse purchase data. Tomorrow, you could combine it with chat logs to improve customer satisfaction. Businesses often integrate data lakes with machine learning models.
For example, recommendation engines can leverage raw user interactions, product reviews, and purchase behaviour stored in a data lake. By running models against these datasets, businesses gain better insights into customer preferences.
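Here's a hedged sketch of that idea with DuckDB, joining hypothetical raw purchase and chat-log files straight out of the lake. It assumes the httpfs extension is available and S3 credentials are configured; the bucket, paths, and columns are invented for illustration.

```python
import duckdb

# Join raw purchase history (Parquet) with raw support chats (JSON) in place,
# without loading either dataset into a warehouse first.
con = duckdb.connect()

per_customer = con.execute("""
    SELECT p.customer_id,
           COUNT(DISTINCT p.order_id) AS orders,
           COUNT(c.chat_id)           AS support_chats
    FROM read_parquet('s3://shop-lake/raw/purchases/*.parquet') AS p
    LEFT JOIN read_json_auto('s3://shop-lake/raw/chats/*.json') AS c
           ON p.customer_id = c.customer_id
    GROUP BY p.customer_id
""").fetchdf()

print(per_customer.head())
```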
I've seen cases like this more than a few times.
⚔️ Data Lakes vs. Data Warehouses
⚖️ Key Differences
Data warehouses require structured data. They’re great for predefined queries and reports.
Data lakes store raw data. They’re ideal for exploratory analysis and machine learning.
🤷‍♂️ When to Use Each
Use a data warehouse if you need consistent, structured data for regular reporting.
Use a data lake if you want the flexibility to analyse raw data in new ways.
Businesses often use both, leveraging a data warehouse for operational reporting while maintaining a data lake for exploratory research, AI, and machine learning.
🏠 Why Not Use a Lakehouse?
I love using data lakehouses. If you, like me, love open-source technologies and want a single solution for analytics and data science, lakehouses are excellent.
However, there are some valid reasons why you might not need one.
🧩 Complexity and Overhead
Lakehouses combine the best of data lakes and warehouses. But they add complexity.
You need tools like Delta Lake or Apache Iceberg, which require extra setup and maintenance.
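To see what that extra layer means in practice, here's a small sketch comparing a plain Parquet write with a Delta table write using the deltalake (delta-rs) package. The paths and data are made up; the point is the additional dependency and table format a lakehouse brings.

```python
import pandas as pd
from deltalake import write_deltalake  # extra dependency a plain data lake doesn't need

df = pd.DataFrame({"user_id": [1, 2], "plan": ["free", "pro"]})

# Plain data lake: drop a Parquet file into a folder and you're done.
df.to_parquet("lake/users/users.parquet")

# Lakehouse: the same data goes through a table format (Delta here), which adds
# a transaction log, versioning, and one more tool to set up and maintain.
write_deltalake("lake/users_delta", df)
```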
💸 Cost Considerations
Lakehouses are cost-effective compared to warehouses but pricier than data lakes.
A data lake might be enough if you’re storing raw data for occasional analysis.
🎯 Use Case Alignment
For some use cases, a data lake is more straightforward and practical.
For example, a research institution storing raw scientific data might not need a lakehouse.
⚠️ Common Pitfalls and How to Avoid Them
💀 Turning Your Lake into a Swamp
Without proper metadata and governance, data lakes can become unusable.
Solution: Implement metadata management and data cataloguing tools.
🚨 Security Risks
Storing raw data can expose sensitive information.
Solution: Use encryption and access controls to protect your data.
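As one concrete example for an S3-based lake, you can turn on default server-side encryption with boto3 so every new object is encrypted at rest; the bucket name is hypothetical.

```python
import boto3

# Hypothetical bucket name. With default encryption enabled, every object
# written to the lake is encrypted at rest without changing the upload code.
s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="my-company-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```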
📥 Overloading the Lake
Don’t dump everything into your data lake without a plan.
Solution: Define clear data ingestion and storage policies.
🏗️ Step-by-Step Guide to Building a Data Lake
🛠️ Tools and Resources
Start with cloud platforms like AWS, Azure, or Google Cloud.
Use modern processing tools like Polars, DuckDB, or Daft for analysis.
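For instance, a minimal Polars sketch, assuming a hypothetical lake/raw/orders/ folder of Parquet files with made-up column names, might look like this:

```python
import polars as pl

# Lazily scan raw Parquet files in the lake and only read the columns
# and rows the query actually needs.
revenue = (
    pl.scan_parquet("lake/raw/orders/*.parquet")
    .filter(pl.col("status") == "completed")
    .group_by("country")
    .agg(pl.col("amount").sum().alias("revenue"))
    .collect()
)

print(revenue)
```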
📝 Action Plan
1. Define Your Goals: Decide what data you'll store and why.
2. Choose a Storage Solution: Pick a platform that fits your budget and needs.
3. Set Up Metadata Management: Use tools like Apache Atlas or AWS Glue.
4. Implement Security Measures: Encrypt data and control access.
5. Start Small: Test with a small dataset before scaling up.
💭 Final Thoughts
Lakehouses are the cool kids in the hood now. But you shouldn't write off data lakes just yet.
Data lakes are here to stay. They offer unmatched flexibility and scalability for modern data needs.
They are extremely easy to set up. You can spin one up in minutes! Plus, sometimes, you only need a folder on someone else's computer.
As your data engineering journey grows, you’ll see how data lakes fit into larger architectures like lakehouses.
But for now, focus on mastering the basics.
🏁 Summary
Data lakes are centralised repositories for raw data. They’re flexible, scalable, and cost-effective.
You can store structured, semi-structured, and unstructured data without upfront processing.
This makes them ideal for advanced analytics, machine learning, and evolving business needs.
Ready to dive deeper? Comment or reply to this email with your biggest question about data lakes.
Until next time,
😍 How Am I Doing?
Your feedback shapes Data Gibberish. Which parts do you love? What would you like more of? Hit reply or use the links below—be honest.
✨ Sponsor This Newsletter
Help me grow and reach more curious minds. Your support fuels fresh insights every week. Together, we make it happen.