
From Ledgers to Intelligence, Part 9. Data Lakes: The Promise of Storing Everything (and the Mess That Followed)

Digital Transformation | June 2026

In 2010, James Dixon, then CTO of Pentaho, wrote a blog post that introduced the term “data lake” to the world. He contrasted it with the data mart, which he characterised as a “bottle of water” (cleaned, packaged, structured for a specific purpose). The data lake, by analogy, was a body of water in its natural state: raw data could flow into it and be accessed directly in its original form. Consumers would come to the lake and take only what they needed.

The image was appealing. The reality that emerged in most enterprise deployments over the following decade was considerably less serene, and it prompted a wave of innovation in data lake governance, cataloguing, and ultimately the architectural rethinking described in Part 10.

A natural lake: the metaphor that gave data lakes their name. Like its natural counterpart, a data lake stores everything that flows into it, in its natural form, accessible by any consumer. Without governance, it becomes a swamp. Credit: Unsplash

The Economic Logic of Cheap Object Storage

The data lake concept gained traction because of a fundamental economic shift: cloud object storage (Amazon S3, launched in 2006; Azure Data Lake Storage; Google Cloud Storage) made it cheap enough to store virtually unlimited data indefinitely. A terabyte of S3 storage in 2010 cost approximately $140 per month (about $0.14 per gigabyte per month), an order of magnitude cheaper than equivalent capacity in a relational database, and several orders of magnitude cheaper than traditional SAN storage.

These economics made a new approach viable. Instead of deciding what data to keep (and transforming and loading only the data that fit predefined schemas), organisations could keep everything (all raw data, at full fidelity, indefinitely) and decide what to use later. The schema-on-read paradigm, in contrast to the schema-on-write of traditional databases, meant data could be stored without defining its structure upfront. That structure would be applied at query time, when the analyst knew what questions they were asking.
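A minimal sketch of schema-on-read in Python may make the idea concrete. The records, field names, and the `ORDER_SCHEMA` mapping are hypothetical; the point is only that the raw lines are stored untouched and the structure is imposed by the reader, not by the store.

```python
import json

# Hypothetical raw events, stored exactly as received (no schema enforced on write).
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "ts": "2010-06-01T10:00:00"}',
    '{"user": "b2", "amount": 5, "channel": "web"}',
    '{"user": "c3"}',
]

# The schema belongs to the query, not the storage: field -> (caster, default).
ORDER_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Parse raw lines and project each record onto the schema at read time."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field_name, (cast, default) in schema.items():
            value = record.get(field_name)
            row[field_name] = cast(value) if value is not None else default
        yield row


rows = list(read_with_schema(RAW_LINES, ORDER_SCHEMA))
total = sum(r["amount"] for r in rows)
```

A different analyst could read the same three lines with a different schema (for example, one that keeps `channel` and ignores `amount`) without anything in storage changing.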

For data types that relational databases handled poorly (web server logs, JSON API responses, sensor readings, images, audio files), the data lake was genuinely liberating. These data types could now be stored in their native formats alongside structured transactional data, in a single unified store, processed by Hadoop or Spark as needed.

The Architecture: Zones and Formats

Mature data lake architectures converged on a zoned approach: multiple layers within the lake serving different purposes and different consumers.

The raw zone (sometimes called “bronze”) held data in its original format as received from source systems: no transformation, no cleaning, no schema enforcement. A new Salesforce account record arrived in JSON format; it was stored in JSON. A database table extract arrived as CSV; it was stored as CSV. The raw zone was append-only: records were never overwritten or deleted. It was the audit log of everything that had ever entered the lake.

The curated zone (sometimes called “silver”) held data that had been standardised: converted to a consistent format (typically Apache Parquet, a columnar format optimised for analytical queries on distributed systems), cleaned, deduplicated, and enriched. A data engineer’s job was to build pipelines that moved data from raw to curated, applying business rules and quality checks in the process.

The consumption zone (sometimes called “gold”) held datasets specifically prepared for consumption by BI tools or data scientists: aggregated, pre-joined, and modelled for specific analytical use cases. The consumption zone was functionally equivalent to the data mart layer in a traditional warehouse architecture, but it sat on object storage rather than in a relational database.
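The flow between the three zones can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a production pipeline: the zones are in-memory lists rather than folders of files on object storage, and the order records, function names, and quality rule are all hypothetical.

```python
from collections import defaultdict

# Raw ("bronze"): kept exactly as received, including duplicates and bad values.
raw_zone = [
    {"order_id": "1", "customer": " Acme ", "amount": "100.0"},
    {"order_id": "1", "customer": " Acme ", "amount": "100.0"},  # duplicate delivery
    {"order_id": "2", "customer": "acme", "amount": "50.5"},
    {"order_id": "3", "customer": "Globex", "amount": "n/a"},  # fails quality check
]


def to_curated(raw_records):
    """Raw -> curated ("silver"): standardise types, clean, and deduplicate."""
    seen, curated = set(), []
    for rec in raw_records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # a real pipeline would likely quarantine the record instead
        if rec["order_id"] in seen:
            continue
        seen.add(rec["order_id"])
        curated.append({
            "order_id": int(rec["order_id"]),
            "customer": rec["customer"].strip().lower(),
            "amount": amount,
        })
    return curated


def to_consumption(curated_records):
    """Curated -> consumption ("gold"): pre-aggregate for one reporting use case."""
    totals = defaultdict(float)
    for rec in curated_records:
        totals[rec["customer"]] += rec["amount"]
    return dict(totals)


curated_zone = to_curated(raw_zone)
consumption_zone = to_consumption(curated_zone)
```

Note that `raw_zone` is never modified: the curated and consumption outputs are derived copies, which is what lets the raw zone serve as an audit log.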

File formats evolved significantly through the data lake era. Apache Parquet, developed by Cloudera and Twitter and open-sourced in 2013, became the de facto standard for analytical data in data lakes. Its columnar layout and aggressive compression (particularly effective for low-cardinality columns) dramatically reduced both storage costs and query times compared to row-based formats like CSV or JSON. Apache ORC (Optimized Row Columnar), developed by Hortonworks in collaboration with Facebook for Hive, provided similar benefits with slightly different performance characteristics and is still widely used in Hive-based environments.
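To see why a columnar layout compresses low-cardinality columns so well, consider this simplified Python illustration. It is not Parquet's actual on-disk encoding; it only shows the general idea of pivoting rows into columns and run-length encoding repeated values. The sample rows and helper names are hypothetical.

```python
from itertools import groupby

# Hypothetical row-oriented records, as they might arrive in CSV or JSON.
rows = [
    {"country": "UK", "status": "active", "amount": 10},
    {"country": "UK", "status": "active", "amount": 25},
    {"country": "UK", "status": "closed", "amount": 7},
    {"country": "DE", "status": "closed", "amount": 12},
    {"country": "DE", "status": "closed", "amount": 3},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
encoded_country = run_length_encode(columns["country"])
encoded_status = run_length_encode(columns["status"])

# A query that needs only "amount" touches that one column and skips the rest.
total_amount = sum(columns["amount"])
```

Five country values shrink to two pairs here. In a row-based file the same values are interleaved with every other field, so neither the compression nor the column-skipping is available.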

Enterprise data infrastructure supporting data lake deployments: the combination of scalable compute (Spark clusters) and cheap object storage (S3, ADLS) that made the data lake economically viable. Credit: Pexels

The Data Swamp: What Went Wrong

The promise of the data lake (store everything, use anything) collapsed in practice for a consistent set of reasons.

No discoverability. As a data lake grew from terabytes to petabytes, and from dozens of datasets to thousands, finding relevant data became genuinely difficult. A new analyst joining the team might know that a customer dataset existed somewhere in the lake, but not which folder, which file format, which dates it covered, or which version was canonical. Without a data catalogue (a searchable inventory of what existed in the lake and what it meant), the lake was opaque.

No trust. Data in the raw zone had never been validated. It contained nulls, duplicates, impossible values, and schema variations that its creators considered acceptable because they were loading at speed. Analysts who built reports from raw zone data built reports that contained errors they could not identify without deep knowledge of the source system. The data was present; it was not trustworthy.

No governance. Who owned which dataset? Who was responsible for its quality? When a field’s meaning changed, because a source system was updated or because a business rule changed, who was responsible for updating the documentation? In most lake implementations, nobody was. Datasets accumulated without owners; documentation was written once and never updated; field meanings drifted without record.

No ACID guarantees. Object storage stores files; it does not support transactions. Updating a record in a data lake meant either rewriting the entire partition containing that record (expensive and slow) or appending a new version and writing query logic to select the latest one (complex and error-prone). Deleting records, as GDPR compliance required, meant identifying and rewriting every file containing the relevant data. These operations were not just inconvenient; for organisations with regulatory compliance requirements, they were existential problems.
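The "append a new version and select the latest" workaround looks roughly like the following Python sketch. The record layout, the `version` counter, and the `deleted` tombstone flag are assumptions made for illustration; real lakes used a variety of conventions.

```python
# Hypothetical append-only log: every update is written as a new version.
appended_versions = [
    {"customer_id": 1, "email": "old@example.com", "version": 1, "deleted": False},
    {"customer_id": 2, "email": "b@example.com", "version": 1, "deleted": False},
    {"customer_id": 1, "email": "new@example.com", "version": 2, "deleted": False},
    {"customer_id": 2, "email": None, "version": 2, "deleted": True},  # tombstone
]


def latest_view(versions):
    """Return current state: highest version per key, minus tombstoned keys."""
    latest = {}
    for rec in versions:
        current = latest.get(rec["customer_id"])
        if current is None or rec["version"] > current["version"]:
            latest[rec["customer_id"]] = rec
    return {key: rec for key, rec in latest.items() if not rec["deleted"]}


current_state = latest_view(appended_versions)
```

Every reader has to apply this logic correctly, which is where the complexity and errors come from. The tombstone also only hides customer 2 from the view: the original email is still physically present in the first version, which is why a genuine erasure required rewriting the underlying files.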

The Governance Response

The data catalogue emerged as the primary governance tool for the data lake era. Apache Atlas, originally developed at Hortonworks, provided metadata management and data lineage tracking integrated with the Hadoop ecosystem. Amundsen, open-sourced by Lyft in 2019, focused on search and discovery, providing a Google-like interface for finding datasets across a lake. DataHub, open-sourced by LinkedIn in 2020, offered a more comprehensive platform covering search, lineage, governance, and quality monitoring. Commercial catalogues, including Alation and Collibra, offered richer functionality with enterprise support.
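At its core, a catalogue is a searchable inventory of metadata with ownership and lineage attached. The toy Python sketch below shows that core idea only; it does not reflect the API or data model of Atlas, Amundsen, DataHub, or any commercial product, and every dataset name, path, and owner in it is made up.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    name: str
    location: str
    file_format: str
    owner: str
    description: str
    upstream: list = field(default_factory=list)  # source dataset names (lineage)


# Hypothetical entries for illustration.
catalogue = [
    CatalogueEntry("customers_raw", "raw/crm/customers/", "json",
                   "crm-team", "Customer records as exported from the CRM"),
    CatalogueEntry("customers_curated", "curated/customers/", "parquet",
                   "data-eng", "Deduplicated customer master data",
                   upstream=["customers_raw"]),
    CatalogueEntry("web_logs_raw", "raw/web/logs/", "text",
                   "platform", "Web server access logs"),
]


def search(entries, term):
    """Case-insensitive keyword search over dataset names and descriptions."""
    term = term.lower()
    return [e for e in entries
            if term in e.name.lower() or term in e.description.lower()]


def lineage(entries, name):
    """Walk upstream links to list every dataset the named one derives from."""
    by_name = {e.name: e for e in entries}
    result, stack = [], list(by_name[name].upstream)
    while stack:
        parent = stack.pop()
        if parent not in result:
            result.append(parent)
            stack.extend(by_name[parent].upstream)
    return result


hits = [e.name for e in search(catalogue, "customer")]
sources = lineage(catalogue, "customers_curated")
```

With entries like these, the new analyst from the discoverability example can find the customer datasets, see which one is curated, and learn who owns it and where it came from.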

These tools addressed the discoverability and documentation problems. They did not address the ACID problem: the fundamental mismatch between object storage semantics and the transactional guarantees that reliable data management requires. That problem required a different kind of solution, one that would redefine what a data lake could be.


References

  1. Dixon, J. (2010). Pentaho, Hadoop, and Data Lakes. James Dixon’s Blog, Pentaho.
  2. Xin, R. S. et al. (2013). Shark: SQL and Rich Analytics at Scale. Proceedings of SIGMOD 2013.
  3. Inmon, W. H. (2016). Data Lake Architecture: Designing the Data Lake and Avoiding the Garbage Dump. Technics Publications.
  4. Gartner Research (2016). Avoid the Data Lake Pitfalls. Gartner Report.
  5. Apache Parquet (2013). Apache Parquet: Columnar Storage for the People. Apache Software Foundation.
  6. DataHub Project (2020). DataHub: A Generalized Metadata Search & Discovery Tool. LinkedIn Engineering.
  7. Fowler, M. (2015). Data Lake. martinfowler.com.
