From Ledgers to Intelligence Part 3: ETL and the Relational Pipeline How Data First Moved from Source to Report

Digital Transformation | May 2026

A data warehouse is only as good as the data inside it. The warehouse defines what analysis is possible; the ETL process determines whether that analysis can be trusted. Extract, Transform, Load three words describing what sounds like a simple sequence turned out to be the most technically complex, operationally fragile, and project-budget-consuming component of enterprise analytics for the better part of twenty years.

ETL was not a product. It was a discipline a set of engineering practices for moving data from where it was created (operational source systems) to where it would be analysed (the data warehouse), while resolving the differences between them: different schemas, different data types, different business rules, different levels of completeness and quality. Every organisation that built a data warehouse also built, whether deliberately or accidentally, an ETL architecture. The quality of that architecture determined the credibility of every number the warehouse ever produced.

Enterprise server rooms housing the ETL processing infrastructure of the 1990s and 2000s dedicated servers running nightly batch jobs that moved and transformed data between systems. Credit: Unsplash

The Three-Step Process

Extract meant connecting to source systems operational databases, flat files, mainframe datasets, XML feeds, spreadsheets and pulling data out without disrupting the source. This was harder than it sounds. Source systems were often poorly documented, maintained by separate IT teams, and built on technologies incompatible with the warehouse platform. Change data capture extracting only records modified since the last run required either database log access (often restricted) or a timestamp-based approach that missed deletes entirely.

Transform was where the real work happened. Transformations included: standardising date formats (does “01/02/03” mean January 2nd or February 1st?), reconciling entity identifiers (the same customer might be “ACME Corp” in the ERP and “Acme Corporation” in the CRM), applying business rules (is a sale recognised at order, at shipment, or at invoice?), handling nulls and defaults, splitting concatenated fields, and aggregating transaction-level data into summary records. Each transformation encoded a business decision and those decisions accumulated into a body of institutional logic that lived only inside the ETL code.

Load meant writing transformed data into the target warehouse in the right sequence (dimension tables before fact tables, to satisfy foreign key constraints), handling duplicates, managing slowly changing dimensions (should a customer’s address change create a new dimension record or overwrite the old one?), and completing within the available batch window typically the overnight hours between the close of business and the start of the next day’s operations.

The Tools: Informatica, DataStage, SSIS

By the mid-1990s, a specialised category of ETL tool had emerged to manage this complexity. Rather than writing ETL logic in COBOL or PL/SQL fragile, hard to maintain, and difficult to test these tools provided graphical development environments where data flows were represented as visual pipelines: source connectors feeding into transformation operators feeding into target loaders.

Informatica PowerCenter, founded in 1993 and broadly deployed from the mid-1990s, became the market leader in enterprise ETL. Its architecture a central repository of metadata describing all sources, targets, and transformations meant that impact analysis was possible: changing a source table’s schema could be cross-referenced against all mappings that used it, identifying what would break before deployment.

IBM DataStage (originally from Parallel Technology, acquired by Ascential in 2002 and by IBM in 2005) was the primary competitor, particularly strong in mainframe-connected environments. Microsoft SQL Server Integration Services (SSIS), introduced with SQL Server 2005, brought ETL tooling within reach of Microsoft-ecosystem shops that could not justify Informatica licensing costs. Ab Initio, notoriously secretive about pricing and famously powerful for parallel processing, dominated very large-scale ETL implementations at banks and telecommunications companies.

Server and network infrastructure supporting ETL pipeline operations. ETL jobs were scheduled on dedicated processing servers with network connections to both source and target systems. Credit: Pexels

Staging Layers and Data Marts

As ETL architectures matured, a multi-layer approach emerged. Data from source systems landed first in a staging area a transient store holding raw, unmodified copies of source records. This served two purposes: it insulated the source system (the extract ran once, against staging, not repeatedly against production) and it provided a recovery point if transformation logic failed midway through a run.

From staging, transformations produced the integrated data warehouse layer (in Inmon’s terminology, the enterprise data warehouse). From there, further transformations materialised subject-specific data marts optimised, Kimball-style star schemas for specific business domains: a finance mart, a sales mart, a supply chain mart. Each mart served a distinct group of consumers with data structured for their specific analytical needs.

This three-zone architecture (staging → warehouse → mart) became the canonical enterprise pattern. Its strength was separation of concerns: extraction logic was decoupled from transformation logic, which was decoupled from consumption models. Its weakness was maintenance burden: a change in a source system’s schema could cascade through all three zones, requiring coordinated updates across dozens of mappings.

🏗️ Architecture: ETL Era Architecture (1995–2006)

🏭
Source Systems
(ERP, CRM, Billing,
Flat Files)→📥
Staging Layer
(Raw Landing,
No Transform)→🔄
ETL Engine
(Informatica,
DataStage, SSIS)→🏛️
Enterprise DW
(Normalised,
Historical)→⭐
Data Marts
(Star Schema,
Subject-Specific)→📊
BI Reports
(Cognos, BO,
Crystal Reports)

The Batch Window: A Structural Constraint

Every ETL architecture operated within the batch window the hours during which operational systems were quiescent enough to allow data extraction without performance impact. In practice, this window shrank continuously as global operations expanded business hours and e-commerce extended transactional activity around the clock. A batch window that began at 11 PM and ended at 6 AM in 1998 might shrink to four hours by 2005 as online transaction volumes grew.

The response to shrinking batch windows was incremental loading extracting only changes since the last run and, eventually, micro-batch and near-real-time ETL patterns. But these introduced new complexity: change data capture required database-level access (often restricted by DBA teams), and near-real-time pipelines had to handle late-arriving records and out-of-order events in ways that nightly batch architectures could ignore.

Why ETL Was Brittle

The most persistent failure mode of ETL architectures was the coupling between source system structure and ETL logic. When a source team renamed a column, changed a data type, or restructured a table to accommodate a new business process, ETL jobs failed silently or, worse, loaded incorrect data that persisted in the warehouse until discovered in a monthly reconciliation. Schema changes in source systems were among the leading causes of data quality incidents throughout the 2000s.

A second failure mode was the business logic embedded in transformation code. ETL jobs accumulated years of business rules fiscal year definitions, currency conversion approaches, customer segmentation criteria that were never documented anywhere except in the code itself. When the engineers who wrote them left, the rules became archaeology: preserved in the codebase but opaque to anyone trying to understand them.

These structural weaknesses would eventually drive the shift to ELT (Extract-Load-Transform) a paradigm reversal that we will examine in Part 8. But for a decade and a half, ETL was the architecture of enterprise analytics, and understanding its constraints explains every design decision made to escape them.

References

Kimball, R. & Caserta, J. (2004). The Data Warehouse ETL Toolkit. Wiley.
Vassiliadis, P., Simitsis, A. & Skiadopoulos, S. (2002). Conceptual modeling for ETL processes. Proceedings of the 5th ACM international workshop on Data Warehousing and OLAP, 14–21.
Informatica Corporation (2004). Informatica PowerCenter Architecture Guide.
Inmon, W. H. (2005). Building the Data Warehouse, 4th ed. Wiley.
Simitsis, A. & Vassiliadis, P. (2008). A methodology for the conceptual modeling of ETL processes. Transactions on Systems, Man, and Cybernetics, 38(6), 1128–1148.
Microsoft Corporation (2005). SQL Server Integration Services (SSIS) Technical Reference.
Marco, D. (2000). Building and Managing the Meta Data Repository. Wiley.