The old enterprise data stack usually begins with a lie.
The lie is simple: if we can just extract data from enough systems, transform it in one central place, and load it into a nice clean warehouse, the business will finally understand itself. So teams build ETL pipelines like plumbing hidden behind the walls of a house. At first it feels civilized. Data flows. Dashboards appear. Executives nod at charts.
Then the walls start leaking.
The sales pipeline means one thing in CRM, another in finance, and something entirely different in the data warehouse because a reporting team “standardized” it two years ago. Customer means account in one system, subscriber in another, legal entity in a third, and a heroic compromise in analytics. Every change request becomes a ticket to a central platform team already drowning in dependencies. The warehouse grows. The semantics decay. And the business, despite being surrounded by data, trusts less and less of it.
This is why data products matter. Not because “data mesh” sounds modern, and not because Kafka plus microservices creates architectural virtue by mere proximity. Data products matter because they put meaning back where meaning belongs: in the domain. They replace anonymous data movement with owned, intentional, discoverable analytical assets. They turn the reporting swamp into something closer to a market.
That shift is not cosmetic. It changes who owns truth, how change happens, what gets standardized, and what is allowed to vary. It also changes the migration path out of legacy ETL. You do not rip out pipelines one Friday night and declare victory on Monday morning. You strangle them, reconcile them, and move semantic authority domain by domain.
This article takes a hard look at that transition: why traditional ETL pipelines fail at enterprise scale, how domain-owned data products reshape architecture, where Kafka and microservices fit, what migration really looks like, and when the whole idea is the wrong answer.
Context
Most enterprises did not set out to build bad data architecture. They inherited it.
A retailer acquires brands. A bank merges regions. An insurer adds digital channels on top of policy administration systems older than some of its staff. Each line of business brings applications, reporting logic, extraction jobs, and definitions. To cope, the organization centralizes. It creates a data warehouse or lake, then a data lakehouse, then a platform team to rationalize the mess. Centralization feels prudent because duplication is scary and governance sounds easier when one team holds the keys.
For a while, this works. The first generation of reports is often a genuine improvement over operational system chaos. A centralized ETL team can normalize customer identifiers, standardize date dimensions, and expose familiar business metrics. They provide badly needed visibility.
But scale changes the nature of the problem.
As the number of domains grows, the central team becomes a semantic bottleneck. They are expected to understand order capture, settlement, fraud, claims, inventory, pricing, consent, fulfillment, and customer support deeply enough to transform all data correctly. That expectation is absurd. Not because the team lacks talent, but because domain knowledge is not a generic skill. It is local, contested, and dynamic. Meaning lives with the people running the business process.
This is where domain-driven design becomes useful, not as a software modeling fad, but as architectural discipline. DDD reminds us that large systems are not one giant coherent model. They are a collection of bounded contexts with their own language, invariants, and purpose. The same term can mean different things in different contexts and still be perfectly correct. Trying to flatten those differences too early is how enterprises create polished nonsense.
Traditional ETL pipelines are built on early flattening. Data products begin with bounded contexts.
Problem
ETL pipelines fail in enterprises for reasons that are easy to recognize and surprisingly hard to fix.
The first problem is semantic drift. A pipeline usually starts from technical extraction logic and slowly accumulates business meaning. Fields are renamed. Codes are mapped. Joins are introduced. Filters become assumptions. Over time, the pipeline does not merely transport data; it becomes an undocumented model of the business. But the people maintaining it are rarely the people accountable for the business outcomes. So the model drifts from reality.
The second problem is ownership dilution. In most centralized data estates, nobody truly owns the published data set in the way a product team owns a customer-facing capability. Source system teams say, “That’s the analytics layer.” The ETL team says, “We only transform what we receive.” Consumers say, “We don’t trust it, so we built our own version.” The result is a chain of partial accountability, which is really another phrase for no accountability.
The third problem is lead time. A simple business request—say, expose net revenue by channel excluding reversals but including manual adjustments—has to pass through source team interpretation, pipeline backlog, schema changes, testing windows, release schedules, and BI layer updates. By the time the metric lands, the business question has often moved on.
The fourth problem is hidden coupling. A single “enterprise” transformation pipeline often encodes dependencies across dozens of domains. Change a policy status code in underwriting, and suddenly finance extracts fail. Introduce a new order state in commerce, and downstream marketing segmentation breaks. The warehouse seems centralized, but the blast radius is distributed everywhere.
The fifth problem is trust erosion. This is the quiet killer. Once business users discover that “customer churn” means three different things depending on which dashboard they open, they stop arguing about architecture and start exporting spreadsheets. The architecture has already lost.
Forces
Any credible architecture has to respect the forces in play. There are many.
Local domain knowledge versus global consistency.
Domain teams understand their own semantics. The enterprise needs cross-domain reporting. You need both, and they pull in opposite directions.
Autonomy versus governance.
If every domain publishes data however it wants, consumers drown in inconsistency. If a central group dictates all models, domain ownership becomes theater.
Operational truth versus analytical usability.
Event streams and service databases reflect how systems work. Analysts need stable, well-described, query-friendly representations. Raw telemetry is not a data product.
Speed versus reconciliation.
Teams want to move quickly from old ETL to domain-owned products. But finance, compliance, and executive reporting require reconciliation against existing numbers. A mismatch of 0.8% can become a boardroom problem.
Streaming versus batch.
Kafka, CDC, and event-driven microservices make near-real-time products possible. But not every domain needs streaming, and many enterprise controls still operate in batch cycles. Architecture should not romanticize immediacy.
Reuse versus bounded context integrity.
A shared canonical model seems economical. It often becomes semantic imperialism. Yet complete model fragmentation creates translation chaos. This tension must be managed, not wished away.
Solution
The core idea is straightforward: replace central ETL-owned datasets with domain-owned data products.
A data product is not just a table with a nicer name. It is a published, discoverable, governed analytical asset owned by a domain team that understands the business semantics, quality expectations, lifecycle, and consumers. It has a contract. It has metadata. It has support expectations. It evolves intentionally.
This is DDD applied to enterprise data architecture.
The domain that owns orders should publish order data products. The domain that owns payments should publish payment settlement data products. Customer service should publish interaction products. Finance should publish recognized revenue products. These are not raw source dumps masquerading as products. They are semantic interfaces for analytical use.
The center of gravity shifts.
Instead of one giant ETL team translating everyone else’s data, each domain becomes accountable for the analytical expression of its own bounded context. A central platform still matters enormously, but its role changes from semantic author to capability provider. It offers tooling: storage, schema registry, catalog, lineage, access control, quality monitoring, contract testing, event backbone, CI/CD pipelines, and templates for publication.
That distinction matters. Central teams should build roads, not drive every truck.
At the enterprise level, cross-domain insights are assembled from interoperable products rather than from direct extraction into a monolithic warehouse model. Some curated enterprise products may still exist, especially for finance or regulatory reporting, but they are built from domain-published products with explicit reconciliation logic and ownership.
Domain ownership model
The phrase “data products replace ETL pipelines” should not be taken literally as “there is no transformation anymore.” Of course there is transformation. The change is in where transformation is designed, owned, and governed. Data product architecture does not eliminate pipelines; it demotes them from being the architecture to being an implementation detail.
That is healthy.
Architecture
A practical enterprise architecture for data products usually contains five layers.
1. Operational domains
These are the microservices, packaged applications, and legacy systems that run the business. Some emit events to Kafka. Some expose change data capture. Some still produce files nightly because life is unfair. The architecture should accept all three.
2. Domain data product pipelines
Each domain builds publication pipelines that turn operational data into analytical products. These pipelines can be batch or streaming. They enrich, clean, conform internally, and apply domain semantics. Most important: they are owned by the domain team or a federated data team embedded with that domain.
A product should publish:
- business definition
- owner and support contact
- schema and version history
- freshness expectation
- quality SLOs
- access policies
- lineage to source systems
- usage guidance and sample queries
If those things are missing, it is not a product. It is just data.
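The checklist above can be made concrete as a catalog entry. A minimal sketch in Python, using invented field names and a hypothetical `DataProductDescriptor` class (not any real catalog library's API):

```python
from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    """Minimal catalog entry for a published data product (illustrative only)."""
    name: str                      # e.g. "claims.claim_payment"
    owner: str                     # accountable domain team
    support_contact: str           # where consumers raise issues
    definition: str                # business meaning in plain language
    schema_version: str            # semantic version of the published contract
    freshness_sla_hours: int       # maximum acceptable staleness
    quality_slos: dict = field(default_factory=dict)
    access_policy: str = "restricted"
    lineage: list = field(default_factory=list)   # upstream source systems

    def is_publishable(self) -> bool:
        # Without an owner, a definition, and a freshness target, it is "just data".
        return bool(self.owner and self.definition and self.freshness_sla_hours > 0)

claim_payment = DataProductDescriptor(
    name="claims.claim_payment",
    owner="claims-domain-team",
    support_contact="#claims-data-support",
    definition="Payments issued against approved claims, net of recoveries.",
    schema_version="2.1.0",
    freshness_sla_hours=24,
    quality_slos={"completeness": 0.995, "duplicate_rate_max": 0.001},
    lineage=["claims-platform", "payment-engine"],
)
print(claim_payment.is_publishable())  # → True
```

The point is not the particular fields but that publishability is checkable: a product missing its owner or definition should fail validation before it ever reaches the catalog.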
3. Shared platform services
This is where centralization still earns its keep. Enterprises need common tooling for:
- event streaming infrastructure like Kafka
- storage and compute
- orchestration
- data contracts and schema registry
- identity and authorization
- catalog and discoverability
- observability, lineage, and quality controls
- policy enforcement and retention
The platform should reduce the cost of doing the right thing. If publishing a well-governed data product takes six months of platform approvals, teams will route around you.
4. Cross-domain composed products
Not all business questions stay inside a bounded context. Margin analysis may need orders, discounts, returns, shipping, and billing. Risk analytics may need customer behavior, exposure, claims history, and payment status. These cross-domain views should be explicit composed products, with known owners and clear assumptions, not mysterious SQL sediment accumulating in dashboards.
5. Consumption layer
BI tools, notebooks, ML feature stores, APIs, and operational decision engines consume data products. Consumption is easier because products are intentional and discoverable, not scavenged.
Event-driven and batch coexistence
Kafka matters here, but not as a religion. It is valuable when domains already use event-driven microservices or when there is a need for low-latency publication, decoupled integration, and replayability. Event streams can feed streaming data products, support CDC-derived publication, and enable time-aware analytics.
But event streams are not automatically analytics-ready. Business users do not want an endless log of OrderStatusChanged messages with five schema versions and out-of-order arrivals. They want a coherent product: orders with accepted semantics. So domains often need stream processing plus stateful curation before publication.
In many enterprises, the right answer is hybrid: Kafka for event transport and domain change capture, batch compaction or warehouse publication for consumer-friendly products.
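As a sketch of that curation step, assume hypothetical `OrderStatusChanged` events keyed by order id and event time. A compaction pass reduces the log to one current row per order, using event time rather than arrival order so that late events cannot overwrite newer state:

```python
from datetime import datetime

# Hypothetical OrderStatusChanged events: multiple per order, possibly out of order.
events = [
    {"order_id": "A1", "status": "CREATED", "ts": datetime(2024, 5, 1, 9, 0)},
    {"order_id": "A1", "status": "SHIPPED", "ts": datetime(2024, 5, 2, 14, 0)},
    {"order_id": "B2", "status": "CREATED", "ts": datetime(2024, 5, 1, 10, 0)},
    {"order_id": "A1", "status": "PAID",    "ts": datetime(2024, 5, 1, 9, 30)},  # late arrival
]

def compact_latest_state(events):
    """Reduce an event log to one row per order, keyed on event time, not arrival order."""
    state = {}
    for e in events:
        current = state.get(e["order_id"])
        if current is None or e["ts"] > current["ts"]:
            state[e["order_id"]] = e
    return state

orders = compact_latest_state(events)
print(orders["A1"]["status"])  # → SHIPPED; the late PAID event does not win
```

Real streaming jobs do this with stateful processors and watermarks rather than an in-memory dict, but the semantics are the same: the published product carries current order state, not the raw message log.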
Semantic boundaries matter
One of the most abused concepts in enterprise data is the canonical data model. It usually arrives with good intentions and leaves wreckage. A universal enterprise “Customer” object sounds efficient, but it often bulldozes meaningful distinctions among prospect, account, policyholder, subscriber, legal entity, and individual consent subject.
DDD gives us a better instinct: preserve bounded contexts, then map between them deliberately. The customer domain may publish a customer profile product. Billing may publish bill-to-account. Risk may publish insured party. Enterprise composition happens through explicit relationships and translation, not by pretending all contexts mean the same thing.
That is slower on a whiteboard and faster in real life.
Reference architecture flow
Migration Strategy
Here is where many architecture articles become fantasy. They describe the target state as if the present state politely disappears. In real enterprises, legacy ETL pipelines run payroll, board metrics, and regulatory submissions. You do not switch them off because a new pattern is aesthetically superior.
You migrate with a progressive strangler approach.
Step 1: Identify domains and semantic authority
Start by mapping bounded contexts and deciding who owns key business concepts. Not every existing application owner should automatically own a data product. Ownership belongs where domain understanding and accountability actually live.
Pick a few high-value domains where pain is obvious: orders, payments, customer interactions, inventory, claims, or pricing. These are good candidates because they are heavily reused and often misunderstood in centralized warehouses.
Step 2: Expose current lineage and metric definitions
Before changing anything, make the existing ETL estate visible. Which reports consume which pipelines? Where are transformations applied? Which metrics are considered authoritative? What hidden spreadsheet logic fills the gaps?
This step is often politically uncomfortable because it reveals accidental complexity. Good. Sunlight is part of the migration.
Step 3: Publish parallel data products
Domain teams begin publishing their products alongside existing ETL outputs. Do not force immediate cutover. The goal is to establish product shape, metadata, contracts, and quality processes while keeping current reporting stable.
For event-centric domains, Kafka or CDC can feed the publication pipeline. For older systems, scheduled extraction may remain the source for some time. The product model matters more than the transport mechanism.
Step 4: Reconcile old and new
This is the step everyone underestimates.
You need systematic reconciliation between legacy ETL outputs and new data products:
- record counts
- key aggregates
- metric equivalence by period
- late-arriving changes
- duplicate detection
- null and default handling
- code mapping differences
- historical restatement behavior
Reconciliation is not just technical testing. It is semantic negotiation. Sometimes the new product is “wrong.” Sometimes the old ETL is wrong and has been wrong for years. Sometimes both are correct in different contexts. The enterprise needs to decide whether to preserve historical continuity or adopt improved semantics with explicit communication.
Finance is usually the forcing function here. If recognized revenue in the new model does not tie to the general ledger, the migration stops.
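The mechanical part of that reconciliation can be sketched as a per-period comparison against an agreed tolerance. The figures and the 0.5% threshold below are invented for illustration:

```python
def reconcile(legacy, product, tolerance=0.005):
    """Compare legacy ETL aggregates with new product aggregates per period.
    Returns the periods whose relative difference exceeds the tolerance."""
    breaches = {}
    for period, old_value in legacy.items():
        new_value = product.get(period)
        if new_value is None:
            breaches[period] = "missing in new product"
            continue
        delta = abs(new_value - old_value) / abs(old_value) if old_value else abs(new_value)
        if delta > tolerance:
            breaches[period] = f"delta {delta:.2%}"
    return breaches

legacy_revenue  = {"2024-Q1": 1_200_000.0, "2024-Q2": 1_350_000.0}
product_revenue = {"2024-Q1": 1_200_400.0, "2024-Q2": 1_298_000.0}

print(reconcile(legacy_revenue, product_revenue))  # Q2 breaches; Q1 is within tolerance
```

The code is trivial; the hard part is what happens to the breaches. Each one needs a human decision: fix the product, restate the legacy number, or document that both are correct in different contexts.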
Step 5: Redirect consumers incrementally
Move downstream consumers one cluster at a time: first exploratory analytics, then departmental BI, then executive reporting, then regulatory or audited outputs if appropriate. Each migration should include:
- dependency update
- dashboard validation
- historical comparison
- owner sign-off
- rollback plan
Step 6: Retire ETL selectively
Only retire legacy pipelines when:
- all critical consumers have moved
- reconciliation has stabilized
- support ownership is clear
- lineage and controls meet enterprise standards
A strangler migration succeeds because the old and new coexist long enough for confidence to build.
Migration view
Historical backfill and restatement
One common migration trap is assuming you only need forward publication. Enterprises need history. If a domain product begins today but analysts need three years of trend data, you must either backfill from source systems or maintain coexistence with historical ETL outputs.
Backfills are messy because historical source data rarely matches today’s semantics. Codes changed. MDM rules evolved. Missing keys were tolerated. Expect this. Decide explicitly whether historical data will be restated to new semantics or partitioned into “legacy history” and “product-native history.” There is no universal right answer.
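One way to make that decision explicit in code is to tag every historical record with its basis and restate retired codes only where the mapping is unambiguous. A sketch with hypothetical status codes and an invented cutoff date:

```python
def tag_history(records, restatement_cutoff, code_map):
    """Partition history into 'legacy basis' and 'product-native' records.
    code_map translates retired status codes into current semantics (hypothetical)."""
    out = []
    for r in records:
        rec = dict(r)
        if r["event_date"] < restatement_cutoff:  # ISO dates compare correctly as strings
            rec["basis"] = "legacy"
            # Restate old codes only where an unambiguous mapping exists.
            rec["status"] = code_map.get(r["status"], r["status"])
        else:
            rec["basis"] = "product-native"
        out.append(rec)
    return out

history = [
    {"event_date": "2021-03-01", "status": "CLSD"},   # retired code
    {"event_date": "2024-02-01", "status": "CLOSED"},
]
restated = tag_history(history, restatement_cutoff="2023-01-01",
                       code_map={"CLSD": "CLOSED"})
print(restated[0]["basis"], restated[0]["status"])  # → legacy CLOSED
```

Whichever policy you choose, the basis tag keeps it auditable: consumers can always tell whether a historical row reflects original semantics or a restatement.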
Enterprise Example
Consider a multinational insurer.
It has policy administration platforms in three regions, a CRM, a claims platform, a billing engine, several digital portals, and a large central data warehouse fed by hundreds of ETL jobs. Executives want a unified view of customer value, claims exposure, premium collection, and retention risk. They do not get one. Every report meeting turns into an argument over which extraction logic is current.
The company’s first instinct is the familiar one: expand the central data team, build a new lakehouse, and define an enterprise canonical model for customer, policy, and claim. They spend a year modeling. Nothing gets simpler.
Then the architecture changes direction.
The insurer identifies bounded contexts:
- customer engagement
- policy administration
- billing and collections
- claims
- finance
Each context gets responsibility for publishing domain data products. The claims domain publishes Claim, Claim Reserve, and Claim Payment products with clear definitions and event dates. Billing publishes Invoice, Collection, and Delinquency products. Policy publishes Policy, Coverage, and Policy Lifecycle products.
Kafka is already in place for digital channels and some newer microservices, so new events flow into publication pipelines quickly. Older regional policy systems still rely on CDC and nightly extracts. That is acceptable. The product interface is standardized even when the source mechanics are not.
A central platform team provides:
- schema contract tooling
- catalog and lineage
- quality scorecards
- access control
- storage templates
- reference dimensions such as region and currency calendars
Notice what they do not do: they do not define what a claim reserve means. Claims owns that.
For cross-domain analytics, the insurer creates a composed enterprise product: Customer Risk Exposure View. It combines customer relationships, active policies, open claims reserves, premium arrears, and service interactions. This product is jointly governed by risk analytics and the contributing domains. It does not pretend there is one universal “customer.” Instead, it documents relationships between party, policyholder, and bill-to account.
Migration starts with claims because the ETL pain is severe and claim metrics are constantly disputed. New claims products run in parallel with warehouse extracts for three quarters. During reconciliation, they discover a nasty issue: the old warehouse excludes reopened claims from certain monthly measures due to a long-forgotten transformation rule. Business leaders had been using underreported figures for years. The new product is semantically correct but politically inconvenient.
This is the real work.
The insurer chooses to preserve old metrics for historical reports marked “legacy basis” while introducing a corrected measure for all new analytics. The choice is communicated explicitly, with finance and actuarial sign-off. Because the migration included reconciliation discipline, the semantic shift is governed rather than accidental.
Within two years, lead time for new analytical products drops sharply. Domain teams support their own products. The central warehouse does not vanish overnight, but it stops being the only place where business meaning is assembled. Trust improves because ownership improves.
That is the point.
Operational Considerations
Elegant architecture fails in operations more often than on diagrams.
Product support model
Every data product needs a named owner, support process, and service expectations. If consumers do not know who to call when freshness slips or schema changes, you have recreated the anonymity of ETL under a new label.
Data contracts
Contracts are essential, especially with Kafka and microservices where schemas evolve frequently. Use versioning rules, compatibility checks, and publication standards. But keep contracts realistic. If every field change requires a governance tribunal, teams will bypass the system.
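A minimal sketch of what such a compatibility check might enforce, using a simplified schema representation rather than any particular registry's API:

```python
def backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Crude backward-compatibility rule: existing consumers must keep working.
    Reject removed required fields, type changes, and new required fields."""
    for name, spec in old_schema.items():
        new_spec = new_schema.get(name)
        if new_spec is None:
            if spec.get("required", False):
                return False          # required field removed: breaking
            continue                  # optional field removed: tolerated here
        if new_spec["type"] != spec["type"]:
            return False              # type change breaks consumers
    for name, spec in new_schema.items():
        if name not in old_schema and spec.get("required", False):
            return False              # new required field is a breaking change
    return True

v1 = {"claim_id": {"type": "string",  "required": True},
      "amount":   {"type": "decimal", "required": True}}
v2 = {"claim_id": {"type": "string",  "required": True},
      "amount":   {"type": "decimal", "required": True},
      "currency": {"type": "string",  "required": False}}  # optional addition: fine

print(backward_compatible(v1, v2))  # → True
```

Schema registries such as Confluent's implement richer rules (backward, forward, full, transitive), but the principle is the one shown: changes that only add optional fields pass; changes that break existing readers fail in CI, not in production.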
Observability and quality
Monitor:
- freshness
- volume anomalies
- schema drift
- null spikes
- duplicate rates
- referential integrity
- reconciliation deltas
- consumer impact
Quality should be visible in the catalog, not buried in platform logs. Consumers need to know whether a product is healthy before they bet a board pack on it.
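Several of those signals can be computed cheaply at publication time and written straight to the catalog. A sketch with invented rows and a fixed clock so the freshness check is reproducible:

```python
from datetime import datetime, timedelta

def health_report(rows, last_loaded_at, sla_hours=24, key="claim_id"):
    """Compute a few catalog-visible health signals for a product (illustrative)."""
    now = datetime(2024, 6, 1, 12, 0)  # fixed "now" for the example
    keys = [r[key] for r in rows]
    return {
        "fresh": now - last_loaded_at <= timedelta(hours=sla_hours),
        "row_count": len(rows),
        "null_key_rate": sum(1 for k in keys if k is None) / max(len(keys), 1),
        "duplicate_rate": 1 - len(set(keys)) / max(len(keys), 1),
    }

rows = [{"claim_id": "C1"}, {"claim_id": "C2"}, {"claim_id": "C2"}, {"claim_id": None}]
print(health_report(rows, last_loaded_at=datetime(2024, 6, 1, 2, 0)))
```

Production systems add volume-anomaly baselines and schema-drift detection on top, but even these four numbers, surfaced next to the product in the catalog, change consumer behavior.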
Security and access
Domain ownership does not mean domain-controlled access without enterprise guardrails. Sensitive data policies, consent enforcement, masking, retention, and regional residency still require centrally enforced controls. Federated ownership must sit inside a common risk framework.
Metadata discipline
The catalog is not a side project. Discoverability is part of the architecture. If analysts cannot find the right product and understand its semantics in minutes, they will return to extraction habits.
Cost management
Domain publication can create sprawl. Multiple products, duplicated storage, and event retention costs add up. Platform teams need chargeback or at least transparent cost reporting. Architecture that ignores economics is just expensive optimism.
Tradeoffs
This approach is better than centralized ETL in many enterprise settings. It is not free.
You gain semantic fidelity, but you lose some superficial uniformity.
Different domains will model reality differently. That is often correct, but it complicates cross-domain consumption.
You reduce central bottlenecks, but you increase federated coordination.
A strong platform and community of practice are mandatory. Otherwise you get many little data silos wearing product badges.
You improve ownership, but you ask domain teams to do more.
Some teams are not ready. They may understand operations deeply but lack analytical modeling capability. Federated enablement or embedded data engineers are often required.
You make lineage clearer, but architecture becomes more distributed.
More moving parts means more need for observability, contracts, and governance.
You can support real-time use cases, but streaming adds complexity.
Out-of-order events, replay behavior, idempotency, and stateful corrections are real engineering concerns. Batch remains the sane choice for many products.
Tradeoffs are the architecture. Anyone promising all upside is selling software.
Failure Modes
There are several predictable ways this pattern goes wrong.
1. Data products in name only
Teams publish raw database extracts, add a README, and call them products. Consumers still do the semantic work downstream. Nothing meaningful has changed.
2. Platform tyranny
The central platform becomes a new bureaucracy. Publication standards are so heavy that domains cannot move. This usually happens when governance is designed for control rather than enablement.
3. Domain anarchy
The opposite failure. Every team invents formats, naming, quality rules, and access patterns. Discoverability collapses. Cross-domain composition becomes painful.
4. Canonical model resurgence
After a few difficult cross-domain conversations, someone proposes a universal enterprise schema to simplify things. Congratulations, you are walking back into the swamp.
5. Unfunded ownership
Domains are told they own data products but receive no engineering capacity, no platform support, and no incentives. Ownership without budget is theater.
6. Ignoring reconciliation
The migration rushes ahead without parallel run and metric comparison. Then executive reports change unexpectedly, trust collapses, and the old warehouse gains another decade of life.
7. Overusing Kafka
Teams push every product into streaming because it feels modern. Consumers get event logs where they needed stable tables. Processing complexity rises; usability falls.
When Not To Use
There are cases where replacing ETL pipelines with domain data products is the wrong move, or at least premature.
Do not use this approach if your organization has very small data needs, a handful of source systems, and one well-functioning central analytics team. A lightweight warehouse with disciplined modeling may be entirely sufficient.
Do not use it if domain ownership is fictional. If teams cannot or will not take responsibility for semantics, quality, and support, federated architecture will underperform centralized ETL.
Do not use it when regulatory requirements demand highly centralized control and the organization lacks mature governance automation. You can still adopt some product thinking, but full decentralization may be too risky.
Do not force streaming infrastructure into a landscape dominated by slow-changing batch reporting needs. Kafka is useful where it solves a problem. It is not a moral upgrade.
And do not begin with enterprise-wide reinvention. If you need to prove the pattern, start with a few painful, high-value domains and show that trust and lead time improve.
Related Patterns
Several adjacent patterns complement this approach.
Data Mesh.
Data products are a core building block of data mesh, though many enterprises can adopt product thinking without embracing the full mesh vocabulary.
Bounded Context Mapping.
From DDD, this is crucial for identifying where semantics differ and where translation is required.
Strangler Fig Migration.
Ideal for replacing central ETL gradually while maintaining continuity and reducing cutover risk.
CQRS and Event Sourcing.
Sometimes useful in microservice landscapes where event streams can feed analytical publication. But do not confuse operational event sourcing with analytics architecture; they intersect, they are not identical.
Change Data Capture.
A practical bridge from legacy systems into domain publication pipelines when services and events are not available.
Master Data Management.
Still relevant, but should not become semantic imperialism. Use MDM to manage shared identifiers and reference relationships, not to erase bounded contexts.
Semantic Layer.
A semantic layer can sit on top of data products for governed metrics, but it should amplify domain semantics, not overwrite them blindly.
Summary
Traditional ETL pipelines were built for a world that assumed data meaning could be centralized along with data movement. In small or stable environments, that assumption can hold for a while. In large enterprises, it fails. Semantics drift. Ownership blurs. Lead time slows. Trust dies quietly.
Data products offer a better path because they place meaning back in the domain. They treat analytical datasets as products with owners, contracts, quality expectations, and discoverability. They align naturally with domain-driven design by respecting bounded contexts and making translations explicit rather than accidental.
This does not eliminate transformation. It relocates responsibility. It does not remove the need for a central team. It redefines that team as a platform provider, policy enforcer, and enabler rather than the sole author of enterprise truth.
The migration is not a clean break. It is a progressive strangler: map domains, publish in parallel, reconcile relentlessly, move consumers incrementally, and retire ETL only when confidence is earned. Reconciliation is not a side task. It is the price of trust.
Kafka, microservices, CDC, and modern platforms help, but they are supporting actors. The main event is domain ownership of semantics.
That is the real replacement taking place. Not ETL versus no ETL. Not batch versus streaming. The real shift is from pipelines that move data without accountable meaning to products that carry both data and intent.
In enterprise architecture, that is as close to progress as we usually get.
Frequently Asked Questions
What is a data mesh?
A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.
What is a data product in architecture terms?
A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.
How does data mesh relate to enterprise architecture?
Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.