Your Lakehouse Is an Expensive Staging Area

⏱ 19 min read

There’s a particular kind of enterprise optimism that shows up right after a large data platform investment. The lakehouse is live. The ingestion layer is humming. Kafka topics are flowing. Cloud storage is filling at an impressive rate. Dashboards declare success because raw data is arriving from twenty-seven source systems in near real time.

And yet, six months later, the business still cannot answer ordinary questions with confidence.

Revenue appears in three places with three different meanings. Customer churn depends on which team defines “customer.” Finance does monthly reconciliation in spreadsheets because the so-called modern platform cannot explain why orders and invoices don’t line up. Data scientists have access to everything and trust almost nothing. The lakehouse, despite all the spend, has become what many enterprise platforms eventually become: an expensive staging area.

That is not a technology failure in the narrow sense. The pipelines may be fast. Storage may be cheap enough. Query performance may even be excellent. The real failure is architectural. We moved data, but we did not move meaning.

This is the central mistake in many modern data programs: treating ingestion as if it were architecture. It isn’t. Ingestion is plumbing. Necessary, important, and utterly insufficient.

If you want a platform that produces durable business value, you need a domain pipeline, not just an ingestion pipeline. You need data products shaped by business semantics, bounded contexts, explicit contracts, and reconciliation logic. You need a system that can say not merely “here is what SAP emitted at 09:14,” but “here is the enterprise definition of booked revenue, how it was derived, which policies were applied, and why this number differs from yesterday’s operational view.”

That is a different design problem.

Context

The modern enterprise data stack often begins with a sensible instinct: centralize access to source data. Pull from ERP, CRM, e-commerce platforms, operational databases, SaaS APIs, clickstreams, partner feeds, and event brokers. Land it all in a lakehouse. Normalize formats. Add cataloging, lineage, and some transformation tooling. Call the result a unified data platform.

This works for a while because centralization solves the first-order mess. It reduces one-off interfaces. It gives analysts a place to start. It shortens the path from “I need access” to “I can query something.” In heavily fragmented organizations, that is progress.

But centralization is not the same as coherence.

Most source systems are not authoritative about the enterprise concepts executives actually care about. They are authoritative only within the narrow boundary of the application that produced the data. CRM has leads and opportunities, but not recognized revenue. ERP has invoices and ledger entries, but not customer intent. Subscription billing has plans and renewals, but not the sales hierarchy that influenced conversion. Each system speaks with confidence inside its own bounded context and with dangerous ambiguity outside it.

This is why a raw or lightly transformed lakehouse tends to accumulate semantic debt. You can ingest forever without creating a trustworthy business view. In fact, more ingestion often makes the problem worse. The platform fills with overlapping representations of the same business reality. Every team creates its own “gold” layer. The center says there is one source of truth; the users quietly maintain four.

The issue is not whether lakehouses are useful. They are. The issue is what role they should play. In many enterprises, the right role is not “the final architecture,” but “the landing zone and historical substrate.” That is valuable. It is just not enough.

Problem

The architecture problem can be stated simply:

An ingestion pipeline preserves source structure; a domain pipeline creates business meaning.

Those are different jobs, and confusing them leads to expensive disappointment.

An ingestion-first architecture typically does the following well:

  • captures source data quickly
  • preserves fidelity
  • supports replay
  • provides broad access
  • enables exploratory analysis

What it does poorly, unless extended deliberately, is this:

  • define canonical business concepts
  • resolve semantic conflicts across systems
  • reconcile financial and operational views
  • encode policy-driven transformations
  • establish ownership of business data products
  • create trustworthy data contracts for downstream consumers

The common anti-pattern is to assume that medallion layers or a semantic layer alone will solve this. Bronze, silver, gold is fine as a storage-and-processing model. It is not, by itself, a domain model. A semantic layer can expose metric definitions, but if the underlying data products are built on unresolved domain ambiguity, the semantic layer is lipstick on a ledger discrepancy.

The result is familiar in large enterprises:

  • data engineering teams become ticket factories for downstream transformations
  • analysts reverse-engineer source semantics repeatedly
  • business teams debate definitions instead of decisions
  • finance and operations maintain parallel reporting pipelines
  • every acquisition introduces another semantic fracture
  • data governance becomes a policing exercise instead of a design discipline

The cost is not only technical. It is organizational. Trust erodes faster than platform capability grows.

Forces

This is where architecture becomes interesting, because the wrong answer is often attractive for good reasons.

Force 1: Raw ingestion is fast to implement

Landing source data into a lakehouse is tangible progress. You can show throughput charts, source coverage, and freshness metrics. Executives like visible motion. Vendors encourage it.

A domain pipeline, by contrast, forces conversations about business meaning, ownership, and policy. Those are slower and politically messier.

Force 2: Source systems carry local truth, not enterprise truth

A customer in CRM is not always a customer in billing. An order in commerce is not necessarily a recognized sale in finance. A shipment in logistics may represent fulfillment, partial fulfillment, or a failed operational attempt depending on context.

If you flatten these into one “customer” or one “order” table too early, you create false certainty.

Force 3: Event platforms multiply representations

Kafka and microservices help distribute change, but they also spread semantics. One service emits OrderPlaced, another emits PaymentCaptured, a third emits InvoiceGenerated. None of these events alone answers “what counts as net sales by region this quarter?” Event choreography is not a business model.
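To make the point concrete, here is a minimal sketch of why no single topic answers a business question: a fact like "net sales by region" only emerges after correlating several event types under an explicit domain policy. The event names come from the text above; the payload fields and the recognition rule are illustrative assumptions, not a real schema.

```python
from collections import defaultdict

# Hypothetical event payloads; field names are assumptions, not a real schema.
events = [
    {"type": "OrderPlaced",      "order_id": "A1", "amount": 120.0, "region": "EMEA"},
    {"type": "PaymentCaptured",  "order_id": "A1", "amount": 120.0},
    {"type": "InvoiceGenerated", "order_id": "A1", "amount": 120.0},
    {"type": "OrderPlaced",      "order_id": "B7", "amount": 80.0,  "region": "EMEA"},
    # B7 was placed and paid, but never invoiced -- not yet a sale.
    {"type": "PaymentCaptured",  "order_id": "B7", "amount": 80.0},
]

def net_sales_by_region(events):
    """Derive a business fact no single topic carries: invoiced sales by region."""
    orders = defaultdict(dict)
    for e in events:
        orders[e["order_id"]][e["type"]] = e
    totals = defaultdict(float)
    for order_id, seen in orders.items():
        # Domain policy (an assumption for illustration): an order counts
        # toward net sales only once an invoice exists for it.
        if "OrderPlaced" in seen and "InvoiceGenerated" in seen:
            totals[seen["OrderPlaced"]["region"]] += seen["InvoiceGenerated"]["amount"]
    return dict(totals)

print(net_sales_by_region(events))  # {'EMEA': 120.0}
```

The interesting part is what the function refuses to count: order B7 flowed through two topics and still is not a sale. That decision is domain policy, and it lives nowhere in the event stream itself.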

Force 4: Reconciliation is unavoidable

Any architecture dealing with money, inventory, compliance, or customer entitlements eventually runs into reconciliation. Operational systems report what happened locally. Financial systems report what was posted. The differences are not noise. They are often the business.

Force 5: Enterprises need gradual migration

No serious enterprise can stop the line and redesign all data around perfect domains. The architecture has to support progressive migration: coexistence, side-by-side validation, selective domain hardening, and a strangler path off legacy data marts and brittle ETL chains.

Force 6: Central teams cannot own all semantics

A platform team can provide tools, standards, and runtime capabilities. It cannot be the long-term owner of revenue, claims, inventory valuation, provider network quality, or policy exposure semantics. Those belong in domain ownership structures.

These forces point in one direction: keep ingestion broad, but build meaning in domain pipelines with explicit boundaries and contracts.

Solution

The solution is to separate data acquisition from business interpretation, and to make domain semantics a first-class architectural concern.

In practical terms:

  1. Use the lakehouse as a landing and history layer. Preserve source fidelity, retain raw and conformed records, and support replay.

  2. Build domain pipelines on top of it. Each pipeline owns a bounded context such as Customer, Order Fulfillment, Revenue, Claims, Product, or Risk.

  3. Define domain data products explicitly. Not just tables, but governed outputs with business definitions, quality rules, lineage, ownership, and consumption contracts.

  4. Introduce reconciliation as architecture, not an afterthought. Reconciliation pipelines compare operational views, financial postings, and domain products. Differences become visible, explainable states.

  5. Use event streams where they help, but don’t worship them. Kafka is excellent for propagating change and decoupling producers from consumers. It does not remove the need for domain interpretation or historical correction.

  6. Migrate progressively using a strangler approach. Stand up new domain products alongside legacy reports and marts. Validate, reconcile, then redirect consumers gradually.

The key idea is simple: the lakehouse stores the evidence; the domain pipeline states the case.

Architecture

A useful way to think about the architecture is as two distinct but connected planes:

  • Ingestion plane: capture, standardize transport concerns, retain history
  • Domain plane: apply business semantics, policies, reconciliation, and consumption contracts

Here is the broad shape.

Diagram 1
Architecture

This separation matters because it preserves optionality. You can ingest a new acquisition’s ERP feed within weeks without pretending you already understand its business semantics. Then you can incrementally map it into the relevant domains as those semantics are worked out.

That is healthy architecture. It acknowledges that understanding takes longer than transport.

Domain semantics and bounded contexts

Domain-driven design is the missing discipline in many data platforms. Not because data teams need to imitate software teams mechanically, but because DDD gives us a language for handling semantic conflict without pretending it does not exist.

A bounded context says: within this boundary, terms have specific meanings, models are internally consistent, and translation is explicit when crossing into another boundary.

This is exactly what enterprises need for data.

Take “customer”:

  • In CRM, a customer may mean a sales account
  • In e-commerce, it may mean a registered shopper identity
  • In billing, it may mean a bill-to party
  • In finance, it may mean a legal counterparty
  • In service operations, it may mean an installed base location

If you create a single universal customer table too early, you either oversimplify or create a monstrous compromise model no one trusts. Better to maintain domain-specific customer views and then create explicit mapping products where enterprise use cases require alignment.

Likewise for “order,” “policy,” “claim,” “member,” “supplier,” and “asset.” These are not just columns. They are business commitments encoded in systems.
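An explicit mapping product for the "customer" case above might look like the following sketch. The identifiers, match rules, and record shapes are illustrative assumptions; the point is that translation between bounded contexts is a named, owned, auditable artifact rather than a silent join.

```python
from dataclasses import dataclass
from typing import Optional

# A minimal sketch of an explicit cross-context mapping product.
# Identifiers and match rules here are illustrative assumptions.

@dataclass(frozen=True)
class CustomerMapping:
    enterprise_id: str                 # stable key owned by the mapping product
    crm_account_id: Optional[str]      # CRM-local identity (sales account)
    billing_bill_to_id: Optional[str]  # billing-local identity (bill-to party)
    match_rule: str                    # how the link was established (auditability)

MAPPINGS = [
    CustomerMapping("ENT-001", crm_account_id="CRM-42",
                    billing_bill_to_id="BILL-9", match_rule="tax_id_exact"),
    CustomerMapping("ENT-002", crm_account_id="CRM-77",
                    billing_bill_to_id=None, match_rule="unmatched_crm_only"),
]

def resolve(crm_account_id: str) -> Optional[str]:
    """Translate a CRM-local identity into the enterprise key, explicitly."""
    for m in MAPPINGS:
        if m.crm_account_id == crm_account_id:
            return m.enterprise_id
    return None  # unmapped keys are surfaced, not silently merged

print(resolve("CRM-42"))  # ENT-001
```

Note that an unmatched key returns None instead of being forced into the model. Unmapped identities are a reportable quality signal, which is exactly what a monstrous compromise table hides.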

Domain pipeline internals

A proper domain pipeline usually has stages such as:

  • source interpretation
  • business key resolution
  • policy application
  • state derivation
  • reconciliation
  • publication

For example, a Revenue domain product may combine data from order management, billing, returns, finance journal entries, and contract metadata. It must apply rules for recognition timing, cancellations, credits, tax treatment, currency conversion, and legal entity assignment. That is not a generic silver-to-gold transform. That is business logic with accountability.

Diagram 2
Domain pipeline internals

Notice the exception queue. Real enterprises need somewhere for ambiguity to live. Not every discrepancy can be auto-resolved. Good architecture leaves room for operational judgment.
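A policy-application stage with an exception queue can be sketched roughly as follows. The record fields and the two recognition rules are illustrative assumptions, not anyone's actual accounting policy; what matters is that ambiguity becomes an explicit routed state with a reason code and a pinned policy version.

```python
# A sketch of one domain-pipeline stage: policy application with an explicit
# exception queue. Record fields and the rules are illustrative assumptions.

def apply_recognition_policy(records):
    recognized, exceptions = [], []
    for r in records:
        if r.get("invoice_id") is None:
            # Ambiguity is a first-class state, not a silent drop.
            exceptions.append({**r, "reason": "no_invoice_linked"})
        elif r.get("currency") not in ("USD", "EUR"):
            exceptions.append({**r, "reason": "unsupported_currency"})
        else:
            # Stamp the policy version so downstream consumers can explain
            # why two runs of the same data may differ.
            recognized.append({**r, "policy_version": "rev-rec-2024.2"})
    return recognized, exceptions

records = [
    {"order_id": "A1", "invoice_id": "INV-1", "currency": "USD", "amount": 120.0},
    {"order_id": "B7", "invoice_id": None,    "currency": "USD", "amount": 80.0},
]
ok, exc = apply_recognition_policy(records)
print(len(ok), len(exc))  # 1 1
```

The reason codes are what make the exception queue operable: stewards can triage "no_invoice_linked" as a batch rather than rediscovering each case by hand.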

Where Kafka and microservices fit

Kafka can be immensely useful here, especially in organizations moving toward event-driven microservices. But it should be used with discipline.

Good uses include:

  • propagating source changes quickly
  • capturing business events for downstream domain processing
  • decoupling source systems from consumers
  • replaying event history for domain recomputation

Bad uses include:

  • assuming event topics are already business truth
  • creating analytics directly on raw service events without semantic hardening
  • letting every microservice define enterprise metrics by implication

Microservices are bounded contexts in motion. That does not mean their event logs are automatically fit for enterprise reporting. Service events are optimized for local autonomy. Domain data products are optimized for shared business understanding. Related, yes. Identical, no.

Migration Strategy

The migration path matters as much as the target architecture. Grand redesigns die in steering committees. Successful change usually happens through progressive strangling.

The strategy I recommend is this:

1. Stabilize ingestion without overselling it

Keep the current lakehouse ingestion program, but rename its purpose honestly. It is the landing zone, replay source, and historical substrate. Stop calling it the enterprise truth layer unless it really is.

This is more than semantics. It resets expectations and creates room for the right next step.

2. Pick one painful, high-value domain

Choose a domain where semantic confusion causes visible business pain. Revenue is common. Claims is another. Customer 360 is often attempted first, though I prefer something tied directly to financial or operational risk because urgency sharpens ownership.

3. Create a domain-owned product team

This team should include:

  • domain SME(s)
  • data engineer(s)
  • analytics engineer or modeler
  • architect
  • product owner
  • finance or operational reconciliation counterpart where relevant

The platform team should enable the semantics, not own them.

4. Build side-by-side with legacy outputs

Do not switch consumers immediately. Produce the new domain product alongside the existing warehouse mart or report. Compare outputs for several cycles. Investigate deltas. Classify whether they arise from defects, timing differences, policy changes, or previously hidden source inconsistencies.

5. Formalize reconciliation

This is where many migrations become serious. Reconciliation is not “our numbers are close enough.” It is a structured explanation of variance by category and materiality.

Examples:

  • timing lag
  • duplicate source emission
  • missing master data
  • currency conversion policy mismatch
  • return applied to wrong accounting period
  • account hierarchy mismatch after acquisition
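The categories above can be made executable. Here is a sketch of a variance classifier that explains each delta between an operational row and a ledger row instead of averaging it away; the field names and the tolerance threshold are illustrative assumptions.

```python
# A sketch of reconciliation as structured explanation: each delta between the
# operational view and the general ledger is classified, never hand-waved.
# Field names and the lag tolerance are illustrative assumptions.

def classify_variance(op_row, gl_row, lag_days_tolerance=3):
    delta = round(op_row["amount"] - gl_row["amount"], 2)
    if delta == 0:
        return "matched"
    if abs(op_row["posted_day"] - gl_row["posted_day"]) <= lag_days_tolerance:
        return "timing_lag"
    if op_row.get("currency") != gl_row.get("currency"):
        return "currency_policy_mismatch"
    return "unexplained"  # routed to the exception workflow for stewardship

op = {"amount": 100.0, "posted_day": 31, "currency": "EUR"}
gl = {"amount": 98.0,  "posted_day": 33, "currency": "EUR"}
print(classify_variance(op, gl))  # timing_lag
```

In production the categories would be richer and materiality-weighted, but even this shape changes the conversation: "we are off by 2%" becomes "98% matched, 1.5% timing lag, 0.5% unexplained and queued."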

6. Redirect consumers gradually

Move one dashboard, one finance process, one downstream API, one ML feature pipeline at a time. Use compatibility views if needed. Retire legacy transformations only after the new product has proven stable.

7. Repeat domain by domain

The organization learns how to do semantic migration, not just technical migration. That capability compounds.

Here is the migration shape.

Diagram 3
Migration shape

The strangler pattern works here because data consumers can be moved gradually, and because reconciliation provides evidence for confidence. Without reconciliation, migration becomes faith-based. Enterprises rightly distrust faith-based reporting.

Enterprise Example

Consider a global manufacturer with direct e-commerce, distributor channels, regional ERPs from acquired companies, and a central finance platform. They invested heavily in a cloud lakehouse. Source onboarding went well: SAP instances, Salesforce, Shopify, logistics feeds, and Kafka events from order services all landed centrally.

The platform team declared success. Then quarter-end happened.

Sales leadership reported one number for gross orders. Finance reported another for net revenue. Regional operations had a third number for shipped sales. The differences were not small. They were structurally recurring.

Why?

Because the lakehouse had centralized records, not enterprise meaning.

  • The order service emitted orders at checkout.
  • ERP reflected orders accepted into fulfillment.
  • Billing created invoice documents after shipment.
  • Returns were posted days or weeks later.
  • Finance recognized revenue by accounting policy, not order date.
  • Acquired regions had local product hierarchies not aligned to the global catalog.
  • Distributor sell-in and sell-through were blended in some reports and separated in others.

The original architecture treated all this as a transformation backlog problem. It wasn’t. It was a domain problem.

The company reorganized around three initial domain pipelines:

  • Order Intake
  • Fulfillment
  • Revenue

Each domain had explicit ownership and contracts. The lakehouse remained the historical substrate. Kafka events were consumed, but not treated as final truth. Revenue became the first strategic domain because quarter-end trust was the burning platform.

The Revenue pipeline did several things the ingestion architecture never had:

  • mapped local ERP posting patterns into a common revenue event model
  • resolved legal entity and region assignment rules
  • linked invoices, credits, and returns to commercial order lineage
  • applied accounting timing policies
  • generated reconciliation views against the general ledger
  • published certified domain outputs for finance and management reporting

The first three months were uncomfortable. Variances surfaced everywhere. But that discomfort was productive. The architecture had finally made hidden semantic fractures visible.

Within two quarters, finance retired several spreadsheet reconciliations. Sales and finance still had different views for some use cases, but now the difference was explicit and intentional: bookings versus recognized revenue. That is healthy. One number is not always the answer. Knowing which number answers which question is.

This is what good enterprise data architecture looks like in practice. Not false unification. Disciplined translation.

Operational Considerations

A domain pipeline architecture shifts the operational burden. It reduces some chaos and introduces some rigor. You should go in with your eyes open.

Data contracts

Contracts should exist at multiple levels:

  • source-to-ingestion technical contracts
  • ingestion-to-domain structural contracts
  • domain-to-consumer semantic contracts

A field existing is not the same as a field meaning the same thing over time. Version your semantics, not just your schemas.
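One way to make that concrete is to carry a semantic version on the contract alongside the schema version. The sketch below is illustrative; the version scheme and field names are assumptions, but it shows the failure it guards against: a definition change that is structurally invisible and semantically breaking.

```python
from dataclasses import dataclass

# A sketch of a domain-to-consumer contract that versions *meaning*, not just
# shape. Field names and the version scheme are illustrative assumptions.

@dataclass(frozen=True)
class SemanticContract:
    product: str
    schema_version: str    # structural compatibility (columns, types)
    semantic_version: str  # bumped when a definition or policy changes
    definition: str        # human-readable statement of what the number means

V1 = SemanticContract("revenue.net_sales", "1.4", "2024.1",
                      "Invoiced sales net of returns, order-date attribution")
V2 = SemanticContract("revenue.net_sales", "1.4", "2024.2",
                      "Invoiced sales net of returns, recognition-date attribution")

def breaking_for_consumer(old: SemanticContract, new: SemanticContract) -> bool:
    # Same schema, different meaning: structurally silent, semantically breaking.
    return old.semantic_version != new.semantic_version

print(breaking_for_consumer(V1, V2))  # True
```

A schema diff tool would wave V1 → V2 straight through; a finance consumer comparing quarters would not.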

Observability

Measure more than freshness and job success. Add:

  • reconciliation variance by category
  • domain rule failure rates
  • unmapped business keys
  • late arriving events by source
  • semantic drift indicators
  • policy version usage

If your dashboards only show pipeline latency, you are monitoring plumbing, not business trust.

Late data and correction handling

In real systems, data arrives late, out of order, duplicated, or revised. Domain products need explicit policies for:

  • backfills
  • retroactive corrections
  • restatements
  • snapshot versus event-derived state
  • slowly changing reference data

This is especially important with Kafka and event streams. Replay is powerful, but replay without deterministic policy control is just a faster route to inconsistent answers.
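What "deterministic policy control" means in practice can be sketched briefly: derived state must depend only on the events and a pinned policy version, never on arrival order or redelivery. The event types and ordering key below are illustrative assumptions.

```python
# A sketch of deterministic recomputation under replay: the same events plus a
# pinned policy version always yield the same state. Names are assumptions.

def derive_order_state(events, policy_version="fulfil-2024.1"):
    # Sort by (event_time, event_id) so the result does not depend on arrival
    # order; drop duplicates by event_id so redelivery is harmless.
    seen, ordered = set(), []
    for e in sorted(events, key=lambda e: (e["event_time"], e["event_id"])):
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            ordered.append(e)
    state = "placed"
    for e in ordered:
        if e["type"] == "Shipped":
            state = "shipped"
        elif e["type"] == "Cancelled":
            state = "cancelled"
    return {"state": state, "policy_version": policy_version}

late_and_duplicated = [
    {"event_id": "e2", "event_time": 2, "type": "Shipped"},
    {"event_id": "e1", "event_time": 1, "type": "Placed"},
    {"event_id": "e2", "event_time": 2, "type": "Shipped"},  # redelivery
]
print(derive_order_state(late_and_duplicated)["state"])  # shipped
```

Replaying the topic a second time, or receiving the shipment event twice, produces the identical answer, which is the property restatements and backfills depend on.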

Stewardship and exception management

Some discrepancies need human intervention. Build workflows for exception review, remediation, and annotation. If every unresolved issue becomes ad hoc Slack archaeology, the platform will rot socially before it rots technically.

Security and regulatory boundaries

Bounded contexts can help here too. Customer, claims, and healthcare member data often require stricter controls than generic operational telemetry. Domain pipelines can enforce policy boundaries more cleanly than broad open lake access.

Tradeoffs

There is no free lunch here. A domain pipeline architecture is better for meaning, but it costs more than blind ingestion.

Benefits

  • higher trust in business data
  • clearer ownership
  • better support for financial and operational reconciliation
  • stronger alignment with domain-driven design
  • easier consumer adoption through explicit products
  • reduced duplication of downstream semantic logic

Costs

  • slower initial delivery per domain
  • more dependency on business participation
  • more governance work up front
  • potential overlap across bounded contexts
  • need for architectural discipline to avoid a new sprawl of “domain products” with weak definitions

This is a trade worth making when the business depends on shared meaning. It may not be worth making for purely exploratory or low-stakes analytical workloads.

One memorable rule of thumb: if the consequence of being wrong is a meeting, use ingestion. If the consequence of being wrong is a financial restatement, use a domain pipeline.

Failure Modes

Let’s be blunt. There are several easy ways to get this wrong.

Failure mode 1: Rebranding ETL as domain architecture

If teams simply rename gold tables as “data products” without domain ownership, semantic definitions, and reconciliation, nothing has changed.

Failure mode 2: Forcing a universal canonical model too early

The dream of one enterprise-wide canonical everything usually collapses under real variation. Start with bounded contexts and explicit translations.

Failure mode 3: Over-centralizing semantic authority

A central data office cannot author all business meaning. It can set standards and arbitration mechanisms, but domain truth needs domain ownership.

Failure mode 4: Ignoring reconciliation because it is messy

This is the most common enterprise mistake. Reconciliation is where trust is won. Avoid it, and users will build parallel controls outside the platform.

Failure mode 5: Treating Kafka topics as final truth

Streams are useful evidence. They are not self-justifying business products.

Failure mode 6: Migrating consumers too early

If you cut over before side-by-side validation is mature, every discrepancy becomes a political crisis.

When Not To Use

You do not need this architecture everywhere.

Do not use a full domain pipeline approach when:

  • the use case is exploratory analysis with low consequence
  • the source system is already the trusted business system for that question
  • the data has short-lived tactical value
  • the organization lacks any domain ownership model at all
  • the volume of semantic disagreement does not justify the complexity
  • a simple warehouse mart can solve the problem adequately

Also, if your enterprise is still at the stage where basic ingestion reliability is poor, fix that first. Domain semantics built on unstable acquisition are castles on sand.

There is a sequencing issue here. Do not jump to sophisticated domain architecture because it sounds mature. Use it when the economics of trust justify it.

Related Patterns

Several adjacent patterns often fit well with this approach.

Data products

Useful when they are truly owned, documented, versioned, and consumed through explicit contracts. Useless when they are just renamed datasets.

Data mesh

Helpful as an organizational framing if you already have real domains and platform enablement. Harmful when used as a slogan to decentralize chaos.

Medallion architecture

Good as a processing and quality progression pattern. Insufficient as a semantic architecture on its own.

CQRS-style read models

Relevant when operational APIs or internal applications need domain-specific projections derived from events and source records.

Event sourcing

Powerful in narrow contexts, especially within microservices. Dangerous to confuse with enterprise historical truth across systems unless boundaries are tightly managed.

Master data management

Still relevant, particularly for identity resolution and reference alignment. But MDM does not replace bounded contexts; it supports carefully chosen shared reference needs.

Summary

A lakehouse is a valuable component. But in many enterprises, it is being asked to play a role it cannot play by itself. Raw ingestion creates access. It does not create meaning. Technical centralization does not resolve domain ambiguity. More pipelines do not equal more truth.

If your data platform is full of activity but starved of trust, the missing piece is usually not another ingestion connector. It is architecture that respects business semantics.

Use the lakehouse as the place where evidence lands and history is preserved. Then build domain pipelines that encode bounded contexts, apply policy, reconcile competing views, and publish trustworthy data products. Migrate progressively. Reconcile relentlessly. Let domain ownership define meaning, and let the platform make that meaning operable.

The difference between an ingestion pipeline and a domain pipeline is the difference between moving boxes into a warehouse and running a business.

Too many enterprises have built very elegant warehouses.

Now they need a business.
