Most data platforms fail in a strangely professional way.
They fail with excellent slideware, expensive tooling, and a very modern vocabulary. They fail while everyone nods at words like lakehouse, mesh, real-time, and self-service. They fail because the organization quietly makes one architectural mistake and then builds an empire on top of it: it confuses where data lands with what data means.
That confusion is not a minor modeling issue. It is the root of a lot of enterprise pain.
A lakehouse is useful. Often very useful. It can be the right place for ingestion, storage, replay, historical analysis, machine learning features, and large-scale processing. But a lakehouse is not, by itself, a data platform in the architectural sense that matters to the business. It does not define domain semantics. It does not resolve ownership. It does not tell you what a customer is, when an order is booked, whether a shipment is partial, or why revenue was recognized. It certainly does not make conflicting operational truths disappear.
Put bluntly: ingestion is plumbing; semantics is architecture.
And too many enterprises have spent the last five years perfecting the plumbing while neglecting the building.
This article makes an opinionated case: the right architecture separates ingestion infrastructure from domain semantic products. The lakehouse should be treated as a foundational data substrate, not the source of business meaning. Business meaning belongs closer to domain boundaries, explicit contracts, and governed semantic models. If you miss that distinction, you get a technically impressive swamp. If you embrace it, you get a platform that survives M&A, system replacement, audit scrutiny, and the daily malice of reality.
Context
The modern enterprise usually arrives at the lakehouse honestly.
There are too many systems. ERP, CRM, billing, WMS, e-commerce, policy admin, mobile apps, partner APIs, IoT feeds, and a growing population of SaaS platforms that each export a slightly different CSV-shaped opinion of the truth. Teams are tired of brittle nightly ETL. Data scientists want access. Finance wants consistency. Operations wants dashboards that don’t lie. Compliance wants lineage. Executives want “one version of the truth,” a phrase that has ruined more architecture discussions than almost any other.
So the organization builds a central landing zone. Then a curated layer. Then some gold tables. Then reverse ETL. Then maybe streaming ingestion through Kafka. Then a semantic layer gets mentioned, but in practice semantics are still reconstructed downstream by analytics teams with SQL, notebooks, BI tools, and tribal knowledge.
This is where things start to drift.
The lakehouse becomes the default answer to every data question. Need customer lifetime value? Put it in the lakehouse. Need regulatory reporting? Lakehouse. Need cross-channel order status? Lakehouse. Need master data resolution? Lakehouse. Need operational event history? Lakehouse. Need machine learning training data? Also lakehouse.
Soon the platform team is running a centralized factory for every unresolved business disagreement in the company.
That is not scale. That is architectural debt with parquet files.
The better way begins with a simple distinction:
- Ingestion architecture answers: how do events, files, CDC streams, APIs, and external data arrive reliably?
- Domain semantics architecture answers: what do these facts mean, who owns the meaning, how are concepts defined, and how do consumers trust the resulting products?
Those are different concerns. They can be connected. They should not be collapsed.
Problem
The core problem is that raw and curated storage layers are often asked to do work that belongs to domain design.
A lakehouse is very good at collecting data in many shapes and retaining history cheaply. It is much less good at settling semantic disputes between domains with different incentives, timing, and definitions.
Take “customer.” Sales may define customer as an account with a signed agreement. Billing may define it as an entity with an active receivable relationship. E-commerce may define it as a registered user. Service may define it as an installed base location. Compliance may define it according to legal entity hierarchy. If the architecture assumes these can be solved by simply centralizing all source records and “curating” them into a canonical table, then the platform team becomes a reluctant priesthood of business meaning.
That is how bottlenecks are born.
The symptoms are familiar:
- Hundreds of bronze/silver/gold tables with unclear ownership
- Kafka topics that mirror database tables but carry no business event meaning
- BI teams rewriting metric logic repeatedly
- “Customer 360” programs that never stabilize
- Reconciliation disputes between operational systems and analytics outputs
- Platform teams forced to interpret domain rules they do not own
- System migrations blocked because too many downstream consumers bind directly to source-shaped data
The underlying issue is not insufficient tooling. It is that the architecture has centered the integration substrate rather than the domain boundary.
This matters even more in event-driven and microservices-heavy estates. Kafka can transport facts; it cannot assign business meaning by magic. A topic named orders is not an order domain model. CDC from an order table is not an order lifecycle event stream. If you publish low-level state mutations without semantic contracts, you simply move the confusion faster.
Speed without meaning is a very efficient path to mistrust.
Forces
Several forces pull enterprises toward the wrong shape.
1. The gravitational pull of centralization
A lakehouse is visible. It has cost curves, vendors, dashboards, and platform teams. Domain semantics are messier. They require negotiation with business units, bounded contexts, ownership models, and governance that works through accountability rather than decree. Central platforms are easier to fund. Domain design is harder to fake.
2. The desire for reuse
Everyone wants a reusable canonical model. That instinct is understandable and dangerous. Shared semantics are valuable, but premature canonicalization often destroys domain nuance. You end up with a generic model that satisfies nobody and leaks source-system assumptions everywhere.
3. Migration pressure
Legacy systems are being replaced constantly: ERP modernization, CRM replatforming, commerce rebuilds, policy engine replacement, warehouse automation, core banking transformations. If consumers are tightly coupled to source-specific extracts in the lakehouse, every migration becomes a data blast radius event. The organization then realizes too late that it lacked stable semantic interfaces.
4. Real-time expectations
Streaming architecture and Kafka raise the stakes. Business leaders now expect fresh data, not just nightly snapshots. But freshness amplifies semantic ambiguity. If a “shipment delivered” event can arrive before billing closes, before returns windows start, and before partner acknowledgment, what exactly should downstream consumers believe?
5. Audit and compliance
Lineage is not enough. Auditors and regulators care about definitional integrity, controls, and reconciliation. A technically traced pipeline that moves an ambiguous metric from A to B is still ambiguous.
6. Organizational reality
The teams who understand business meaning are rarely the same teams who operate the data substrate. That separation is not a flaw. It is a fact. Good architecture reflects it.
Solution
The solution is to stop treating the lakehouse as the business brain of the enterprise.
Use it as a data substrate for ingestion, persistence, processing, replay, and analytical scale. Then build a layer of domain-aligned semantic products above and around it, with explicit ownership, contracts, and reconciliation logic. In domain-driven design terms, the key move is to separate bounded contexts from transport and storage concerns.
A sane architecture usually has three distinct concerns:
- Ingestion fabric
Handles CDC, events, files, APIs, Kafka streams, partner feeds, schema registration, retention, and observability. Its job is reliable movement and preservation of facts.
- Domain semantic products
Owned by domain teams or federated data product teams. These products define business entities, events, states, metrics, and contracts in a bounded context. They reconcile source ambiguity and expose trustworthy interfaces.
- Consumption and composition layer
BI, ML, operational analytics, regulatory reporting, data science, APIs, downstream applications. Some consumers use raw-ish historical data; many should consume semantic products instead.
That separation changes everything.
Instead of asking the lakehouse to produce “the enterprise customer,” you define domain-specific semantic products such as:
- customer billing relationship
- retail shopper profile
- service install base customer
- legal entity hierarchy
- cross-domain customer reference product, if and only if the business genuinely needs one
Notice the difference. Semantics are no longer assumed to collapse into one table. They are modeled deliberately.
A good line to remember is this: raw data should be easy to land; trusted meaning should be hard to earn.
A reference architecture
In this model, the lakehouse is essential but not sovereign. Domain products derive from it, and sometimes also directly from streams or operational stores, but they carry the business contract.
This is deeply aligned with domain-driven design:
- bounded contexts define meaning
- upstream/downstream relationships are explicit
- anti-corruption layers protect consumers from source churn
- ubiquitous language belongs in the semantic product, not buried in ingestion pipelines
A lot of so-called modern data architecture is really just integration architecture wearing analytics clothes. The fix is to bring back domain thinking.
Architecture
Let’s make this concrete.
Ingestion vs semantics
The ingestion fabric should optimize for:
- broad connectivity
- high reliability
- replayability
- immutable history where useful
- schema evolution handling
- operational metadata
- low-friction onboarding
It should not be where teams casually invent business definitions.
Semantic products should optimize for:
- explicit ownership
- business vocabulary
- versioned contracts
- reconciliation and exception handling
- quality controls tied to business meaning
- consumer trust
- resilience to source-system replacement
That means a semantic product is more than a table. It is a package:
- model definitions
- transformation logic
- business rules
- data quality assertions
- lineage
- reconciliations
- SLA/SLOs
- support model
- change policy
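One way to make the "package" idea tangible is a product manifest that travels with the semantic product in its own repository. The following is a minimal sketch, not a standard; every field name, team name, and value is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticProductManifest:
    """Metadata that ships with a semantic product, not just its tables.

    All names and values here are illustrative, not a standard.
    """
    name: str
    owner_team: str
    contract_version: str               # semver for the published schema and semantics
    sources: list[str]                  # upstream systems of record
    freshness_slo_minutes: int          # how stale the product is allowed to be
    reconciliation_controls: list[str] = field(default_factory=list)
    do_not_use_for: list[str] = field(default_factory=list)

# A hypothetical Order Lifecycle product manifest.
order_lifecycle = SemanticProductManifest(
    name="order-lifecycle",
    owner_team="commerce-domain",
    contract_version="2.1.0",
    sources=["erp_eu_orders", "ecom_orders"],
    freshness_slo_minutes=30,
    reconciliation_controls=["daily booked-order control total vs ERP"],
    do_not_use_for=["regulatory revenue reporting"],
)
print(order_lifecycle.contract_version)  # 2.1.0
```

The point is that ownership, SLOs, and "do not use for" warnings become versioned artifacts rather than tribal knowledge.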
This is where many “data mesh” conversations go wrong. They talk about ownership but skip semantics. Ownership without semantic contracts is decentralization of chaos.
Domain events are not CDC
In Kafka-centered estates, one of the most expensive mistakes is to treat CDC streams as business events. They are not the same.
CDC tells you what changed in a database row. A domain event tells you something meaningful happened in the business. Those differ in timing, granularity, and intent.
For example:
- CDC: orders.status changed from 2 to 3
  Domain event: OrderBooked
- CDC: shipment.delivery_timestamp updated
  Domain event: DeliveryConfirmedByCarrier
- CDC: invoice.paid_flag true
  Domain event: PaymentSettled
The semantic product should often translate low-level technical emissions into meaningful domain facts, preserving lineage back to raw records. That translation layer is where bounded context knowledge lives.
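In practice the translation is often a small mapping function in the product's pipeline. A sketch under invented assumptions: the status codes (2 = captured, 3 = booked), the field names, and the offset format are all hypothetical, not from any specific CDC tool:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class DomainEvent:
    """A business fact with lineage back to the raw change record."""
    event_type: str
    order_id: str
    occurred_at: datetime
    source_offset: str  # lineage pointer back to the raw CDC record

# Hypothetical mapping from low-level status transitions to domain meaning.
# This table is where bounded-context knowledge lives.
STATUS_TRANSITIONS = {
    (2, 3): "OrderBooked",
    (3, 4): "OrderAllocated",
}

def from_cdc(change: dict) -> Optional[DomainEvent]:
    """Translate a CDC row change into a domain event, or drop it
    when the transition carries no business meaning."""
    event_type = STATUS_TRANSITIONS.get(
        (change["before_status"], change["after_status"])
    )
    if event_type is None:
        return None
    return DomainEvent(
        event_type=event_type,
        order_id=change["order_id"],
        occurred_at=datetime.fromisoformat(change["commit_ts"]),
        source_offset=change["offset"],
    )

evt = from_cdc({
    "order_id": "A-1001",
    "before_status": 2,
    "after_status": 3,
    "commit_ts": "2024-05-01T10:15:00+00:00",
    "offset": "orders-cdc:0:42",
})
print(evt.event_type)  # OrderBooked
```

Note that the function deliberately returns nothing for transitions the domain does not recognize: not every row change deserves to become an event.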
Reconciliation is part of the architecture, not cleanup
Enterprises routinely under-architect reconciliation. Then they discover that every serious business process depends on it.
A robust semantic product must answer:
- How does this product reconcile to source systems of record?
- What are acceptable variances?
- How are late-arriving records handled?
- How are duplicates, reversals, cancellations, and re-statements represented?
- What is the control point for period close or audit?
- Which truth is provisional, and which is authoritative?
Reconciliation is not an afterthought. It is how trust survives asynchronous systems.
Notice the shift here: semantic products are not merely transformed data sets. They are controlled interpretations with feedback loops for exceptions.
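A control-total reconciliation can be expressed directly in the product's pipeline rather than in a quarterly spreadsheet. A minimal sketch, with the tolerance and the figures invented for illustration:

```python
def reconcile_control_total(source_total: float, product_total: float,
                            tolerance_pct: float = 0.01) -> dict:
    """Compare a source-of-record total to the semantic product's total.

    Returns a control record rather than silently passing, so out-of-tolerance
    variances can be routed to a named owner for remediation.
    """
    variance = product_total - source_total
    variance_pct = abs(variance) / source_total if source_total else 0.0
    return {
        "source_total": source_total,
        "product_total": product_total,
        "variance": variance,
        "within_tolerance": variance_pct <= tolerance_pct,
    }

# Example: daily booked-order value, ERP system of record vs the product.
control = reconcile_control_total(source_total=1_000_000.0,
                                  product_total=1_000_450.0)
print(control["within_tolerance"])  # True: 0.045% variance, under the 1% threshold
```

What the acceptable variance actually is, and who gets paged when it is breached, is a business decision the product contract must state explicitly.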
Semantic layers should be plural, not singular
There is no law of nature saying the enterprise gets exactly one semantic layer. In practice, you often need:
- domain semantic products for operational trust
- conformed analytical models for enterprise reporting
- feature-oriented abstractions for data science
- regulatory views with tightly governed definitions
Trying to flatten these into one universal model usually ends badly. Better to acknowledge multiple semantic viewpoints and manage their relationships explicitly.
Migration Strategy
Most organizations cannot stop the world and redesign their data platform from first principles. Nor should they. The right move is a progressive strangler migration.
Do not rip out the lakehouse. Reframe it.
Start by identifying where the central platform is currently serving as an accidental semantic authority. Then peel those responsibilities into domain-aligned products one by one, leaving ingestion and storage intact where they still add value.
A practical strangler path
1. Map current consumers
Identify dashboards, reports, data science assets, regulatory extracts, APIs, and operational dependencies that consume lakehouse tables directly.
2. Classify data assets
Separate:
- raw landed assets
- technical integration assets
- implicit semantic assets
- enterprise reports and metrics
This reveals where semantic logic is currently hidden.
3. Select high-value bounded contexts
Start with a domain where ambiguity is painful and ownership is clear:
- order lifecycle
- invoice and payment status
- inventory availability
- customer billing relationship
Not “enterprise customer” unless you enjoy suffering.
4. Create explicit semantic contracts
Define entities, events, state transitions, quality rules, and reconciliation controls. Version them.
5. Introduce anti-corruption layers
Shield consumers from source schemas and migration churn.
6. Run semantic products in parallel
Publish side-by-side with existing curated tables. Measure variances. Build confidence.
7. Cut over consumers gradually
Prioritize high-trust use cases first. Leave exploratory use cases on raw/curated layers longer.
8. Retire accidental semantic assets
Once consumers have migrated, demote old curated tables to technical artifacts or archive them.
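The “create explicit semantic contracts” step works best when the contract is a literal artifact, not a wiki page. The following is one possible sketch; the product name, states, events, and reconciliation rule are all hypothetical:

```python
# A versioned contract as a checked-in artifact. Everything here is
# illustrative: a real contract would be agreed with the domain owner.
ORDER_LIFECYCLE_CONTRACT = {
    "product": "order-lifecycle",
    "version": "1.0.0",
    "entities": {
        "Order": {"order_id": "string", "booked_at": "timestamp", "state": "string"},
    },
    "events": ["OrderCaptured", "OrderBooked", "OrderShipped", "OrderReturned"],
    # Legal state transitions; anything else becomes a quality exception.
    "transitions": {
        "captured": ["booked", "cancelled"],
        "booked": ["shipped", "cancelled"],
        "shipped": ["returned"],
    },
    "reconciliation": ["daily booked-order count vs ERP system of record"],
}

def is_legal_transition(contract: dict, current: str, target: str) -> bool:
    """Check a state change against the contract's transition rules."""
    return target in contract["transitions"].get(current, [])

print(is_legal_transition(ORDER_LIFECYCLE_CONTRACT, "captured", "booked"))  # True
print(is_legal_transition(ORDER_LIFECYCLE_CONTRACT, "shipped", "booked"))   # False
```

Once the contract is machine-readable, the parallel-run step can validate both the old curated tables and the new product against the same rules.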
The migration logic that matters
This strategy works because it respects operational reality:
- source systems will continue to change
- consumers cannot all migrate at once
- definitions require negotiation
- trust is earned through reconciliation, not slogans
It also preserves optionality. If you later replace ERP or CRM, consumers bound to the semantic product remain insulated. That insulation is one of the most underappreciated payoffs in enterprise architecture.
A strangler migration is boring in the right ways. It reduces risk by making meaning explicit before systems are replaced.
Enterprise Example
Consider a global manufacturer with three major business lines and a decade of acquisitions. It runs SAP for core finance in some regions, Oracle ERP in others, Salesforce for account management, a custom e-commerce stack, multiple warehouse systems, and regional billing platforms. Over time it built a large lakehouse with CDC from major systems, Kafka for application events, and a substantial BI estate.
On paper, this looked modern.
In practice, the central data team had become the referee for endless disputes:
- What counts as a booked order?
- Which customer hierarchy should revenue roll up against?
- When is inventory “available” if it is allocated but not yet picked?
- How should returns and warranty replacements affect sales metrics?
- Why do finance, supply chain, and commerce dashboards disagree?
The breaking point came during an ERP migration in Europe. Downstream reports and ML models were tightly coupled to source-shaped curated tables built from old ERP extracts. Every source field change triggered remediation across dozens of pipelines and reports. Kafka helped move more data faster, but most topics were CDC-shaped, so downstream consumers still encoded source-specific assumptions.
The architecture team changed course.
They kept the lakehouse and Kafka backbone. But they introduced domain semantic products for:
- Order Lifecycle
- Billing Relationship
- Inventory Position
- Product Commercial Hierarchy
Each product had a named owner, explicit schema contracts, business rules, reconciliation to source-of-record controls, and a support model. The Order Lifecycle product, for example, defined events such as OrderCaptured, OrderBooked, OrderAllocated, OrderShipped, DeliveryConfirmed, OrderReturned, each with rules for late-arriving changes, split shipments, cancellations, and restatements.
Importantly, the product did not pretend there was a single universal order truth. It described one bounded context for enterprise reporting and cross-channel operations, while preserving lineage to regional systems.
During the ERP migration, consumer dashboards and supply chain analytics were moved from old curated ERP-shaped tables to the new semantic product. The source mappings changed significantly behind the scenes. Most consumers did not care. That was the point.
Results after 12 months were not miraculous, but they were real:
- materially fewer downstream breaks during migration releases
- much faster reconciliation for period-end order and revenue controls
- reduced duplication of metric logic across BI teams
- clearer accountability when definitions changed
- improved trust in inventory and fulfillment reporting
The company did not achieve a metaphysical single source of truth. It achieved something better: stable, governed truths for specific purposes.
That is how grown-up enterprises work.
Operational Considerations
A semantic architecture lives or dies in operations, not just design.
Ownership model
Every semantic product needs a real owner. Not a committee. Not a mailbox. A team with authority over definitions, quality thresholds, release cadence, and consumer communication.
Platform teams own the substrate.
Domain-aligned teams own meaning.
If that line blurs, the old failure pattern returns.
Data quality as business control
Quality checks should not stop at null counts and schema drift. Those matter, but they are table stakes. You also need:
- state transition validation
- duplicate business event detection
- period completeness checks
- source-to-product control total reconciliation
- threshold alerts on business metric variances
- late-arrival and correction monitoring
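Duplicate business event detection, for instance, is not the same as row deduplication: two records can differ technically while asserting the same business fact. A minimal sketch, where the business identity rule (event type plus order id) is an assumption each product would define for itself:

```python
def detect_duplicate_events(events: list[dict]) -> list[dict]:
    """Flag events that assert the same business fact more than once.

    The business key used here (event_type, order_id) is illustrative;
    each semantic product defines its own identity rule.
    """
    seen: set[tuple[str, str]] = set()
    duplicates = []
    for event in events:
        key = (event["event_type"], event["order_id"])
        if key in seen:
            # Route to an exception workflow; do not drop silently.
            duplicates.append(event)
        else:
            seen.add(key)
    return duplicates

stream = [
    {"event_type": "OrderBooked", "order_id": "A-1", "offset": "0:10"},
    {"event_type": "OrderShipped", "order_id": "A-1", "offset": "0:11"},
    {"event_type": "OrderBooked", "order_id": "A-1", "offset": "0:12"},  # replayed fact
]
print(len(detect_duplicate_events(stream)))  # 1
```

The offsets differ, so a technical dedupe would pass all three rows; only the business key reveals that the order was booked twice.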
Versioning and change management
Semantic contracts need semantic versioning discipline. Breaking changes should be rare and intentional. Additive evolution should be preferred. Consumer compatibility matters more here than in raw ingestion.
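A simple compatibility gate in CI can enforce “additive evolution preferred”. A sketch under the assumption that contract schemas are plain field-name-to-type mappings; real contracts would carry more structure:

```python
def classify_change(old_schema: dict, new_schema: dict) -> str:
    """Classify a contract change.

    Removing or retyping a field breaks consumers; adding a field is
    additive; anything else is unchanged. (An assumption, not a standard.)
    """
    removed = old_schema.keys() - new_schema.keys()
    retyped = {f for f in old_schema.keys() & new_schema.keys()
               if old_schema[f] != new_schema[f]}
    if removed or retyped:
        return "breaking"   # requires a major version bump and consumer migration
    if new_schema.keys() - old_schema.keys():
        return "additive"   # minor version bump; existing consumers unaffected
    return "unchanged"

v1 = {"order_id": "string", "booked_at": "timestamp"}
v2 = {"order_id": "string", "booked_at": "timestamp", "channel": "string"}
print(classify_change(v1, v2))  # additive
print(classify_change(v2, v1))  # breaking
```

Wiring a check like this into the product's release pipeline makes breaking changes rare and intentional by construction, not by policy document.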
Metadata and discoverability
Catalogs should expose:
- business glossary definitions
- owner and support contacts
- upstream sources
- downstream dependencies
- quality status
- reconciliation status
- usage guidance
- “do not use for” warnings
A catalog that only lists technical schemas is a phone book, not a platform.
Streaming and batch coexistence
Most enterprises need both. Some semantic products may be micro-batched because correctness matters more than immediacy. Others may expose near-real-time views using Kafka and stream processing. The architectural question is not “real-time or batch?” but “what latency is acceptable for this semantic contract?”
Fresh wrong data is still wrong.
Tradeoffs
This architecture is better, but not free.
Tradeoff: more modeling effort upfront
You spend more time defining bounded contexts, ownership, and contracts. This can feel slower than dumping everything into curated tables. It is slower at first. Then it is much faster when systems change.
Tradeoff: duplication across domains
Some concepts will appear in multiple semantic products with different definitions. That is not always waste. Sometimes it is the cost of preserving business meaning. Forcing convergence too early is often more expensive.
Tradeoff: federated accountability is harder
Central platform teams are easier to manage on org charts. Federated semantic ownership requires stronger product thinking and governance. Some organizations are not culturally ready.
Tradeoff: reconciliation overhead
Building controlled reconciliations takes time and operational discipline. But if the use case matters to finance, compliance, supply chain, or executive decision-making, that cost is not optional. You either pay for reconciliation explicitly or pay for mistrust indefinitely.
Tradeoff: not every data set deserves semantic product treatment
Exploratory, low-risk, or one-off analytical data may not need this level of rigor. Architecture should be selective.
Failure Modes
Even good ideas have reliable ways to fail.
1. Rebranding the curated layer as “semantic”
A team renames silver/gold assets as semantic products without changing ownership, contracts, or reconciliation. Nothing improves.
2. Creating a universal canonical model
The architecture attempts to force all domains into one enterprise ontology. Progress stalls in endless governance meetings. Teams bypass the platform.
3. Confusing event transport with semantic design
Kafka topics proliferate, but they are still low-level technical emissions. Consumers reassemble meaning themselves. The organization now has distributed ambiguity.
4. Platform team owning business semantics indefinitely
This creates a bottleneck and political friction. Platform teams should enable, not arbitrate every definition.
5. Ignoring exception workflows
Reconciliation finds mismatches, but nobody owns remediation. Exceptions pile up. Trust collapses.
6. Overengineering low-value domains
If every data asset must go through full product governance, the platform becomes bureaucratic. Selectivity matters.
A useful test is this: if the business would call a meeting when this number changes, it probably deserves explicit semantic architecture.
When Not To Use
This approach is not universal.
Do not invest heavily in domain semantic products when:
- your use case is primarily exploratory analytics on loosely governed data
- the organization lacks stable domain ownership and cannot sustain product accountability
- the data has short-lived tactical value
- consumer trust requirements are low
- your platform is at a very early maturity stage and basic ingestion reliability is still unsolved
In those cases, focus first on solid ingestion, storage, metadata, and basic curation. You can add semantic products later where value is clear.
Also, do not mistake this pattern for a license to decompose everything into tiny data products. If your domains are weakly understood, excessive fragmentation will hurt more than help. Bounded contexts need to be discovered, not guessed from org charts.
Related Patterns
This architecture sits near several familiar patterns, but it is not identical to any one of them.
Data mesh
Useful for emphasizing domain ownership and product thinking. Dangerous when interpreted as “let every team publish whatever they want.” Mesh needs strong semantic contracts and a capable platform.
Medallion architecture
Helpful as an ingestion and refinement pattern. Insufficient as a semantic architecture. Bronze/silver/gold says little about business meaning.
Canonical data model
Sometimes useful at integration boundaries. Often overused. Canonical models should be narrow and purposeful, not a universal religion.
CQRS and event sourcing
Relevant for operational systems where domain events and read models are explicit. They can inform semantic product design, especially for lifecycle-oriented domains, but most enterprises will still need reconciliation across heterogeneous systems.
Master data management
Important for reference entities and identity resolution. But MDM does not replace bounded-context semantics. Matching records is not the same as defining meaning.
Strangler fig migration
Highly relevant. It is the right migration metaphor here: create stable semantic interfaces, move consumers gradually, then replace underlying systems without breaking the world.
Summary
The lakehouse is valuable. Keep it. Invest in it. Use it for ingestion, persistence, replay, scalable processing, and broad analytical access.
But stop asking it to be the sole custodian of business meaning.
A data platform worthy of the name must distinguish data arrival from data semantics. The first is an infrastructure problem. The second is a domain design problem. Conflating them creates central bottlenecks, weak trust, brittle migrations, and endless definitional disputes.
The better architecture is domain-driven:
- ingestion fabric for movement and history
- semantic products for owned business meaning
- reconciliation as a first-class control
- Kafka and microservices used where they help, not as semantic substitutes
- progressive strangler migration to escape source-shaped coupling
The memorable line is simple because the lesson is hard-earned:
Your lakehouse can store the facts. It cannot decide what they mean.
That responsibility belongs to domains, contracts, and the architecture disciplined enough to separate plumbing from truth.
Frequently Asked Questions
What is a data mesh?
A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.
What is a data product in architecture terms?
A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.
How does data mesh relate to enterprise architecture?
Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.