Most data platforms fail in a strangely professional way.
They fail with excellent slideware, expensive tooling, and a very modern vocabulary. They fail while everyone nods at words like lakehouse, mesh, real-time, and self-service. They fail because the organization quietly makes one architectural mistake and then builds an empire on top of it: it confuses where data lands with what data means.
That confusion is not a minor modeling issue. It is the root of a lot of enterprise pain.
A lakehouse is useful. Often very useful. It can be the right place for ingestion, storage, replay, historical analysis, machine learning features, and large-scale processing. But a lakehouse is not, by itself, a data platform in the architectural sense that matters to the business. It does not define domain semantics. It does not resolve ownership. It does not tell you what a customer is, when an order is booked, whether a shipment is partial, or why revenue was recognized. It certainly does not make conflicting operational truths disappear.
Put bluntly: ingestion is plumbing; semantics is architecture.
And too many enterprises have spent the last five years perfecting the plumbing while neglecting the building.
This article makes an opinionated case: the right architecture separates ingestion infrastructure from domain semantic products. The lakehouse should be treated as a foundational data substrate, not the source of business meaning. Business meaning belongs closer to domain boundaries, explicit contracts, and governed semantic models. If you miss that distinction, you get a technically impressive swamp. If you embrace it, you get a platform that survives M&A, system replacement, audit scrutiny, and the daily malice of reality.
Context
The modern enterprise usually arrives at the lakehouse honestly.
There are too many systems. ERP, CRM, billing, WMS, e-commerce, policy admin, mobile apps, partner APIs, IoT feeds, and a growing population of SaaS platforms that each export a slightly different CSV-shaped opinion of the truth. Teams are tired of brittle nightly ETL. Data scientists want access. Finance wants consistency. Operations wants dashboards that don’t lie. Compliance wants lineage. Executives want “one version of the truth,” a phrase that has ruined more architecture discussions than almost any other.
So the organization builds a central landing zone. Then a curated layer. Then some gold tables. Then reverse ETL. Then maybe streaming ingestion through Kafka. Then a semantic layer gets mentioned, but in practice semantics are still reconstructed downstream by analytics teams with SQL, notebooks, BI tools, and tribal knowledge.
This is where things start to drift.
The lakehouse becomes the default answer to every data question. Need customer lifetime value? Put it in the lakehouse. Need regulatory reporting? Lakehouse. Need cross-channel order status? Lakehouse. Need master data resolution? Lakehouse. Need operational event history? Lakehouse. Need machine learning training data? Also lakehouse.
Soon the platform team is running a centralized factory for every unresolved business disagreement in the company.
That is not scale. That is architectural debt with parquet files.
The better way begins with a simple distinction:
- Ingestion architecture answers: how do events, files, CDC streams, APIs, and external data arrive reliably?
- Domain semantics architecture answers: what do these facts mean, who owns the meaning, how are concepts defined, and how do consumers trust the resulting products?
Those are different concerns. They can be connected. They should not be collapsed.
Problem
The core problem is that raw and curated storage layers are often asked to do work that belongs to domain design.
A lakehouse is very good at collecting data in many shapes and retaining history cheaply. It is much less good at settling semantic disputes between domains with different incentives, timing, and definitions.
Take “customer.” Sales may define customer as an account with a signed agreement. Billing may define it as an entity with an active receivable relationship. E-commerce may define it as a registered user. Service may define it as an installed base location. Compliance may define it according to legal entity hierarchy. If the architecture assumes these can be solved by simply centralizing all source records and “curating” them into a canonical table, then the platform team becomes a reluctant priesthood of business meaning.
That is how bottlenecks are born.
The symptoms are familiar:
- Hundreds of bronze/silver/gold tables with unclear ownership
- Kafka topics that mirror database tables but carry no business event meaning
- BI teams rewriting metric logic repeatedly
- “Customer 360” programs that never stabilize
- Reconciliation disputes between operational systems and analytics outputs
- Platform teams forced to interpret domain rules they do not own
- System migrations blocked because too many downstream consumers bind directly to source-shaped data
The underlying issue is not insufficient tooling. It is that the architecture has centered the integration substrate rather than the domain boundary.
This matters even more in event-driven and microservices-heavy estates. Kafka can transport facts; it cannot assign business meaning by magic. A topic named orders is not an order domain model. CDC from an order table is not an order lifecycle event stream. If you publish low-level state mutations without semantic contracts, you simply move the confusion faster.
Speed without meaning is a very efficient path to mistrust.
Forces
Several forces pull enterprises toward the wrong shape.
1. The gravitational pull of centralization
A lakehouse is visible. It has cost curves, vendors, dashboards, and platform teams. Domain semantics are messier. They require negotiation with business units, bounded contexts, ownership models, and governance that works through accountability rather than decree. Central platforms are easier to fund. Domain design is harder to fake.
2. The desire for reuse
Everyone wants a reusable canonical model. That instinct is understandable and dangerous. Shared semantics are valuable, but premature canonicalization often destroys domain nuance. You end up with a generic model that satisfies nobody and leaks source-system assumptions everywhere.
3. Migration pressure
Legacy systems are being replaced constantly: ERP modernization, CRM replatforming, commerce rebuilds, policy engine replacement, warehouse automation, core banking transformations. If consumers are tightly coupled to source-specific extracts in the lakehouse, every migration becomes a data blast radius event. The organization then realizes too late that it lacked stable semantic interfaces.
4. Real-time expectations
Streaming architecture and Kafka raise the stakes. Business leaders now expect fresh data, not just nightly snapshots. But freshness amplifies semantic ambiguity. If a “shipment delivered” event can arrive before billing closes, before returns windows start, and before partner acknowledgment, what exactly should downstream consumers believe?
5. Audit and compliance
Lineage is not enough. Auditors and regulators care about definitional integrity, controls, and reconciliation. A technically traced pipeline that moves an ambiguous metric from A to B is still ambiguous.
6. Organizational reality
The teams who understand business meaning are rarely the same teams who operate the data substrate. That separation is not a flaw. It is a fact. Good architecture reflects it.
Solution
The solution is to stop treating the lakehouse as the business brain of the enterprise.
Use it as a data substrate for ingestion, persistence, processing, replay, and analytical scale. Then build a layer of domain-aligned semantic products above and around it, with explicit ownership, contracts, and reconciliation logic. In domain-driven design terms, the key move is to separate bounded contexts from transport and storage concerns.
A sane architecture usually has three distinct concerns:
- Ingestion fabric
Handles CDC, events, files, APIs, Kafka streams, partner feeds, schema registration, retention, and observability. Its job is reliable movement and preservation of facts.
- Domain semantic products
Owned by domain teams or federated data product teams. These products define business entities, events, states, metrics, and contracts in a bounded context. They reconcile source ambiguity and expose trustworthy interfaces.
- Consumption and composition layer
BI, ML, operational analytics, regulatory reporting, data science, APIs, downstream applications. Some consumers use raw-ish historical data; many should consume semantic products instead.
That separation changes everything.
Instead of asking the lakehouse to produce “the enterprise customer,” you define domain-specific semantic products such as:
- customer billing relationship
- retail shopper profile
- service install base customer
- legal entity hierarchy
- cross-domain customer reference product, if and only if the business genuinely needs one
Notice the difference. Semantics are no longer assumed to collapse into one table. They are modeled deliberately.
A good line to remember is this: raw data should be easy to land; trusted meaning should be hard to earn.
A reference architecture
In this model, the lakehouse is essential but not sovereign. Domain products derive from it, and sometimes also directly from streams or operational stores, but they carry the business contract.
This is deeply aligned with domain-driven design:
- bounded contexts define meaning
- upstream/downstream relationships are explicit
- anti-corruption layers protect consumers from source churn
- ubiquitous language belongs in the semantic product, not buried in ingestion pipelines
A lot of so-called modern data architecture is really just integration architecture wearing analytics clothes. The fix is to bring back domain thinking.
Architecture
Let’s make this concrete.
Ingestion vs semantics
The ingestion fabric should optimize for:
- broad connectivity
- high reliability
- replayability
- immutable history where useful
- schema evolution handling
- operational metadata
- low-friction onboarding
It should not be where teams casually invent business definitions.
Semantic products should optimize for:
- explicit ownership
- business vocabulary
- versioned contracts
- reconciliation and exception handling
- quality controls tied to business meaning
- consumer trust
- resilience to source-system replacement
That means a semantic product is more than a table. It is a package:
- model definitions
- transformation logic
- business rules
- data quality assertions
- lineage
- reconciliations
- SLA/SLOs
- support model
- change policy
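One way to make the "package" idea tangible is a product manifest that travels with the semantic product in its own repository. The following is a minimal sketch, not a standard; every field name, team name, and value is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticProductManifest:
    """Metadata that ships with a semantic product, not just its tables.

    All names and values here are illustrative, not a standard.
    """
    name: str
    owner_team: str
    contract_version: str               # semver for the published schema and semantics
    sources: list[str]                  # upstream systems of record
    freshness_slo_minutes: int          # how stale the product is allowed to be
    reconciliation_controls: list[str] = field(default_factory=list)
    do_not_use_for: list[str] = field(default_factory=list)

# A hypothetical Order Lifecycle product manifest.
order_lifecycle = SemanticProductManifest(
    name="order-lifecycle",
    owner_team="commerce-domain",
    contract_version="2.1.0",
    sources=["erp_eu_orders", "ecom_orders"],
    freshness_slo_minutes=30,
    reconciliation_controls=["daily booked-order control total vs ERP"],
    do_not_use_for=["regulatory revenue reporting"],
)
print(order_lifecycle.contract_version)  # 2.1.0
```

The point is that ownership, SLOs, and "do not use for" warnings become versioned artifacts rather than tribal knowledge.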
This is where many “data mesh” conversations go wrong. They talk about ownership but skip semantics. Ownership without semantic contracts is decentralization of chaos.
Domain events are not CDC
In Kafka-centered estates, one of the most expensive mistakes is to treat CDC streams as business events. They are not the same.
CDC tells you what changed in a database row. A domain event tells you something meaningful happened in the business. Those differ in timing, granularity, and intent.
For example:
- CDC: orders.status changed from 2 to 3
  Domain event: OrderBooked
- CDC: shipment.delivery_timestamp updated
  Domain event: DeliveryConfirmedByCarrier
- CDC: invoice.paid_flag true
  Domain event: PaymentSettled
The semantic product should often translate low-level technical emissions into meaningful domain facts, preserving lineage back to raw records. That translation layer is where bounded context knowledge lives.
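In practice the translation is often a small mapping function in the product's pipeline. A sketch under invented assumptions: the status codes (2 = captured, 3 = booked), the field names, and the offset format are all hypothetical, not from any specific CDC tool:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class DomainEvent:
    """A business fact with lineage back to the raw change record."""
    event_type: str
    order_id: str
    occurred_at: datetime
    source_offset: str  # lineage pointer back to the raw CDC record

# Hypothetical mapping from low-level status transitions to domain meaning.
# This table is where bounded-context knowledge lives.
STATUS_TRANSITIONS = {
    (2, 3): "OrderBooked",
    (3, 4): "OrderAllocated",
}

def from_cdc(change: dict) -> Optional[DomainEvent]:
    """Translate a CDC row change into a domain event, or drop it
    when the transition carries no business meaning."""
    event_type = STATUS_TRANSITIONS.get(
        (change["before_status"], change["after_status"])
    )
    if event_type is None:
        return None
    return DomainEvent(
        event_type=event_type,
        order_id=change["order_id"],
        occurred_at=datetime.fromisoformat(change["commit_ts"]),
        source_offset=change["offset"],
    )

evt = from_cdc({
    "order_id": "A-1001",
    "before_status": 2,
    "after_status": 3,
    "commit_ts": "2024-05-01T10:15:00+00:00",
    "offset": "orders-cdc:0:42",
})
print(evt.event_type)  # OrderBooked
```

Note that the function deliberately returns nothing for transitions the domain does not recognize: not every row change deserves to become an event.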
Reconciliation is part of the architecture, not cleanup
Enterprises routinely under-architect reconciliation. Then they discover that every serious business process depends on it.
A robust semantic product must answer:
- How does this product reconcile to source systems of record?
- What are acceptable variances?
- How are late-arriving records handled?
- How are duplicates, reversals, cancellations, and re-statements represented?
- What is the control point for period close or audit?
- Which truth is provisional, and which is authoritative?
Reconciliation is not an afterthought. It is how trust survives asynchronous systems.
Notice the shift here: semantic products are not merely transformed data sets. They are controlled interpretations with feedback loops for exceptions.
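A control-total reconciliation can be expressed directly in the product's pipeline rather than in a quarterly spreadsheet. A minimal sketch, with the tolerance and the figures invented for illustration:

```python
def reconcile_control_total(source_total: float, product_total: float,
                            tolerance_pct: float = 0.01) -> dict:
    """Compare a source-of-record total to the semantic product's total.

    Returns a control record rather than silently passing, so out-of-tolerance
    variances can be routed to a named owner for remediation.
    """
    variance = product_total - source_total
    variance_pct = abs(variance) / source_total if source_total else 0.0
    return {
        "source_total": source_total,
        "product_total": product_total,
        "variance": variance,
        "within_tolerance": variance_pct <= tolerance_pct,
    }

# Example: daily booked-order value, ERP system of record vs the product.
control = reconcile_control_total(source_total=1_000_000.0,
                                  product_total=1_000_450.0)
print(control["within_tolerance"])  # True: 0.045% variance, under the 1% threshold
```

What the acceptable variance actually is, and who gets paged when it is breached, is a business decision the product contract must state explicitly.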
Semantic layers should be plural, not singular
There is no law of nature saying the enterprise gets exactly one semantic layer. In practice, you often need:
- domain semantic products for operational trust
- conformed analytical models for enterprise reporting
- feature-oriented abstractions for data science
- regulatory views with tightly governed definitions
Trying to flatten these into one universal model usually ends badly. Better to acknowledge multiple semantic viewpoints and manage their relationships explicitly.
Migration Strategy
Most organizations cannot stop the world and redesign their data platform from first principles. Nor should they. The right move is a progressive strangler migration.
Do not rip out the lakehouse. Reframe it.
Start by identifying where the central platform is currently serving as an accidental semantic authority. Then peel those responsibilities into domain-aligned products one by one, leaving ingestion and storage intact where they still add value.
A practical strangler path
1. Map current consumers
Identify dashboards, reports, data science assets, regulatory extracts, APIs, and operational dependencies that consume lakehouse tables directly.
2. Classify data assets
Separate:
- raw landed assets
- technical integration assets
- implicit semantic assets
- enterprise reports and metrics
This reveals where semantic logic is currently hidden.
3. Select high-value bounded contexts
Start with a domain where ambiguity is painful and ownership is clear:
- order lifecycle
- invoice and payment status
- inventory availability
- customer billing relationship
Not “enterprise customer” unless you enjoy suffering.
4. Create explicit semantic contracts
Define entities, events, state transitions, quality rules, and reconciliation controls. Version them.
5. Introduce anti-corruption layers
Shield consumers from source schemas and migration churn.
6. Run semantic products in parallel
Publish side-by-side with existing curated tables. Measure variances. Build confidence.
7. Cut over consumers gradually
Prioritize high-trust use cases first. Leave exploratory use cases on raw/curated layers longer.
8. Retire accidental semantic assets
Once consumers have migrated, demote old curated tables to technical artifacts or archive them.
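The “create explicit semantic contracts” step works best when the contract is a literal artifact, not a wiki page. The following is one possible sketch; the product name, states, events, and reconciliation rule are all hypothetical:

```python
# A versioned contract as a checked-in artifact. Everything here is
# illustrative: a real contract would be agreed with the domain owner.
ORDER_LIFECYCLE_CONTRACT = {
    "product": "order-lifecycle",
    "version": "1.0.0",
    "entities": {
        "Order": {"order_id": "string", "booked_at": "timestamp", "state": "string"},
    },
    "events": ["OrderCaptured", "OrderBooked", "OrderShipped", "OrderReturned"],
    # Legal state transitions; anything else becomes a quality exception.
    "transitions": {
        "captured": ["booked", "cancelled"],
        "booked": ["shipped", "cancelled"],
        "shipped": ["returned"],
    },
    "reconciliation": ["daily booked-order count vs ERP system of record"],
}

def is_legal_transition(contract: dict, current: str, target: str) -> bool:
    """Check a state change against the contract's transition rules."""
    return target in contract["transitions"].get(current, [])

print(is_legal_transition(ORDER_LIFECYCLE_CONTRACT, "captured", "booked"))  # True
print(is_legal_transition(ORDER_LIFECYCLE_CONTRACT, "shipped", "booked"))   # False
```

Once the contract is machine-readable, the parallel-run step can validate both the old curated tables and the new product against the same rules.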
The migration logic that matters
This strategy works because it respects operational reality:
- source systems will continue to change
- consumers cannot all migrate at once
- definitions require negotiation
- trust is earned through reconciliation, not slogans
It also preserves optionality. If you later replace ERP or CRM, consumers bound to the semantic product remain insulated. That insulation is one of the most underappreciated payoffs in enterprise architecture.
A strangler migration is boring in the right ways. It reduces risk by making meaning explicit before systems are replaced.
Enterprise Example
Consider a global manufacturer with three major business lines and a decade of acquisitions. It runs SAP for core finance in some regions, Oracle ERP in others, Salesforce for account management, a custom e-commerce stack, multiple warehouse systems, and regional billing platforms. Over time it built a large lakehouse with CDC from major systems, Kafka for application events, and a substantial BI estate.
On paper, this looked modern.
In practice, the central data team had become the referee for endless disputes:
- What counts as a booked order?
- Which customer hierarchy should revenue roll up against?
- When is inventory “available” if it is allocated but not yet picked?
- How should returns and warranty replacements affect sales metrics?
- Why do finance, supply chain, and commerce dashboards disagree?
The breaking point came during an ERP migration in Europe. Downstream reports and ML models were tightly coupled to source-shaped curated tables built from old ERP extracts. Every source field change triggered remediation across dozens of pipelines and reports. Kafka helped move more data faster, but most topics were CDC-shaped, so downstream consumers still encoded source-specific assumptions.
The architecture team changed course.
They kept the lakehouse and Kafka backbone. But they introduced domain semantic products for:
- Order Lifecycle
- Billing Relationship
- Inventory Position
- Product Commercial Hierarchy
Each product had a named owner, explicit schema contracts, business rules, reconciliation to source-of-record controls, and a support model. The Order Lifecycle product, for example, defined events such as OrderCaptured, OrderBooked, OrderAllocated, OrderShipped, DeliveryConfirmed, OrderReturned, each with rules for late-arriving changes, split shipments, cancellations, and restatements.
Importantly, the product did not pretend there was a single universal order truth. It described one bounded context for enterprise reporting and cross-channel operations, while preserving lineage to regional systems.
During the ERP migration, consumer dashboards and supply chain analytics were moved from old curated ERP-shaped tables to the new semantic product. The source mappings changed significantly behind the scenes. Most consumers did not care. That was the point.
Results after 12 months were not miraculous, but they were real:
- materially fewer downstream breaks during migration releases
- much faster reconciliation for period-end order and revenue controls
- reduced duplication of metric logic across BI teams
- clearer accountability when definitions changed
- improved trust in inventory and fulfillment reporting
The company did not achieve a metaphysical single source of truth. It achieved something better: stable, governed truths for specific purposes.
That is how grown-up enterprises work.
Operational Considerations
A semantic architecture lives or dies in operations, not just design.
Ownership model
Every semantic product needs a real owner. Not a committee. Not a mailbox. A team with authority over definitions, quality thresholds, release cadence, and consumer communication.
Platform teams own the substrate.
Domain-aligned teams own meaning.
If that line blurs, the old failure pattern returns.
Data quality as business control
Quality checks should not stop at null counts and schema drift. Those matter, but they are table stakes. You also need:
- state transition validation
- duplicate business event detection
- period completeness checks
- source-to-product control total reconciliation
- threshold alerts on business metric variances
- late-arrival and correction monitoring
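Duplicate business event detection, for instance, is not the same as row deduplication: two records can differ technically while asserting the same business fact. A minimal sketch, where the business identity rule (event type plus order id) is an assumption each product would define for itself:

```python
def detect_duplicate_events(events: list[dict]) -> list[dict]:
    """Flag events that assert the same business fact more than once.

    The business key used here (event_type, order_id) is illustrative;
    each semantic product defines its own identity rule.
    """
    seen: set[tuple[str, str]] = set()
    duplicates = []
    for event in events:
        key = (event["event_type"], event["order_id"])
        if key in seen:
            # Route to an exception workflow; do not drop silently.
            duplicates.append(event)
        else:
            seen.add(key)
    return duplicates

stream = [
    {"event_type": "OrderBooked", "order_id": "A-1", "offset": "0:10"},
    {"event_type": "OrderShipped", "order_id": "A-1", "offset": "0:11"},
    {"event_type": "OrderBooked", "order_id": "A-1", "offset": "0:12"},  # replayed fact
]
print(len(detect_duplicate_events(stream)))  # 1
```

The offsets differ, so a technical dedupe would pass all three rows; only the business key reveals that the order was booked twice.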
Versioning and change management
Semantic contracts need semantic versioning discipline. Breaking changes should be rare and intentional. Additive evolution should be preferred. Consumer compatibility matters more here than in raw ingestion.
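A simple compatibility gate in CI can enforce “additive evolution preferred”. A sketch under the assumption that contract schemas are plain field-name-to-type mappings; real contracts would carry more structure:

```python
def classify_change(old_schema: dict, new_schema: dict) -> str:
    """Classify a contract change.

    Removing or retyping a field breaks consumers; adding a field is
    additive; anything else is unchanged. (An assumption, not a standard.)
    """
    removed = old_schema.keys() - new_schema.keys()
    retyped = {f for f in old_schema.keys() & new_schema.keys()
               if old_schema[f] != new_schema[f]}
    if removed or retyped:
        return "breaking"   # requires a major version bump and consumer migration
    if new_schema.keys() - old_schema.keys():
        return "additive"   # minor version bump; existing consumers unaffected
    return "unchanged"

v1 = {"order_id": "string", "booked_at": "timestamp"}
v2 = {"order_id": "string", "booked_at": "timestamp", "channel": "string"}
print(classify_change(v1, v2))  # additive
print(classify_change(v2, v1))  # breaking
```

Wiring a check like this into the product's release pipeline makes breaking changes rare and intentional by construction, not by policy document.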
Metadata and discoverability
Catalogs should expose:
- business glossary definitions
- owner and support contacts
- upstream sources
- downstream dependencies
- quality status
- reconciliation status
- usage guidance
- “do not use for” warnings
A catalog that only lists technical schemas is a phone book, not a platform.
Streaming and batch coexistence
Most enterprises need both. Some semantic products may be micro-batched because correctness matters more than immediacy. Others may expose near-real-time views using Kafka and stream processing. The architectural question is not “real-time or batch?” but “what latency is acceptable for this semantic contract?”
Fresh wrong data is still wrong.
Tradeoffs
This architecture is better, but not free.
Tradeoff: more modeling effort upfront
You spend more time defining bounded contexts, ownership, and contracts. This can feel slower than dumping everything into curated tables. It is slower at first. Then it is much faster when systems change.
Tradeoff: duplication across domains
Some concepts will appear in multiple semantic products with different definitions. That is not always waste. Sometimes it is the cost of preserving business meaning. Forcing convergence too early is often more expensive.
Tradeoff: federated accountability is harder
Central platform teams are easier to manage on org charts. Federated semantic ownership requires stronger product thinking and governance. Some organizations are not culturally ready.
Tradeoff: reconciliation overhead
Building controlled reconciliations takes time and operational discipline. But if the use case matters to finance, compliance, supply chain, or executive decision-making, that cost is not optional. You either pay for reconciliation explicitly or pay for mistrust indefinitely.
Tradeoff: not every data set deserves semantic product treatment
Exploratory, low-risk, or one-off analytical data may not need this level of rigor. Architecture should be selective.
Failure Modes
Even good ideas have reliable ways to fail.
1. Rebranding the curated layer as “semantic”
A team renames silver/gold assets as semantic products without changing ownership, contracts, or reconciliation. Nothing improves.
2. Creating a universal canonical model
The architecture attempts to force all domains into one enterprise ontology. Progress stalls in endless governance meetings. Teams bypass the platform.
3. Confusing event transport with semantic design
Kafka topics proliferate, but they are still low-level technical emissions. Consumers reassemble meaning themselves. The organization now has distributed ambiguity.
4. Platform team owning business semantics indefinitely
This creates a bottleneck and political friction. Platform teams should enable, not arbitrate every definition.
5. Ignoring exception workflows
Reconciliation finds mismatches, but nobody owns remediation. Exceptions pile up. Trust collapses.
6. Overengineering low-value domains
If every data asset must go through full product governance, the platform becomes bureaucratic. Selectivity matters.
A useful test is this: if the business would call a meeting when this number changes, it probably deserves explicit semantic architecture.
When Not To Use
This approach is not universal.
Do not invest heavily in domain semantic products when:
- your use case is primarily exploratory analytics on loosely governed data
- the organization lacks stable domain ownership and cannot sustain product accountability
- the data has short-lived tactical value
- consumer trust requirements are low
- your platform is at a very early maturity stage and basic ingestion reliability is still unsolved
In those cases, focus first on solid ingestion, storage, metadata, and basic curation. You can add semantic products later where value is clear.
Also, do not mistake this pattern for a license to decompose everything into tiny data products. If your domains are weakly understood, excessive fragmentation will hurt more than help. Bounded contexts need to be discovered, not guessed from org charts.
Related Patterns
This architecture sits near several familiar patterns, but it is not identical to any one of them.
Data mesh
Useful for emphasizing domain ownership and product thinking. Dangerous when interpreted as “let every team publish whatever they want.” Mesh needs strong semantic contracts and a capable platform.
Medallion architecture
Helpful as an ingestion and refinement pattern. Insufficient as a semantic architecture. Bronze/silver/gold says little about business meaning.
Canonical data model
Sometimes useful at integration boundaries. Often overused. Canonical models should be narrow and purposeful, not a universal religion.
CQRS and event sourcing
Relevant for operational systems where domain events and read models are explicit. They can inform semantic product design, especially for lifecycle-oriented domains, but most enterprises will still need reconciliation across heterogeneous systems.
Master data management
Important for reference entities and identity resolution. But MDM does not replace bounded-context semantics. Matching records is not the same as defining meaning.
Strangler fig migration
Highly relevant. It is the right migration metaphor here: create stable semantic interfaces, move consumers gradually, then replace underlying systems without breaking the world.
Summary
The lakehouse is valuable. Keep it. Invest in it. Use it for ingestion, persistence, replay, scalable processing, and broad analytical access.
But stop asking it to be the sole custodian of business meaning.
A data platform worthy of the name must distinguish data arrival from data semantics. The first is an infrastructure problem. The second is a domain design problem. Conflating them creates central bottlenecks, weak trust, brittle migrations, and endless definitional disputes.
The better architecture is domain-driven:
- ingestion fabric for movement and history
- semantic products for owned business meaning
- reconciliation as a first-class control
- Kafka and microservices used where they help, not as semantic substitutes
- progressive strangler migration to escape source-shaped coupling
The memorable line is simple because the lesson is hard-earned:
Your lakehouse can store the facts. It cannot decide what they mean.
That responsibility belongs to domains, contracts, and the architecture disciplined enough to separate plumbing from truth.
Frequently Asked Questions
What is a data mesh?
A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.
What is a data product in architecture terms?
A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.
How does data mesh relate to enterprise architecture?
Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.