The Lakehouse Does Not Solve Ownership

Every few years, the data world discovers a new cathedral.

First it was the data warehouse. Then the data lake. Now the lakehouse arrives wearing both costumes and promising a truce: open formats, analytics performance, machine learning readiness, one platform for everyone. Executives hear “single source of truth” and relax. Architects hear “unified” and start drawing cleaner diagrams.

And then the old problem walks back into the room, untouched.

Who owns the customer status?

Who defines what “active policy” means?

Who is allowed to correct a shipment after invoicing?

Who decides whether revenue is recognized on order placement or fulfillment?

The lakehouse does not answer these questions. It cannot. Ownership is not a storage concern. It is a domain concern.

That distinction matters more than most architecture slide decks admit. Enterprises do not fail at data because parquet files are inefficient or because table formats are immature. They fail because multiple systems publish competing meanings for the same business fact, and nobody has the authority, language, or operating model to resolve the conflict. A lakehouse can centralize data. It cannot centralize accountability. In fact, if used carelessly, it makes ambiguity easier to scale.

This is the heart of the issue: data platforms organize bytes; domains organize meaning.

So if you are modernizing toward a lakehouse, or data mesh, or event-driven architecture, or a hybrid of all three, the architectural question is not “Where will the data live?” The better question is: Which domain owns which business facts, and how will the rest of the enterprise consume them without copying semantics into chaos?

That is the real architecture problem. The rest is implementation.

Context

The modern enterprise data landscape is a patchwork of intentions.

There are transactional systems built to run the business: policy administration, claims, ERP, CRM, billing, warehouse management, digital channels. There are operational microservices exposing APIs and emitting Kafka events. There are data pipelines pulling CDC streams into cloud storage. There is a lakehouse serving BI dashboards, feature engineering, and regulatory reporting. There might even be a data mesh initiative with domain-aligned data products.

On paper, this looks mature. In practice, the same customer appears in six places with six different states and four different definitions of “gold.” Finance trusts one version, sales trusts another, and compliance trusts neither.

The lakehouse often enters this picture as the new center of gravity. It promises ACID tables on open storage, batch and streaming convergence, scalable SQL, and support for BI and AI workloads. All good things. Real advances. Very useful.

But there is a dangerous leap in logic that happens in many enterprises:

  1. We have fragmented data.
  2. The lakehouse can consolidate fragmented data.
  3. Therefore the lakehouse can become the authoritative source of enterprise truth.

That last step is where architecture goes off the rails.

An authoritative source is not merely the place from which data is queried most often. It is the place with the legitimate right to define, create, validate, and change a business fact. That authority usually belongs to an operational domain system, not to an analytical platform. Sometimes it belongs to a carefully designed master data capability. Sometimes to a workflow engine or system of record. But rarely to a broad analytics substrate.

The lakehouse is an excellent place to integrate, reconcile, analyze, and publish curated views. It is a poor place to pretend ownership into existence.

Problem

When organizations treat the lakehouse as a cure for ownership, they create a new layer of confusion with better performance.

Here is the typical failure pattern.

A central data team ingests ERP orders, CRM accounts, e-commerce transactions, support tickets, and Kafka event streams into the lakehouse. They standardize schemas, create conformed dimensions, derive golden records, and expose polished tables to downstream users. Suddenly the lakehouse table named customer_360 becomes more trusted than any source application.

That feels like progress. Until the first hard business question arrives.

Why is a customer marked inactive in CRM but active in billing?

Why did the policy lapse date change after the monthly close?

Why does the inventory table show stock available for channels that the fulfillment domain has blocked?

Why does the lakehouse correction not flow back into the operational systems?

At that point the central team discovers the trap: they have aggregated data without owning the business process that creates it. They can detect inconsistency, but they cannot legitimately resolve it. So they start inventing rules. Those rules become undocumented policy. Soon the data platform is making business decisions that belong elsewhere.

That is architectural drift. Quiet, expensive drift.

The underlying problem has three parts:

  • Semantic collision: multiple systems represent similar concepts with different meanings.
  • Ownership ambiguity: no domain is clearly accountable for a business fact across its lifecycle.
  • Platform overreach: integration platforms, including lakehouses, begin substituting for domain governance.

This is why domain-driven design matters here. DDD is not just a modeling technique for microservices. It is a way of deciding where meaning lives. Bounded contexts are useful precisely because words lie when they travel. “Customer,” “order,” “claim,” “product,” and “status” all change shape depending on the business capability using them. Trying to flatten them prematurely into one universal enterprise schema is how programs spend millions creating elegant nonsense.

Forces

Real architecture is shaped by competing forces, not slogans.

1. Enterprises need integrated data

Executives want cross-domain insight. Regulators want traceability. Data science wants historical breadth. Operations wants near-real-time dashboards. The business has every right to expect integrated views.

2. Domains need autonomy

Operational systems cannot wait for the central platform to bless every schema change. Teams need to evolve their models with local speed, especially in microservices environments.

3. Semantics are local before they are global

A “customer” in marketing is a prospectable identity. In billing, it is a legal account party. In support, it is a service relationship. In insurance, the insured, payer, policy holder, and beneficiary may all be different people. The enterprise wants one noun. The domain gives you four.

4. The platform wants standardization

Lakehouse teams need common table formats, cataloging, governance, observability, lineage, and security controls. They cannot operate a platform where every dataset is a hand-crafted exception.

5. Reconciliation is unavoidable

No matter how elegant the target architecture, large enterprises run overlapping systems for years. Mergers, regulatory products, regional platforms, and vendor packages guarantee duplication. Reconciliation is not a temporary embarrassment. It is a first-class capability.

6. Ownership has legal and operational consequences

If a report drives revenue recognition, solvency calculations, or patient safety decisions, the owner of the underlying fact matters. This is not just a modeling issue. It is an accountability issue.

These forces do not disappear because a vendor demo showed streaming upserts into ACID tables.

Solution

The practical answer is blunt: use the lakehouse as an integration and consumption platform, not as a substitute for domain ownership.

Start with domain ownership. Then design how owned facts are published, reconciled, and consumed.

This means adopting a few opinionated rules.

Rule 1: Every business fact has a system of origin and a domain owner

Not every copy is authoritative. Not every dataset deserves write-back rights. If the order fulfillment domain owns shipment status, then all enterprise views of shipment status must trace back to that ownership, even if they are transformed in transit.

Rule 2: The lakehouse may host certified projections, not invented truth

A curated customer_360 table is fine if it is clearly a composite analytical projection with documented derivation rules and source lineage. It is dangerous if users believe it is the place where customer truth is maintained.

Rule 3: Reconciliation logic is explicit and governed

When multiple sources disagree, the resolution policy must be visible: source precedence, recency, survivorship, confidence scoring, business review workflow, and exception handling. Reconciliation is a business capability, not just ETL glue.
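To make Rule 3 concrete, here is a minimal sketch of explicit, inspectable resolution policy: source precedence first, recency as a tie-breaker. The source names, precedence values, and `resolve` helper are hypothetical illustrations, not a real reconciliation framework.

```python
# Hypothetical sketch: resolving conflicting values for one business fact
# using explicit source precedence plus recency. Names are illustrative.
from dataclasses import dataclass
from datetime import datetime

# Lower number = higher authority for this fact. This is visible policy,
# not ETL glue buried in a notebook.
SOURCE_PRECEDENCE = {"policy_admin": 0, "billing": 1, "crm": 2}

@dataclass
class FactValue:
    source: str
    value: str
    observed_at: datetime

def resolve(candidates: list[FactValue]) -> FactValue:
    """Pick a surviving value: precedence first, recency as tie-breaker."""
    return min(
        candidates,
        key=lambda f: (SOURCE_PRECEDENCE[f.source], -f.observed_at.timestamp()),
    )

conflict = [
    FactValue("crm", "inactive", datetime(2024, 3, 1)),
    FactValue("policy_admin", "in_force", datetime(2024, 2, 15)),
]
print(resolve(conflict).value)  # policy_admin wins despite the older timestamp
```

The point is not the ten lines of Python; it is that the precedence table and tie-breaking rule are named artifacts a steward can review and change.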

Rule 4: Domain semantics are preserved at boundaries

Use DDD bounded contexts to resist the urge for a universal canonical model too early. Translate between contexts deliberately. A canonical event contract can be useful, but only after you know what you are losing and why.
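A deliberate translation at a context boundary might look like the following sketch, in the spirit of an anti-corruption layer: billing's party concept is mapped into marketing's contact concept explicitly, rather than assuming a shared universal "customer". All field names are hypothetical.

```python
# Illustrative boundary translation: billing's notion of a party is mapped
# deliberately into marketing's, making the semantic loss explicit.
def billing_party_to_marketing_contact(party: dict) -> dict:
    return {
        # explicit identifier mapping across contexts
        "contact_id": party["account_party_id"],
        # billing "ACTIVE" is not the same predicate as marketing "contactable";
        # the translation rule states the difference instead of hiding it
        "contactable": party["status"] == "ACTIVE"
                       and not party["collections_blocked"],
    }

party = {"account_party_id": "P-7", "status": "ACTIVE", "collections_blocked": True}
print(billing_party_to_marketing_contact(party)["contactable"])  # False
```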

Rule 5: Publish data products from domains where possible

Domains should expose events, APIs, and analytical datasets that represent their owned semantics. The lakehouse then composes those products for cross-domain analysis. That is healthier than central teams reverse-engineering ownership from database dumps.

Architecture

A robust architecture distinguishes between ownership, integration, and consumption. These are not the same layer, and forcing them together usually creates a mess with nice dashboards.

The key is what is not shown: the lakehouse does not magically seize ownership from the operational domains. It receives facts, preserves lineage, applies transformations, and may execute reconciliation. But the accountability for the core fact remains with the domain that creates and governs it.

Domain ownership model

To make this concrete, define ownership at the level of business facts, not just applications.

This is where domain semantics become operationally important.

Take “active customer.” In one enterprise it might mean:

  • has a current contract,
  • has purchased in the last 12 months,
  • is not blocked for collections,
  • has accepted current consent terms,
  • or is eligible for support.

Those are not synonyms. They are domain-specific predicates. If the lakehouse defines one enterprise-wide active_customer_flag without preserving the underlying semantics, you have not simplified the architecture. You have hidden disagreement in a column name.

A better pattern is to publish multiple explicit states:

  • billing_active_customer
  • marketing_contactable_customer
  • service_entitled_customer
  • policy_in_force_customer

Then, if the enterprise truly needs a composite notion, define it as a governed projection with visible logic and owners.
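A governed composite projection can carry its derivation rule and owner with it. The sketch below is illustrative: the flag names come from the list above, but the function, field names, and derivation rule are assumptions, not a prescribed implementation.

```python
# Illustrative sketch: a composite "active customer" defined as a governed
# projection over explicit domain-owned predicates, with the derivation
# rule and owner attached instead of hidden in a column name.
def enterprise_active_customer(flags: dict) -> dict:
    derived = flags["billing_active_customer"] and flags["policy_in_force_customer"]
    return {
        "value": derived,
        "derivation": "billing_active_customer AND policy_in_force_customer",
        "owner": "enterprise reporting (governed projection)",
        "inputs": {k: flags[k] for k in
                   ("billing_active_customer", "policy_in_force_customer")},
    }

row = {
    "billing_active_customer": True,
    "marketing_contactable_customer": False,
    "service_entitled_customer": True,
    "policy_in_force_customer": True,
}
print(enterprise_active_customer(row)["value"])  # True
```

When a consumer disputes the composite, the `derivation` and `inputs` fields make the disagreement inspectable rather than archaeological.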

Kafka and microservices in the picture

In a microservices environment, Kafka often becomes the nervous system of operational truth propagation. That is useful, but it introduces another source of confusion: teams assume that publishing an event means they have solved integration semantics.

They have not.

An event named CustomerUpdated emitted by CRM tells you that CRM’s view changed. It does not tell you whether CRM owns all customer attributes enterprise-wide, whether downstream teams may overwrite them, or whether the event’s schema is stable enough for analytical consumers. Events propagate change. They do not erase bounded contexts.

A strong pattern is:

  • domains emit events for owned facts,
  • CDC is used where event maturity is low,
  • the lakehouse lands raw immutable history,
  • curated layers shape domain-aligned datasets,
  • reconciliation services handle overlap,
  • certified products expose agreed cross-domain views.

That architecture respects both operational reality and analytical need.
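One way to keep bounded contexts visible on the wire is an event envelope that carries ownership and contract metadata alongside the payload. The envelope fields below are assumptions for illustration, not a Kafka standard or any specific schema-registry convention.

```python
# Hedged sketch of an event envelope carrying ownership and contract
# metadata, so consumers know whose semantics they are receiving.
import json

def make_event(payload: dict) -> str:
    envelope = {
        "event_type": "CustomerContactPreferencesChanged",
        "owning_domain": "customer",   # who has final say over this fact
        "source_system": "crm",        # origin, which is not necessarily owner
        "schema_version": "2.1.0",     # compatibility contract for consumers
        "payload": payload,
    }
    return json.dumps(envelope)

event = make_event({"customer_id": "C-123", "email_opt_in": False})
print(json.loads(event)["owning_domain"])  # customer
```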

Migration Strategy

Nobody gets to this architecture in one leap. Large enterprises are already tangled. The right move is a progressive strangler migration, not a platform big bang.

The strangler pattern is usually discussed for applications, but it works equally well for data ownership. You do not replace all semantics at once. You progressively redirect authority.

Step 1: Inventory business facts, not just systems

Map the critical entities and events:

  • customer identity
  • account status
  • product eligibility
  • order acceptance
  • shipment confirmation
  • invoice issuance
  • payment allocation
  • policy inception
  • claim closure

For each, answer:

  • Who creates it?
  • Who can change it?
  • Who consumes it?
  • What are the legal or financial consequences?
  • What conflicts exist today?

This exercise is often more revealing than any data catalog.

Step 2: Mark systems of record and systems of reference

Be precise. A system of record for one fact may be a derived consumer for another. ERP may own invoicing but not customer consent. CRM may own contact preferences but not credit exposure.
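The per-fact precision of Steps 1 and 2 can be captured in something as simple as a registry keyed by business fact, not by system. The entries and system names below are hypothetical examples of the shape, not a recommendation for any particular tool.

```python
# Illustrative registry mapping business facts (not systems) to their
# system of record and accountable domain. All entries are hypothetical.
FACT_REGISTRY = {
    "invoice_issuance":      {"system_of_record": "erp",             "owner": "billing"},
    "customer_consent":      {"system_of_record": "consent_service", "owner": "customer"},
    "contact_preferences":   {"system_of_record": "crm",             "owner": "customer"},
    "shipment_confirmation": {"system_of_record": "wms",             "owner": "fulfillment"},
}

def system_of_record(fact: str) -> str:
    return FACT_REGISTRY[fact]["system_of_record"]

# ERP owns invoicing but not consent: ownership is per fact, not per system.
print(system_of_record("invoice_issuance"))  # erp
print(system_of_record("customer_consent"))  # consent_service
```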

Step 3: Land everything in the lakehouse, but preserve raw lineage

Do not start by over-modeling. Ingest source-aligned raw data with timestamps, provenance, and versioning. If using Kafka, retain event history. If using CDC, keep source transaction metadata. You will need this when reconciliation disputes begin.

Step 4: Build curated domain-aligned zones

Instead of one giant enterprise model, create curated datasets aligned to bounded contexts:

  • customer domain tables
  • billing domain tables
  • fulfillment domain tables
  • risk domain tables

Let each domain’s semantics remain legible.

Step 5: Introduce reconciliation as a named service

This is the turning point. Reconciliation should not be hidden in random SQL notebooks. It should be designed.

The reconciliation service may be a set of pipelines plus workflow, or a dedicated MDM-style capability, or a rules engine combined with stewardship. The implementation varies. The architectural point does not: conflict resolution must be explicit.

Step 6: Move consumers from source-specific extracts to certified products

Do this gradually. Start with reporting and analytical use cases. Then selected operational read models. Avoid write-back until ownership and governance are mature.

Step 7: Reduce semantic duplication in operational architecture

As you modernize domains and microservices, move business fact ownership closer to the teams that execute the process. Retire redundant source systems where feasible. The best reconciliation engine is the one you eventually need less often.

This is why migration strategy matters. A lakehouse implementation without ownership migration is just centralization. A strangler migration gradually makes the central platform cleaner because it narrows ambiguity over time.

Enterprise Example

Consider a multinational insurer.

It has:

  • a regional policy administration platform in Europe,
  • a separate legacy policy engine in North America,
  • Salesforce for broker and customer interactions,
  • a billing platform shared across products,
  • claims systems by line of business,
  • Kafka for digital channel events,
  • and a cloud lakehouse for analytics, actuarial models, and regulatory reporting.

Leadership wants an enterprise customer view and a policy profitability model. The initial plan is classic: ingest everything into the lakehouse, standardize the policy and customer schemas, and let the central data team create the golden truth.

That works for six months.

Then the cracks show.

The “policy active” flag in the lakehouse is driven by premium payment status because billing data is easiest to standardize. But the policy administration domain says a policy can be legally in force during grace periods even with outstanding payment. Claims uses legal coverage dates. Finance uses earned premium dates. Compliance uses product-specific cancellation rules that vary by jurisdiction.

One flag. Four meanings. Expensive confusion.

The architecture team resets the model.

They define ownership as follows:

  • Policy Domain owns policy lifecycle state, coverage dates, endorsements, cancellations.
  • Billing Domain owns invoices, receivables, arrears, payment allocation.
  • Claims Domain owns claim status and reserve changes.
  • Customer Domain owns party identity, consent, and contact preferences.
  • Lakehouse owns analytical projections, profitability measures, and reconciled enterprise reporting views.

Kafka events from digital channels and policy systems feed the lakehouse. CDC brings in legacy systems where events are absent. Raw data is preserved by source. Curated zones are built per domain. Reconciliation logic matches parties across regional systems and resolves party identity conflicts using survivorship plus stewardship workflow. The lakehouse publishes several explicit projections:

  • policy_in_force_view
  • billing_delinquency_view
  • enterprise_customer_party_view
  • regulatory_exposure_view

For executive reporting, it also publishes a composite customer_policy_health_view, but with lineage and definitions attached.

The business result is not merely cleaner data. It is cleaner argument. When disputes arise, teams know where to go. The platform stopped pretending to own semantics it did not control.

That is architecture doing its job.

Operational Considerations

Once ownership is clear, operations become more manageable, though never trivial.

Metadata and lineage

You need column-level lineage for critical facts, especially regulated reporting. Every certified dataset should answer:

  • which sources contributed,
  • which transformation rules applied,
  • whether reconciliation occurred,
  • what confidence or match score exists,
  • and who approved exceptions.

Data contracts

For Kafka topics, APIs, and published data products, establish contracts with schema compatibility rules, deprecation policies, and ownership metadata. A contract is not just fields and types. It should include semantic notes: what the event means, who owns it, and what it must not be used for.
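A contract that includes semantics and usage restrictions can be expressed as plain data. The structure below is illustrative only; it is not tied to any specific contract framework, and the field names and policies are assumptions.

```python
# Hedged example of a data contract as plain data: fields and types plus
# semantic notes, ownership, and explicit usage restrictions.
CONTRACT = {
    "name": "billing.invoice_issued",
    "owner": "billing-domain-team",
    "compatibility": "BACKWARD",            # schema evolution rule for consumers
    "deprecation_policy": "90 days notice",
    "fields": {
        "invoice_id": {"type": "string",  "semantics": "immutable billing identifier"},
        "amount":     {"type": "decimal", "semantics": "gross amount including tax"},
    },
    # semantic guardrail: what this event must NOT be used for
    "must_not_be_used_for": ["revenue recognition timing"],
}

print(CONTRACT["must_not_be_used_for"][0])  # revenue recognition timing
```

The `must_not_be_used_for` entry is the part most contracts omit, and the part that prevents the quiet misuse described above.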

Observability

Monitor:

  • freshness,
  • completeness,
  • schema drift,
  • reconciliation backlog,
  • exception rates,
  • source conflict frequency,
  • and data product consumption.

A platform may be “green” on pipeline health while silently publishing semantically wrong views. Technical observability without business observability is half a dashboard.
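A sketch of what business observability adds on top of pipeline health: checks on freshness and on semantic conflict rates, not just job success. The thresholds, function, and metric names are assumptions for illustration.

```python
# Sketch of a business-level health check layered over pipeline health.
# A dataset can be technically "green" yet stale or semantically conflicted.
from datetime import datetime, timedelta, timezone

def dataset_health(last_refresh: datetime, conflict_rate: float) -> list[str]:
    alerts = []
    if datetime.now(timezone.utc) - last_refresh > timedelta(hours=6):
        alerts.append("freshness: dataset older than SLA")
    if conflict_rate > 0.01:
        alerts.append("semantics: source conflict rate above 1%")
    return alerts

stale = datetime.now(timezone.utc) - timedelta(hours=12)
print(dataset_health(stale, conflict_rate=0.05))
# -> a freshness alert and a semantics alert
```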

Security and access

Ownership also informs access control. Consent attributes owned by the customer domain, payment data owned by billing, and health or claims data owned by line-of-business domains may each require different controls in the lakehouse. One platform does not mean one access policy.

Stewardship

Some conflicts cannot be auto-resolved. Human review is part of the design. If your architecture assumes perfect machine matching across decades of merged enterprise data, it is a fantasy architecture.

Tradeoffs

There is no free lunch here. Anyone promising one is usually selling software.

Benefit: clearer accountability

The main gain is that semantic authority is traceable. Cross-domain reporting improves because consumers know the provenance and legitimacy of each fact.

Cost: more explicit modeling

You must do the hard work of mapping domains, defining bounded contexts, and documenting ownership. This is slower than dumping everything into one giant analytics model.

Benefit: safer evolution

Domains can change internals while preserving published contracts. The lakehouse can absorb change through raw landing and curated transformation layers.

Cost: reconciliation overhead

Explicit reconciliation capabilities require rules, workflows, stewardship, and operational discipline. This is real effort. But the hidden version of the same effort is worse.

Benefit: better migration path

A strangler approach supports coexistence across legacy and modern systems. You can modernize semantics incrementally.

Cost: less seductive simplicity

Executives often want one dashboard, one truth, one owner. The architecture answer is more nuanced: one enterprise platform, many domain owners, and governed composite views. That is harder to pitch, but truer.

Failure Modes

The common failure modes are painfully consistent.

1. Central team becomes shadow business owner

The lakehouse team starts deciding what “active,” “valid,” or “final” means because nobody else will. Short-term delivery improves. Long-term governance collapses.

2. Canonical model becomes semantic concrete

An enterprise-wide schema gets frozen too early and prevents domains from expressing necessary nuance. Teams then work around it with side fields, overloaded enums, and undocumented transformations.

3. Reconciliation is buried in ETL scripts

Rules become impossible to audit, reason about, or change. Every discrepancy becomes a forensic exercise in SQL archaeology.

4. Event-driven architecture is mistaken for ownership architecture

Kafka topics multiply, but no one can say which domain has final say over a disputed fact.

5. Analytical projections are used operationally without controls

A curated lakehouse table starts feeding transactional decisions. Now stale or composite data influences real-time operations without proper authority or latency guarantees.

6. Write-back fantasies

Someone proposes updating source systems from reconciled lakehouse golden records. Sometimes this is valid. Often it is a shortcut around proper master data and workflow design. If you do write back, the governance burden goes up sharply.

When Not To Use

This approach is not universal.

Do not over-engineer domain ownership architecture if:

  • you are a small organization with one operational platform and little semantic ambiguity,
  • your primary need is simple analytical consolidation with low regulatory impact,
  • the source system already cleanly owns the facts and there are few overlapping domains,
  • or the data platform is strictly read-only reporting with no need for certified cross-domain decisions.

Likewise, do not force heavy DDD language onto every reporting project. Bounded contexts are powerful when semantic collision is real. They are unnecessary ceremony when it is not.

And if your organization lacks the appetite to assign clear business ownership, a sophisticated reconciliation-and-data-product architecture may simply expose political problems faster. Useful, yes. Comfortable, no.

Several adjacent patterns matter here.

Data mesh

Useful when interpreted correctly: domain-oriented data products, federated governance, self-serve platform. Misleading when treated as permission for every team to publish whatever they want without semantic discipline.

Master data management

Still relevant. Especially for party, product, supplier, and location identity. MDM is not obsolete because the storage substrate changed. If anything, lakehouse architectures often make MDM-style reconciliation more important.

CQRS and read models

Good fit for publishing analytical or operational read-side projections derived from domain-owned events. But remember: read models are consumers of truth, not originators of it.

Event sourcing

Helpful in domains where temporal history and auditability matter. But event sourcing does not remove the need to define bounded contexts or ownership boundaries.

Strangler fig migration

Essential for gradually redirecting ownership and retiring semantic duplication. The enterprise rarely gets a clean slate.

Summary

The lakehouse is a strong architectural component. It can unify storage patterns, support batch and streaming, improve data governance, and accelerate analytics and machine learning. It is worth using.

But it does not solve ownership.

Ownership lives where business facts are created, validated, and changed under accountable domain authority. That is a domain-driven design problem before it is a platform problem. If you ignore that, the lakehouse becomes a beautifully managed warehouse of unresolved arguments.

The architecture that works is less glamorous and more durable:

  • define domain ownership of business facts,
  • preserve domain semantics instead of flattening them prematurely,
  • use Kafka, CDC, APIs, and data products to publish owned data,
  • land raw history with lineage in the lakehouse,
  • build curated and certified projections for enterprise consumption,
  • make reconciliation explicit,
  • migrate progressively with a strangler strategy,
  • and never confuse popularity of access with legitimacy of ownership.

A data platform can be central without being sovereign.

That is the line worth remembering.
