Most enterprises don’t have a data architecture problem. They have a meaning problem wearing a data platform badge.
Someone buys a lake, or a mesh, or a streaming platform, or all three in a single ambitious quarter. A few teams stand up Kafka. Another team provisions object storage and calls it the foundation. There is a heroic slide with arrows, boxes, and a blue cylinder at the center. Executives nod because it looks modern. The vendors are delighted. Six months later, the company has more data than ever and less shared understanding than before.
This is not unusual. It is almost the default.
A lake is a place to put things. Architecture is the set of decisions that determines how the business changes safely. Those are not the same thing. One is storage. The other is responsibility, meaning, control, and flow. Confusing the two is one of the most expensive category errors in enterprise technology.
The practical fault line usually appears here: ingestion pipelines are mistaken for domain pipelines.
Ingestion is about capture. It moves bytes from where they are produced into some centralized substrate, often quickly and with minimal judgment. Domain pipelines are different. They express business semantics. They apply bounded context rules. They maintain identity, lineage, quality, and state transitions that the business actually cares about. They answer questions such as: what does “customer” mean here? when is an order committed? which system is allowed to declare a payment settled? how do we reconcile disagreements between systems of record?
If you build only ingestion, you get a bigger attic. If you build domain pipelines, you get an operating model.
That distinction matters even more in enterprises with microservices, event-driven systems, and Kafka-heavy integration. The temptation is to think that because events are flowing, architecture is happening. But raw motion is not design. A fire hose is not a supply chain.
This article makes a blunt argument: the lake is not the architecture. The architecture lives in the domain semantics layered over ingestion, in the contracts between bounded contexts, in the reconciliation logic, in the progressive migration strategy, and in the operational discipline that keeps data and events trustworthy under failure.
Context
Most large organizations got to their current state honestly. They accumulated systems over years: ERP, CRM, warehouse management, billing, e-commerce, claims, policy administration, manufacturing execution, customer support, and a scattering of bespoke applications built during urgent moments that somehow became permanent.
Then came three converging pressures.
First, business leaders wanted integrated reporting and machine learning. They needed a single view of customer, product, supplier, order, policy, or patient. The existing estate offered only fragments.
Second, product teams wanted autonomy. Microservices promised independent delivery, local ownership, and faster change. Kafka promised decoupled integration, event streams, and near-real-time reaction.
Third, regulators, auditors, and operations teams demanded more traceability, not less. “Where did this number come from?” became a first-class architecture question.
So enterprises did what enterprises do: they added a lake to collect data, then added stream processing to move faster, then added governance to recover from the first two.
None of those decisions are wrong in isolation. But they become dangerous when the lake is cast as the center of architecture rather than a component in a larger set of domain decisions.
A good architecture starts with business capabilities and bounded contexts. It identifies where domain truth is created, where it is merely copied, where translation is necessary, and where reconciliation is unavoidable. It understands that ingestion is a technical concern in service of a semantic model. Not the other way around.
That sounds obvious. It rarely survives budgeting season.
Problem
The common anti-pattern looks like this:
- Every source system publishes extracts or CDC streams.
- Everything lands in a lake or streaming backbone.
- A central team normalizes records into broad enterprise schemas.
- Downstream consumers are expected to derive business meaning from those centralized feeds.
This feels efficient. It is also where architecture quietly leaks away.
Why? Because enterprise entities are not universal facts. They are context-bound concepts.
A “customer” in billing is the party responsible for payment. In CRM, it may be a prospect, household, or account hierarchy. In shipping, it may be a delivery recipient. In compliance, it may be the legally accountable person or organization. These are related concepts, not the same concept with slightly different columns.
When a central ingestion pipeline flattens them into one enterprise customer schema too early, it creates a semantic fiction. That fiction is then embedded into dashboards, APIs, ML features, downstream marts, and operational logic. Soon everyone argues over the definition of customer, but the real issue is architectural: the pipeline collapsed bounded contexts before the business was ready to make those distinctions explicit.
The same happens with orders, products, inventory, claims, payments, and policies. The central platform becomes a semantic battlefield.
Worse, ingestion-oriented thinking optimizes the wrong things. It celebrates throughput, freshness, and schema standardization, while neglecting:
- source-of-truth boundaries
- lifecycle state transitions
- idempotency across domains
- reconciliation of conflicting records
- temporal correctness
- legal and audit obligations
- ownership of domain contracts
The result is a platform full of copied data and no trustworthy business narrative.
You can spot this failure mode in language. Teams say “the data is in the lake” as if location implies readiness. It does not. Data in a lake is often just data waiting for a real model.
Forces
Good architecture is forged by opposing forces, not slogans.
Speed versus meaning
Ingestion wants to move fast. Domain modeling wants to move carefully. Raw feeds can often be onboarded in days. Agreeing what a “fulfilled order” means across channels may take months. The enterprise must do both without pretending they are the same activity.
Centralization versus autonomy
A central platform team can standardize ingestion, security, observability, and tooling. That is useful. But domain semantics belong with domain teams. If central teams define the business meaning of every dataset, they become a bottleneck and usually get it wrong.
Reuse versus bounded context integrity
Executives love reuse. Architects should be more suspicious. Shared models reduce duplication, but they also spread coupling. The trick is to reuse infrastructure and platform patterns while preserving semantic boundaries.
Event-driven flow versus historical correctness
Kafka and microservices are excellent for propagating change. They are not magic. Late events, duplicates, out-of-order delivery, schema drift, and compensating business actions are facts of life. Enterprises need replay, temporal queries, and reconciliation, not just “real-time.”
Analytical convenience versus operational truth
Analytical consumers often want broad denormalized datasets. Operational systems require precise state machines and authority boundaries. Trying to serve both from one generic enterprise feed usually satisfies neither.
Migration urgency versus business continuity
No one gets to rebuild a global enterprise from scratch. The architecture must coexist with legacy systems, overlapping truths, contractual integrations, and reporting deadlines. This is where progressive strangler migration matters.
Solution
The answer is not “don’t build a lake.” The answer is to put the lake in its proper place.
Use a two-layer pipeline model:
- Ingestion pipelines capture and preserve source data with minimal semantic transformation.
- Domain pipelines transform, validate, reconcile, and publish business-meaningful data products inside bounded contexts.
That distinction sounds tidy. In practice, it changes everything.
Ingestion pipelines should answer:
- How do we capture data reliably from source systems?
- How do we preserve lineage, timestamps, source identifiers, and raw payloads?
- How do we detect schema changes and ingestion failures?
- How do we replay data safely?
Domain pipelines should answer:
- What business concept does this represent in this bounded context?
- Which system is authoritative for which attributes and state transitions?
- How do we handle duplicates, late arrivals, and conflicting updates?
- What is the identity model?
- What reconciliation process closes the gap between systems?
- What contracts do we publish to other domains?
That means the lake or streaming backbone becomes a substrate, not the architecture itself. The architecture emerges through domain-aligned processing and publication.
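The two-layer split can be sketched in miniature. This is an illustration, not a framework: the envelope fields, the bounded-context rule in `to_order_domain`, and the source-system names are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone

# Layer 1: ingestion preserves evidence -- raw payload, source, lineage.
@dataclass(frozen=True)
class RawRecord:
    source_system: str   # where the bytes came from
    captured_at: str     # ingestion timestamp (processing time, not business time)
    payload: str         # untouched source payload, replay-safe

def ingest(source_system: str, payload: dict) -> RawRecord:
    """Capture with minimal judgment: no renaming, no merging, no 'enterprise truth'."""
    return RawRecord(
        source_system=source_system,
        captured_at=datetime.now(timezone.utc).isoformat(),
        payload=json.dumps(payload),
    )

# Layer 2: a domain pipeline interprets raw records inside ONE bounded context.
def to_order_domain(raw: RawRecord) -> dict:
    """Apply Order-context semantics: identity, state, and authority rules."""
    data = json.loads(raw.payload)
    # Bounded-context rule (assumed for illustration): only the OMS may
    # declare an order committed; other sources yield an 'observed' state.
    state = "committed" if raw.source_system == "oms" and data.get("confirmed") else "observed"
    return {
        "order_id": data["order_id"],   # stable business identity
        "state": state,
        "source": raw.source_system,    # lineage survives interpretation
    }

raw = ingest("oms", {"order_id": "O-42", "confirmed": True})
order = to_order_domain(raw)
```

The point of the sketch is the asymmetry: the first layer has no opinion about meaning, the second layer is nothing but opinion, scoped to one context.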
A strong enterprise design usually has these characteristics:
- raw immutable landing zones or append-only event streams
- clear separation between source capture and semantic transformation
- bounded-context domain datasets or topics
- explicit ownership per domain product
- canonical models used sparingly and only where there is stable shared language
- reconciliation services for cross-system consistency
- auditability and replay as first-class concerns
- a migration path that gradually shifts consumers from legacy extracts to domain products
This is domain-driven design applied to data and integration, not just to application code.
The bounded context remains the unit of semantic integrity. If that line blurs, so does everything downstream.
Architecture
Let’s make the distinction concrete.
Ingestion versus domain pipeline
The key point is that ingestion preserves. Domain pipelines interpret.
Raw ingestion should not prematurely collapse records into an “enterprise truth.” It should retain source fidelity so domain teams can reason from evidence. This also makes replay and forensic analysis possible when business rules change.
Then each domain pipeline takes responsibility for turning raw records into business-meaningful products. For example:
- Customer domain resolves identity, survivorship, consent, segmentation, and lifecycle state.
- Order domain models order placement, amendment, fulfillment, cancellation, and channel-specific semantics.
- Payment domain tracks authorization, capture, settlement, reversal, chargeback, and ledger implications.
- Inventory domain distinguishes on-hand, reserved, available-to-promise, in-transit, and quarantined states.
Notice how these are not generic data transformations. They are business semantics.
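These lifecycles are precisely what a domain pipeline guards. As a sketch, the order lifecycle named above can be made explicit as a state machine; the allowed transitions here are assumptions for illustration, and a real Order context would define its own.

```python
# Allowed order-state transitions, made explicit instead of implied by data.
ORDER_TRANSITIONS = {
    "placed": {"amended", "partially_fulfilled", "cancelled"},
    "amended": {"partially_fulfilled", "cancelled"},
    "partially_fulfilled": {"fulfilled", "cancelled"},
    "fulfilled": set(),    # terminal
    "cancelled": set(),    # terminal
}

def transition(current: str, target: str) -> str:
    """Reject transitions the business lifecycle does not allow."""
    if target not in ORDER_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

state = "placed"
state = transition(state, "amended")
state = transition(state, "partially_fulfilled")
# transition("fulfilled", "amended") would raise: the lifecycle forbids it
```

A generic ETL job would happily write any status string; the domain pipeline refuses states the business cannot actually be in.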
Domain contracts and bounded contexts
A healthy enterprise architecture does not force every team to consume raw topics or lake tables. Domain teams publish curated contracts.
This arrangement matters because domain events and data products are contracts. They should represent stable business facts meaningful to consumers, not raw implementation leaks from source applications.
A raw SAP table extract is not a domain contract.
A PaymentSettled event with business identifiers, settlement amount, settlement date, and authority source may be.
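What makes such an event a contract is that every field carries business meaning a consumer can rely on. A hedged sketch of what that might look like; the field names, minor-unit convention, and versioning scheme are assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PaymentSettled:
    """Domain event contract: stable business facts, no source-table leakage."""
    event_version: int            # compatibility-managed, never casually broken
    payment_id: str               # business identifier, not a source row key
    order_id: str                 # cross-context correlation key
    settlement_amount_minor: int  # amount in minor units, avoiding float drift
    currency: str
    settlement_date: str          # ISO date: business time, not processing time
    authority_source: str         # which system may declare settlement

event = PaymentSettled(
    event_version=1,
    payment_id="PAY-9001",
    order_id="O-42",
    settlement_amount_minor=12999,
    currency="EUR",
    settlement_date="2024-03-01",
    authority_source="ledger",
)
```

Contrast each field with a raw SAP extract column: nothing here names a table, a technical key, or an internal status code.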
Reconciliation is architecture, not cleanup
Many programs treat reconciliation as a downstream data quality activity. That is a mistake. Reconciliation is how enterprises survive multiple systems asserting partial truth.
When one system says an order is shipped, another says invoiced, and a third says returned, the gap is not just dirty data. It reflects asynchronous processes, authority boundaries, and business lag. A serious architecture makes reconciliation explicit.
Reconciliation should define:
- comparison keys and identity strategy
- timing windows
- source authority by attribute and state
- tolerance thresholds
- exception handling workflows
- replay and restatement procedures
If you skip this, your architecture will lie politely until quarter-end, when finance and operations discover they have different numbers and no common explanation.
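The core of a reconciliation service built on those definitions can be surprisingly small. A sketch, assuming an order identifier as the comparison key and amounts in minor units with a configurable tolerance (all assumptions for the example):

```python
def reconcile(fulfillment: dict, invoicing: dict, tolerance_minor: int = 0) -> list:
    """Compare two systems' views by business key; emit exceptions, not silence."""
    exceptions = []
    for order_id in fulfillment.keys() | invoicing.keys():
        f = fulfillment.get(order_id)
        i = invoicing.get(order_id)
        if f is None or i is None:
            exceptions.append({"order_id": order_id, "kind": "missing_in_one_system"})
        elif abs(f - i) > tolerance_minor:
            exceptions.append({"order_id": order_id, "kind": "amount_mismatch",
                               "fulfilled": f, "invoiced": i})
    return exceptions

# Fulfillment and invoicing amounts in minor units, keyed by order.
exceptions = reconcile(
    fulfillment={"O-1": 10000, "O-2": 5000},
    invoicing={"O-1": 10000, "O-2": 5200, "O-3": 700},
)
# Two exceptions: O-2 disagrees on amount, O-3 is missing from fulfillment.
```

The hard part is not the comparison loop; it is everything around it in the list above, especially deciding which source is authoritative for each exception.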
Migration Strategy
This is where most architecture articles become dreamy. Real enterprises do not replace sprawling estates with one elegant domain model in a fiscal year. They migrate under pressure, with conflicting priorities and live business risk.
The right strategy is usually a progressive strangler migration.
Start by separating concerns:
- establish standardized ingestion from legacy systems
- preserve raw history and identifiers
- build one domain pipeline where business pain is highest
- publish domain contracts
- move selected consumers from raw or legacy feeds to those contracts
- run reconciliation in parallel
- gradually retire old extracts and point-to-point interfaces
The migration is not from old technology to new technology. It is from opaque integration to explicit domain semantics.
A practical sequence often looks like this:
Phase 1: Stabilize ingestion
Introduce CDC, API capture, or batch ingestion into raw topics and object storage. Add lineage, metadata, schema compatibility checks, and replay capability. Do not over-model yet. The goal is reliable evidence.
Phase 2: Pick one bounded context
Choose a domain with visible business value and manageable boundaries: customer identity, order lifecycle, inventory availability, or payment status are common candidates. Avoid starting with a concept everyone fights over politically unless executive sponsorship is unusually strong.
Phase 3: Build a domain pipeline and contract
Model the business lifecycle. Define authoritative sources. Implement identity and state transitions. Publish domain events or datasets with clear ownership and SLA. This is the moment where architecture becomes legible.
Phase 4: Reconcile against legacy truth
Do not cut over immediately. Compare outputs to existing reports and operational systems. Quantify gaps. Many migration failures happen because teams assume semantic equivalence where none exists.
Phase 5: Strangle consumers
Move downstream systems one by one:
- reporting marts
- customer service dashboards
- digital channels
- compliance extracts
- ML feature pipelines
Each move should reduce dependency on raw ingestion or old bespoke integrations.
Phase 6: Retire old pathways deliberately
Turn off only what you can observe. Keep replay and backfill plans ready. Legacy interfaces often support hidden users no one documented.
This migration path respects business continuity. It also forces clarity. You cannot publish stable domain contracts without deciding what the domain means.
Enterprise Example
Consider a global retailer with e-commerce, stores, and wholesale channels. It has SAP for finance, a cloud CRM, a separate order management system for online sales, store systems acquired through mergers, and a warehouse platform that emits events into Kafka.
Leadership funds a “unified retail data lake” to create one view of customer and order.
The first program wave does what many programs do:
- ingests SAP, CRM, OMS, WMS, and point-of-sale data
- lands everything in a lake
- creates broad customer_master and order_master tables
- publishes these tables as enterprise assets
Initially this looks successful. Data volumes are high, and many teams can query a central repository. But the cracks appear quickly.
The CRM customer includes prospects with no transactions.
The finance customer is an account structure tied to invoicing.
The store systems identify customers inconsistently.
E-commerce has guest checkouts and household profiles.
Marketing wants households.
Fraud wants cardholder patterns.
Compliance wants legally identifiable subjects.
The “customer_master” table turns into a compromise document encoded as SQL. Everyone uses it. No one trusts it.
Orders are worse. The e-commerce OMS treats order amendments as versioned changes. Store systems treat returns as separate transactions. Finance posts invoice facts later. Fulfillment has partial shipment events. The central order_master table cannot model the actual lifecycle, so it invents statuses that satisfy reports but not operations.
At this point the retailer has a lake, but not an architecture.
A better move is to redraw around domains.
- Customer domain pipeline resolves party identity and consent using explicit survivorship rules, not universal semantics. It publishes a domain contract for customer interaction and service use cases.
- Order domain pipeline models order lifecycle independently from finance posting. It emits meaningful events such as OrderPlaced, OrderAmended, OrderPartiallyFulfilled, and OrderCancelled.
- Payment domain pipeline tracks authorizations, captures, settlements, refunds, and chargebacks.
- Reconciliation service compares fulfillment, invoicing, and settlement to produce reconciled commercial state and exceptions.
Kafka remains relevant, but in the right place. Raw operational events flow through it. Domain services consume and publish curated events. The lake still stores raw and conformed history for analytics and replay. But the business no longer pretends that one giant central table defines reality.
Over time, customer support applications switch from querying stitched lake tables to using customer and order domain APIs. Finance reporting consumes reconciled order-payment facts rather than joining raw operational feeds. Legacy nightly extracts are retired gradually. Mismatch exceptions become visible operational work, not hidden report drift.
This is the difference between integration plumbing and architecture.
Operational Considerations
Architects who ignore operations are just drawing expensive aspirations.
A domain-oriented pipeline architecture needs disciplined runtime behavior.
Data and event observability
Track freshness, completeness, schema evolution, duplication rates, lag, reconciliation exceptions, and consumer SLA breaches. The question is not merely “did the job run?” but “is the business signal trustworthy?”
Idempotency and replay
Kafka and distributed pipelines deliver the same lesson repeatedly: duplicates happen, retries happen, and reprocessing happens. Domain pipelines must be idempotent by design. Use stable business keys, event versioning, and replay-safe handlers.
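A minimal sketch of a replay-safe handler, keyed on a stable business key plus event version. The in-memory set is a stand-in for a durable deduplication store; a production pipeline would persist it.

```python
processed: set = set()   # stand-in for a durable dedup store
balance = 0

def handle_settlement(payment_id: str, version: int, amount_minor: int) -> bool:
    """Apply each (payment_id, version) at most once; replays become no-ops."""
    global balance
    key = (payment_id, version)
    if key in processed:
        return False          # duplicate or replay: safely ignored
    processed.add(key)
    balance += amount_minor
    return True

handle_settlement("PAY-1", 1, 500)
handle_settlement("PAY-1", 1, 500)   # redelivery: no double-count
handle_settlement("PAY-2", 1, 300)
# balance is 800, not 1300
```

Because the handler is a no-op on replay, reprocessing an entire topic after a rule change becomes an operational routine instead of a crisis.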
Temporal modeling
Many enterprise disputes are really time disputes. Which state was true at 10:03 when the invoice posted? Architecture needs effective dates, processing dates, and event times. A “current snapshot” is useful, but insufficient.
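The "which state was true at 10:03" question is answerable only if records carry business-effective time alongside processing time. A lookup sketch over such a history; the record layout and timestamps are assumptions for the example:

```python
from bisect import bisect_right

# State history with effective time (when it became true in the business)
# and recorded time (when the pipeline learned it). Sorted by effective time.
history = [
    {"effective": "2024-03-01T09:55", "recorded": "2024-03-01T09:56", "state": "authorized"},
    {"effective": "2024-03-01T10:01", "recorded": "2024-03-01T10:05", "state": "captured"},
    {"effective": "2024-03-01T10:10", "recorded": "2024-03-01T10:11", "state": "settled"},
]

def state_as_of(effective_time: str) -> str:
    """Return the state that was effective at a given business time."""
    idx = bisect_right([h["effective"] for h in history], effective_time)
    if idx == 0:
        raise LookupError("no state recorded before that time")
    return history[idx - 1]["state"]

# At 10:03 the business state was 'captured', even though the pipeline only
# recorded that fact at 10:05 -- a current snapshot cannot answer this.
```

Keeping both time axes is what lets restatements explain themselves: the business fact did not change, only when the platform learned it.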
Schema governance
Raw ingestion can tolerate drift better than domain contracts can. Consumer-facing events and datasets need compatibility rules, versioning, and change review. Breaking domain contracts casually is one of the fastest ways to lose platform trust.
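Compatibility rules for consumer-facing contracts can be checked mechanically before publication. A sketch of one common rule, backward compatibility: a new version may add fields but must not remove or retype existing ones. The flat name-to-type schema shape is an assumption for illustration; schema registries formalize this far more thoroughly.

```python
def backward_violations(old: dict, new: dict) -> list:
    """Return violations if 'new' would break consumers written against 'old'.

    Rule sketched here: every existing field must survive with the same type.
    """
    violations = []
    for field, ftype in old.items():
        if field not in new:
            violations.append(f"removed field: {field}")
        elif new[field] != ftype:
            violations.append(f"retyped field: {field} ({ftype} -> {new[field]})")
    return violations

old_contract = {"payment_id": "string", "amount_minor": "int", "currency": "string"}
ok_change = {**old_contract, "settlement_date": "string"}        # additive: allowed
bad_change = {"payment_id": "string", "amount_minor": "string"}  # retype + removal

violations = backward_violations(old_contract, bad_change)
```

A change-review gate that runs a check like this turns "don't break contracts" from an appeal to discipline into an enforced property.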
Security and policy boundaries
Raw zones often contain sensitive data far broader than most consumers need. Domain products should publish the minimum viable semantic contract and enforce policy segregation. This is especially important for customer, healthcare, HR, and financial domains.
Exception operations
Reconciliation exceptions need owners, queues, escalation rules, and root-cause categorization. If exceptions disappear into a dashboard no one watches, the architecture becomes ceremonial.
Tradeoffs
There is no free architecture. There is only informed compromise.
What you gain
- clearer domain ownership
- more trustworthy business semantics
- better support for microservices and event-driven integration
- easier migration from legacy systems
- improved auditability and replay
- reduced coupling between source system structure and consumer usage
What you pay
- more design effort up front
- domain modeling work that cannot be delegated to generic ETL teams alone
- longer time before enterprise-wide semantic convergence
- duplicate-looking models across bounded contexts
- operational complexity around reconciliation and contract management
That last point bothers some leaders. “Why do we have several customer-like models?” Because the business has several customer-like realities. Pretending otherwise only moves complexity into hidden joins, undocumented logic, and political meetings.
A little duplication at the semantic edges is often cheaper than false unification in the center.
Failure Modes
Let’s be honest about how this goes wrong.
1. The central team becomes the semantic bottleneck
A platform or data office decides it will define enterprise meaning for all domains. It cannot. The queue fills, domain experts disengage, and the central model becomes a graveyard of compromises.
2. Raw ingestion is exposed as a product
Teams publish raw CDC topics or raw lake tables and call them strategic assets. Consumers build directly on them. Later, when source systems change, every downstream team breaks together. This is accidental tight coupling dressed as openness.
3. Canonical data model overreach
The enterprise tries to define a single canonical model for customer, order, product, invoice, shipment, and payment before stabilizing bounded contexts. Progress slows to a crawl. The model either becomes abstract and useless or precise and politically impossible.
4. Reconciliation is postponed
Programs go live on the assumption that mismatches will be “cleaned up later.” Later arrives as audit findings, billing disputes, and executive escalations.
5. Event enthusiasm outruns business state modeling
Teams emit lots of Kafka events but do not define lifecycle state, authority, or compensating behavior. They have streams without narrative.
6. Migration is treated as big bang
A new platform is declared the future. Legacy feeds are switched off too early. Hidden dependencies surface. Confidence collapses. People retreat to spreadsheets and side databases.
None of these are exotic failures. They are the mainstream ones.
When Not To Use
This approach is not mandatory for every situation.
Do not build a full domain-pipeline architecture when:
- the problem is purely analytical and not tied to operational semantics
- data volumes and business criticality are low
- there is one genuinely authoritative source and little need for cross-domain integration
- the organization lacks stable domain ownership entirely
- the cost of reconciliation and contract governance outweighs business value
For a small departmental reporting need, a simple ingestion-to-warehouse pattern may be enough. Not every dataset deserves a bounded context ceremony.
Likewise, if an enterprise is very early in platform maturity, it may need to first establish basic ingestion reliability, metadata, and security before it can sensibly pursue domain-oriented pipelines. You cannot do semantic elegance atop operational chaos.
The pattern is most valuable where multiple systems create overlapping business truth and where downstream decisions depend on trustworthy state, not just convenient access.
Related Patterns
Several adjacent patterns support this approach.
Data products
A domain pipeline often publishes data products with explicit ownership, documentation, SLA, quality metrics, and access policy. This aligns well with data mesh ideas, provided teams do not confuse mesh with semantic anarchy.
Event-carried state transfer
Useful when microservices and domains need timely propagation. Dangerous when teams publish internal implementation details instead of stable business events.
Change Data Capture
Excellent for ingestion. Not a domain model. CDC gets facts out of systems; it does not decide what they mean.
CQRS and materialized views
Helpful for presenting different read models to different consumers while preserving domain semantics behind them.
Master data management
Sometimes useful, especially for identity-heavy domains. But MDM should not become a universal flattening machine. It works best when scoped and aligned to bounded contexts.
Strangler fig pattern
Essential for migration. Replace consumer dependencies progressively rather than attempting one heroic cutover.
Summary
The lake is valuable. It is just not the architecture.
Architecture begins where semantics begin: in bounded contexts, in domain ownership, in business state transitions, in contracts that consumers can trust, and in reconciliation logic that explains disagreement instead of hiding it. Ingestion pipelines move data into reach. Domain pipelines make it meaningful.
That is the core distinction.
If you remember one line, make it this: store once if you like, but model where the business meaning lives.
Enterprises that miss this build enormous platforms that remain strangely hollow. They collect everything and decide very little. Enterprises that get it right treat raw capture as evidence, domain pipelines as interpretation, and reconciliation as a first-class control loop. They migrate progressively, strangling old interfaces without betting the company on a weekend cutover. They use Kafka where streaming helps, microservices where boundaries are real, and lakes where preservation and scale matter.
But they do not confuse infrastructure with architecture.
And that confusion, more than any tool choice, is what separates a modern-looking estate from a modern enterprise.