Data duplication is one of those problems that starts life looking harmless. A team copies a customer table to speed up reporting. Another creates a curated product dataset because the original source is too slow, too cryptic, too political, or all three. A machine learning group snapshots order history into its own lakehouse zone “just for feature engineering.” None of this feels dramatic on day one. In fact, it often feels pragmatic.
Then six months pass.
Now there are seven versions of “customer,” four definitions of “active account,” and two executive dashboards that disagree during the quarterly board meeting. The organization has not merely duplicated data. It has duplicated meaning. And once meaning forks, governance becomes less about storage and more about trust.
This is where many data mesh conversations go wrong. They celebrate domain ownership, product thinking, self-serve platforms, and federated governance—and rightly so—but become oddly romantic about duplication. The argument usually sounds modern: storage is cheap, compute is elastic, and teams should be autonomous. All true. But duplicated data is not free. It creates semantic drift, reconciliation overhead, compliance risk, and hidden coupling between domains that insist they are independent.
A good architect learns to ask a blunt question: what exactly are we duplicating? Raw facts? Derived views? Business events? Regulated attributes? Temporary caches? A domain-shaped projection? The answer matters because not all duplication is equal, and not all of it is bad.
In a mature data mesh, duplication is not prevented by decree. That would be naive. It is governed by intent, bounded by policy, and made visible through lineage, ownership, and explicit contracts. The goal is not to eliminate copies. The goal is to stop accidental copies from becoming accidental systems of record.
That distinction is the whole game.
Context
Data mesh emerged as a reaction to centralized data platforms that became bottlenecks. A central team collected data from every operational system, modeled it in one place, and became responsible for quality, access, transformations, onboarding, semantics, and often politics. It was a noble idea that usually collapsed under its own gravity.
Data mesh offers a more credible shape for the enterprise. Domains own their data as products. Platform teams provide paved roads. Governance becomes federated instead of fully centralized. Consumers discover and use data products without waiting for a central priesthood to bless every schema change.
But once domains own data products, duplication reappears in a new form.
In traditional enterprise data warehousing, duplication was often centralized and at least somewhat visible: staging, raw vaults, marts, aggregates, dimensions. In data mesh, duplication is distributed across domains, analytical platforms, Kafka topics, object stores, serving layers, caches, feature stores, and compliance zones. It is harder to see because it is justified locally. Every team can tell a sensible story about why its copy exists.
That is precisely why governance matters more, not less.
The architecture question is not “how do we stop teams from copying data?” The architecture question is “how do we distinguish legitimate domain-aligned replication from dangerous semantic fragmentation?”
This is where domain-driven design helps. DDD reminds us that terms are only meaningful within bounded contexts. “Customer” in billing is not the same as “customer” in support, risk, or marketing. If a marketing domain duplicates customer data to run segmentation, that may be entirely proper—provided the duplicated representation is clearly owned, explicitly derived, and not pretending to be the enterprise source of truth for legal identity, credit status, or consent.
The mistake is not duplication itself. The mistake is unmanaged duplication across unclear semantic boundaries.
Problem
Organizations adopting data mesh quickly encounter a cluster of recurring problems:
- Semantic drift
Teams duplicate data and then reinterpret it. Column names remain familiar while definitions diverge. “Order date” means the created date in one place and the paid date in another.
- Unknown systems of record
A duplicated dataset becomes operationally convenient and quietly turns into the place people trust most. Governance documents say one thing; behavior says another.
- Compliance sprawl
Personal data, financial data, and regulated fields propagate into stores that were never designed for retention controls, masking, right-to-erasure workflows, or regional residency.
- Reconciliation cost
The enterprise starts burning money and credibility comparing one dataset with another, trying to explain variances that are structural rather than incidental.
- Producer-consumer lock-in
Consumers make local copies because producers are unstable, slow, under-documented, or politically inaccessible. Duplication becomes a workaround for poor product quality.
- Platform opacity
Kafka topics, CDC pipelines, lakehouse tables, materialized views, and reverse ETL flows multiply faster than lineage can keep up.
The result is a common enterprise smell: lots of data, weak confidence, and endless meetings about whose numbers are right.
A data mesh does not solve this by centralizing all duplication decisions. That would recreate the very bottleneck it was meant to remove. It solves it by making duplication a first-class governed act.
Forces
Several forces pull in opposite directions.
Domain autonomy vs enterprise coherence
Data mesh says domains should move independently. Governance says the enterprise cannot tolerate each domain inventing incompatible definitions for core concepts without consequence. The tension is real and healthy. Good architecture does not erase it; it manages it.
Performance vs correctness
Teams often copy data because upstream systems cannot handle analytical workloads or high fan-out consumption. A local read model is faster, safer, and cheaper. Yet every copied read model risks drifting from the originating facts. You buy speed by taking on synchronization debt.
Event-driven decoupling vs duplication blast radius
Kafka and event streaming encourage downstream materialization. That is often exactly right. Publish domain events, let consumers project what they need. But event-driven architecture can also create a thousand silent copies, each with slightly different join rules, deduplication logic, and late-arrival handling.
Product thinking vs hidden platform complexity
A data product should be easy to consume. But making duplicated data safe requires metadata, lineage, policy enforcement, retention, schema compatibility, reconciliation controls, and ownership models. Under the hood, “simple” is expensive.
Local optimization vs legal reality
A product analyst may only want a convenient denormalized user table. Legal and security teams, however, care that the table includes consent markers, residency-constrained attributes, and deletion obligations. Regulation does not care whether the copy was “just for analytics.”
Bounded context vs enterprise master data fantasies
DDD teaches us that different contexts can maintain different models. But enterprises still have cross-cutting obligations around identifiers, consent, legal entity, accounting truth, and customer contactability. If you ignore those, bounded context becomes an excuse for semantic anarchy.
This is why data duplication governance sits at the intersection of data architecture, domain design, platform engineering, and risk management. It is not a technical footnote. It is part of operating the business.
Solution
The practical solution is to classify duplication by purpose and govern each class differently.
That sentence sounds tidy. The implementation is not. But it works.
At the core, data duplication governance in a data mesh should define:
- why duplication is allowed
- what semantic status the copy has
- who owns its quality and policy obligations
- how drift is detected
- when the copy must be retired
A useful model separates duplicated data into several categories:
1. Operational replication
Copies created for resilience, locality, or service performance. Examples include read replicas, CQRS projections, cache-friendly views, and search indexes.
These are acceptable when they are clearly non-authoritative for business truth and rebuilt from authoritative sources or events.
2. Analytical replication
Copies optimized for BI, data science, experimentation, or trend analysis. These may denormalize, aggregate, or reshape source data heavily.
These are acceptable when transformation logic is transparent, lineage is preserved, and policy inheritance is enforced.
3. Domain translation
A domain duplicates another domain’s facts but maps them into its own bounded context. For example, risk may consume customer onboarding events and construct a “subject profile” shaped for fraud models rather than CRM workflows.
This is not only acceptable; it is often necessary. But the translated model must never masquerade as the producer’s canonical representation.
4. Temporary migration duplication
Data copied during modernization, strangler transitions, ERP decompositions, or warehouse-to-mesh shifts.
These are useful and often unavoidable. They become dangerous when “temporary” survives three budget cycles.
5. Illicit shadow duplication
Unregistered extracts, spreadsheets, unmanaged object store dumps, local marts, ML side stores, or copied tables with no declared owner.
These are the real problem. Not because they are always malicious, but because they are invisible.
A governance model should attach explicit controls to each category:
- semantic label: authoritative, derived, projected, transient, or cache
- owner: producing domain, consuming domain, or platform
- freshness SLA
- retention and deletion obligations
- privacy classification
- reconciliation requirement
- approved downstream uses
- schema contract type
- decommission criteria
The memorable line here is simple: every copy needs a passport. If a duplicated dataset cannot state where it came from, what it means, who owns it, and when it expires, it is not a data product. It is a future incident.
Architecture
A workable architecture for duplication governance in data mesh has four layers:
- Source and event layer
- Data product and projection layer
- Governance and metadata layer
- Reconciliation and control layer
Source and event layer
Operational systems and microservices emit domain events or expose CDC streams. Kafka is especially useful here because it supports decoupled fan-out and replay. But that convenience must be tempered with schema discipline. If teams publish vague events like CustomerUpdated with ambiguous payloads and no compatibility governance, they create a duplication factory.
Where possible, events should express domain facts with clear semantics, not just database row mutations. CDC has its place, especially in migration and legacy extraction, but raw table changes are poor long-term contracts. They leak implementation details and force consumers to reconstruct meaning from mechanics.
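To make the contrast concrete, here is a sketch of the same fact expressed as a raw CDC row change and as a domain event. Table names, column codes, and event fields are invented for illustration:

```python
# A raw CDC payload leaks implementation details: physical table names,
# cryptic column codes, and before/after images that force every consumer
# to reconstruct business meaning from mechanics.
cdc_change = {
    "table": "CUST_MST_V2",
    "op": "U",
    "before": {"STAT_CD": "P", "UPD_TS": "2025-03-01T09:29:58Z"},
    "after": {"STAT_CD": "A", "UPD_TS": "2025-03-01T09:30:00Z"},
}

# A domain event states the business fact directly, with an explicit
# type and version that a schema registry can govern for compatibility.
domain_event = {
    "type": "CustomerActivated",
    "version": 2,
    "customer_id": "c-1842",
    "activated_at": "2025-03-01T09:30:00Z",
    "activation_channel": "online",
}


def requires_reverse_engineering(payload: dict) -> bool:
    """Crude heuristic: a payload that only describes row mutations
    forces consumers to infer business meaning from mechanics."""
    return "op" in payload and "table" in payload
```

Every consumer of the CDC payload must independently learn that `STAT_CD` moving from `P` to `A` means activation; the domain event says so once, for everyone.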
Data product and projection layer
Each domain publishes data products with explicit contracts. These may be batch tables, streaming views, APIs, or event topics. Consumers are free to create local projections, but those projections inherit governance metadata from their sources.
This is where DDD matters most. A consumer projection is not a bad copy if it is clearly a model within a different bounded context. For example, “customer risk profile” is not a duplicate of “customer master.” It is a derived domain representation built from customer-related facts. Governance must preserve that distinction.
Governance and metadata layer
This layer carries the real weight:
- catalog entries for all registered data products and sanctioned copies
- lineage across batch, stream, and API transformations
- policy tags for PII, PCI, health, residency, and retention
- data contracts and schema evolution policies
- ownership and stewardship metadata
- usage declarations
Without this layer, federated governance is just a slogan.
Reconciliation and control layer
Reconciliation deserves its own place because duplication without reconciliation is faith-based architecture.
Not every copy needs continuous record-level reconciliation. That would be wasteful. But critical duplicated datasets should have an explicit reconciliation strategy:
- aggregate checksums
- record counts by key period
- key coverage checks
- semantic variance thresholds
- event lag monitoring
- late-arrival and out-of-order tolerance rules
- exception workflows
A retail bank, for instance, may tolerate some lag between card transaction events and marketing propensity models. It cannot tolerate disagreement between ledger-affecting balances and customer-facing statements. Governance must encode such asymmetry.
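A few of the listed controls — record counts, key coverage, and variance thresholds — can be sketched as one coarse reconciliation pass. The function, field names, and tolerance below are illustrative assumptions:

```python
from decimal import Decimal


def reconcile(source_rows, copy_rows, key="id", amount="amount",
              variance_threshold=Decimal("0.00")):
    """Compare a copy against its source with coarse aggregate checks:
    record counts, key coverage, and total-amount variance.
    Returns a report rather than raising, so breaches can feed an
    exception workflow instead of crashing a pipeline."""
    src_keys = {r[key] for r in source_rows}
    cpy_keys = {r[key] for r in copy_rows}
    src_total = sum(r[amount] for r in source_rows)
    cpy_total = sum(r[amount] for r in copy_rows)
    variance = abs(src_total - cpy_total)
    return {
        "count_match": len(source_rows) == len(copy_rows),
        "missing_keys": src_keys - cpy_keys,     # keys the copy never received
        "unexpected_keys": cpy_keys - src_keys,  # keys the source never emitted
        "variance": variance,
        "within_tolerance": variance <= variance_threshold,
    }


# Hypothetical data: the copy is missing one record.
source = [{"id": 1, "amount": Decimal("100.00")},
          {"id": 2, "amount": Decimal("250.00")}]
copy_ = [{"id": 1, "amount": Decimal("100.00")}]

report = reconcile(source, copy_, variance_threshold=Decimal("0.01"))
```

The asymmetry the text describes maps directly onto the threshold: a marketing projection might run this check daily with a generous tolerance, while a ledger-affecting copy runs it per batch with a tolerance of zero.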
Migration Strategy
Most enterprises do not begin with a clean data mesh. They begin with a warehouse, a lake, several integration platforms, brittle ETL jobs, and an uncomfortable number of spreadsheets that are more business-critical than anyone admits. So duplication governance has to work during migration, not after some imagined future cleanup.
This is where the progressive strangler approach is the only sensible option.
You do not replace centralized data architecture in one move. You strangle it gradually by carving out domain-owned products, establishing metadata and policy controls, and governing duplication as the transition unfolds.
A practical migration sequence looks like this:
Step 1: Inventory the existing copies
Before introducing policy, discover reality. Identify duplicated datasets across warehouses, marts, lake zones, Kafka topics, reverse ETL pipelines, and service-owned stores. Group them by semantic subject: customer, product, order, account, claim, policy, supplier.
Do not start with technology. Start with business nouns.
The point is not a perfect inventory. The point is to expose the hidden topology of copies and decide which ones matter.
Step 2: Classify by duplication intent
For each copy, ask:
- Is this authoritative, derived, projected, cached, or transient?
- Who depends on it?
- What is the freshness expectation?
- Does it contain regulated data?
- Can it be rebuilt?
- Is it part of a migration path or just abandoned convenience?
This classification immediately separates healthy duplication from dangerous sprawl.
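Those questions can be encoded as a lightweight triage rule that runs over the inventory from Step 1. The field names and verdicts below are illustrative assumptions, not a standard:

```python
def triage_copy(copy: dict) -> str:
    """Separate healthy duplication from dangerous sprawl using the
    Step 2 questions. Rules are deliberately simple and illustrative."""
    if copy.get("owner") is None:
        return "shadow"                  # unowned copies are the real problem
    if copy.get("regulated") and not copy.get("policy_controls"):
        return "compliance-risk"         # regulated data without controls
    if copy.get("intent") == "migration" and copy.get("sunset_date") is None:
        return "zombie-temporary"        # "temporary" with no retirement criteria
    if not copy.get("rebuildable"):
        return "review"                  # opaque one-off copies need scrutiny
    return "sanctioned"


# Hypothetical inventory entries.
shadow = {"owner": None, "intent": "analytics"}
healthy = {"owner": "risk", "regulated": True, "policy_controls": True,
           "intent": "domain-translation", "rebuildable": True}
```

The exact rules matter less than the fact that the verdict is mechanical, repeatable, and arguable in a pull request rather than a steering committee.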
Step 3: Establish domain ownership and semantic boundaries
Attach each significant dataset to a domain and bounded context. This often reveals enterprise confusion. The CRM team may think it owns customer. The billing team may disagree. Legal may insist identity is separate from contact preference. Good. These arguments are architecture doing its job.
A domain map should decide where authoritative facts live and where translated models are expected.
Step 4: Introduce contracts and lineage on the most reused products
Do not try to govern every table at once. Start with highly shared products: customer, order, transaction, policy, inventory, product catalog. Add schemas, compatibility rules, ownership metadata, and policy tags.
For Kafka topics, that means schema registry discipline and event ownership. For batch products, it means catalog registration and lineage capture. For microservices, it means making API and event contracts explicit rather than tribal.
Step 5: Build sanctioned consumer projections
Where teams currently make unmanaged copies, provide a path to create approved projections. This is the practical compromise. If governance only says “no,” teams will route around it. If governance offers patterns, templates, and platform support, teams will usually comply.
Step 6: Add reconciliation where business risk justifies it
Not all duplication deserves the same controls. Prioritize:
- financial exposure
- customer communications
- compliance reporting
- regulatory submission
- critical ML decisioning
Step 7: Retire shadow copies through strangler replacement
As sanctioned data products mature, decommission old marts, extracts, side tables, and brittle ETL chains. Measure adoption. Make the retirement visible. If you do not actively remove old copies, migration only adds new layers without reducing complexity.
The strangler pattern matters because duplication is often highest during migration. There is old truth, new truth, and transition truth. Governance has to make that ambiguity survivable.
Enterprise Example
Consider a multinational insurer modernizing its claims, policy, and customer platforms.
Historically, the enterprise ran a central data warehouse fed by nightly ETL from core systems. Over time, regional business units built local marts. A fraud team created a separate claim-history store. Marketing copied customer and policy extracts into a SaaS CDP. Data science built feature tables in a cloud lakehouse. Meanwhile, a new claims platform introduced Kafka for near-real-time events.
On paper, the insurer had a data strategy. In practice, it had twelve versions of policy, five versions of claimant, and no consistent answer to a regulator’s question about data lineage.
The first instinct was to centralize harder. That would have failed. The warehouse team was already overloaded, and regions would not surrender autonomy.
So the insurer moved toward a domain-oriented model:
- Policy domain owned policy issuance, endorsements, and coverage facts.
- Claims domain owned claim lifecycle facts.
- Customer domain owned party identity and contactability.
- Fraud domain owned derived risk indicators and investigation outcomes.
- Finance domain owned booked financial truth.
Then came the important architectural move: they stopped talking about “single source of truth” as if it were one database. Instead, they defined authoritative facts by domain and permitted translated representations by context.
For example:
- Customer domain remained authoritative for legal identity and consent status.
- Claims domain duplicated selected customer attributes for workflow efficiency, but those were labeled as projected and non-authoritative.
- Fraud domain built a “claimant risk profile” by combining claims, customer, and external watchlist events. It was explicitly a derived model, not a master customer record.
Kafka helped, but only after discipline arrived. Early on, teams published broad events with poorly versioned payloads. Consumers interpreted them differently, creating more semantic spread. The insurer introduced schema registry rules, event ownership, domain event standards, and a catalog requirement for every consumer projection above a certain data volume or business criticality.
Reconciliation was where the architecture paid off. Claims and finance inevitably disagreed on timing because operational events and booked entries followed different processes. Instead of forcing fake real-time consistency, the insurer defined acceptable variance windows and reconciliation checkpoints:
- intraday projected claim reserves could diverge temporarily
- booked finance positions had end-of-day authority
- exceptions above threshold triggered investigation workflows
This was not elegant in the abstract. It was effective in the enterprise.
Within eighteen months, the insurer retired several regional marts, reduced duplicate PII stores, improved lineage for regulatory audits, and—most importantly—stopped arguing endlessly about whether a fraud model’s customer view was “wrong.” It was not wrong. It belonged to a different bounded context and was governed as such.
That is what mature duplication governance looks like: not less data, but less confusion.
Operational Considerations
Good governance dies quickly if it remains a slide deck. It has to become operational.
Metadata capture must be automatic where possible
Manual registration sounds virtuous and scales terribly. Platform tooling should auto-register topics, tables, storage objects, schema changes, and pipeline lineage. Humans should enrich semantics and ownership, not enter every technical detail by hand.
Policy enforcement should be embedded in the platform
PII tagging, masking, retention, deletion propagation, access control, and residency constraints should travel with the data product. If policy depends on each team remembering a checklist, failure is only a matter of time.
SLOs for freshness and reconciliation matter
Consumers need explicit expectations:
- event delivery lag
- batch publication windows
- projection refresh cadence
- reconciliation completion times
- acceptable variance ranges
The hidden truth is that many duplication disputes are really expectation disputes.
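Making the expectation explicit can be as simple as classifying lag against a declared SLO. The grace band of 10% below is an arbitrary illustrative choice:

```python
from datetime import datetime, timedelta, timezone


def freshness_status(last_refresh: datetime, sla: timedelta,
                     now: datetime) -> str:
    """Classify a projection against its declared freshness SLO.
    A grace band (here 10% of the SLA, an arbitrary choice) separates
    'degraded' from an outright breach worth paging someone for."""
    lag = now - last_refresh
    if lag <= sla:
        return "fresh"
    if lag <= sla + sla * 0.1:
        return "degraded"
    return "breached"


# Hypothetical check: a daily projection refreshed 25 hours ago.
now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
status = freshness_status(
    last_refresh=now - timedelta(hours=25),
    sla=timedelta(hours=24),
    now=now,
)
```

With this in place, "your copy is stale" stops being an opinion and becomes a measurement against a number both producer and consumer agreed to.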
Schema evolution needs governance without paralysis
Kafka producers, APIs, and analytical tables evolve. Backward-compatible change should be easy. Breaking change should be visible, governed, and sometimes expensive. If all change is blocked, teams create shadow copies to move around governance. If all change is free, consumers break silently.
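The easy/visible split can be automated with a compatibility check in CI. This is a deliberately minimal sketch of backward compatibility — real schema registries apply richer rules, and the schema shape here is an invented convention:

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Minimal backward-compatibility rule: existing consumers keep working
    if no field is removed, no field changes type, and every new field
    is optional. Anything else is a breaking change that should be
    visible, governed, and sometimes expensive."""
    for name, spec in old_schema.items():
        if name not in new_schema:
            return False                 # removing a field breaks consumers
        if new_schema[name]["type"] != spec["type"]:
            return False                 # changing a type breaks consumers
    for name, spec in new_schema.items():
        if name not in old_schema and spec.get("required", False):
            return False                 # a new required field breaks old data
    return True


# Hypothetical order schema and two candidate evolutions.
v1 = {"order_id": {"type": "string", "required": True},
      "total": {"type": "decimal", "required": True}}
v2_ok = {**v1, "channel": {"type": "string", "required": False}}
v2_bad = {"order_id": {"type": "string", "required": True}}  # dropped "total"
```

A pipeline that merges `v2_ok` automatically but forces `v2_bad` through an explicit deprecation process is exactly "governance without paralysis."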
Deletion and correction workflows are non-negotiable
This is especially true for privacy regimes. If a source domain corrects or deletes a record, duplicated products need a propagation model. Some copies can be rebuilt. Some need compensating events. Some must be physically purged. Governance must make this explicit.
Rebuildability is a strategic property
A copy that can be replayed from events or reconstructed from source facts is safer than one maintained through opaque one-off scripts. Architects should prefer duplication patterns that preserve rebuild paths.
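Rebuildability is easiest to see in an event-sourced projection: the copy is a pure function of the event stream, so it can always be thrown away and replayed. The event shapes below are illustrative:

```python
def rebuild_projection(events):
    """Rebuild a derived read model purely from replayed events.
    If a copy can be reconstructed this way, losing or corrupting it
    is an inconvenience, not an incident."""
    balances = {}
    for e in sorted(events, key=lambda e: e["seq"]):  # replay in order
        acct = e["account"]
        if e["type"] == "Credited":
            balances[acct] = balances.get(acct, 0) + e["amount"]
        elif e["type"] == "Debited":
            balances[acct] = balances.get(acct, 0) - e["amount"]
    return balances


# Hypothetical event stream, deliberately out of order to show that
# replay order comes from the events, not from the copy's history.
events = [
    {"seq": 1, "type": "Credited", "account": "a-1", "amount": 100},
    {"seq": 3, "type": "Debited", "account": "a-1", "amount": 30},
    {"seq": 2, "type": "Credited", "account": "a-2", "amount": 50},
]
projection = rebuild_projection(events)
```

Contrast this with a copy maintained by opaque one-off scripts: when that copy drifts, there is no authoritative way to regenerate it, only archaeology.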
Tradeoffs
There is no pure solution here. Only tradeoffs made consciously or accidentally.
Governance increases friction
It adds metadata work, approval flows, ownership obligations, and platform constraints. Some teams will feel slowed down. They are not entirely wrong.
Too much control recreates the central bottleneck
If every duplicated dataset needs a committee meeting, the mesh collapses into old-school data governance theater.
Too little control produces semantic entropy
Autonomy without visibility produces exactly the kind of fragmented trust that data mesh was supposed to cure.
Event-driven duplication improves decoupling but can multiply state
Kafka lets consumers own their read models. Excellent. It also creates many local truths. Without contracts and reconciliation, that freedom turns expensive.
Canonical models reduce variance but often erase domain nuance
Enterprises love canonical schemas because they promise harmony. In practice, they often flatten real domain distinctions and become bureaucratic choke points. Bounded contexts are usually healthier than one giant enterprise ontology pretending to fit everyone.
My opinion is straightforward: prefer shared facts, local models. Govern the relationship between them ruthlessly.
Failure Modes
The most common failure modes are painfully familiar.
“Every copy is a data product”
No. Some copies are junk drawers with better branding. If a dataset lacks ownership, contract, discoverability, and lifecycle intent, calling it a product is marketing.
“Lineage later”
Lineage postponed is lineage abandoned. Once pipelines and topics sprawl, reconstructing derivation becomes archaeology.
“Kafka solves duplication”
Kafka distributes facts. It does not govern semantics. It can just as easily amplify duplication chaos.
“One golden customer”
This usually means one political compromise schema that satisfies nobody and drives teams to create local versions anyway.
“Temporary migration stores”
Temporary data stores have a way of reaching retirement age. If they do not have explicit sunset criteria, they are permanent.
“Reconciliation is too expensive”
Then disagreement will be even more expensive. The issue is not whether to reconcile, but where to apply it proportionately.
“Governance owned by a central data office alone”
Federated governance means domains share accountability. A central team can define standards and platforms, but domains must own the meaning and quality of what they publish and duplicate.
When Not To Use
There are situations where elaborate duplication governance in a data mesh is simply too much architecture.
Small organizations with limited domains
If you have a handful of systems, one analytics team, and little domain separation, a lighter centralized model may be better.
Low-regulation, low-criticality analytics
If data is mostly ephemeral product telemetry for experimentation and mistakes carry low consequence, heavy reconciliation and duplication registration may not be worth it.
Organizations without real domain ownership
If “domain” is just a renamed application team with no business accountability, data mesh governance will become ceremony. Fix the operating model first.
Very immature platform capabilities
If you lack cataloging, lineage, policy automation, and contract tooling, declaring federated duplication governance may be aspirational fiction. Start by building platform foundations.
Stable centralized environments that already work
Not every enterprise needs a mesh. If a central warehouse model serves the business well, do not replace it to follow fashion. Architecture should solve your problems, not someone else’s conference talk.
Related Patterns
Several related patterns support duplication governance well:
- Bounded Contexts
Essential for distinguishing semantic translation from inconsistency.
- CQRS and Materialized Views
Useful for sanctioned operational duplication when write authority remains clear.
- Event Sourcing and Replayable Streams
Powerful when rebuildability matters, though not always necessary.
- Data Contracts
Critical for making producer-consumer expectations explicit.
- Master Data Management
Still relevant for a narrow set of cross-enterprise identity and reference concerns, but should not become an excuse to centralize everything.
- Strangler Fig Migration
The right way to move from warehouse-centric or monolithic data integration toward domain-aligned products.
- Data Reconciliation Services
Often overlooked, but vital for high-risk duplication scenarios.
Summary
Data duplication in a data mesh is not a design flaw. It is a design reality.
The real question is whether duplication is intentional, bounded, and governable, or whether it is just sediment left behind by organizational drift. Enterprises get into trouble when they confuse local convenience with enterprise truth, or when they celebrate autonomy while neglecting meaning.
The right architectural stance is neither “ban copies” nor “copies are cheap.” It is this: duplicate facts when necessary, duplicate semantics with extreme care.
Use domain-driven design to define bounded contexts. Let domains publish authoritative facts. Allow consumers to build local projections and translated models. Govern each copy with metadata, policy, ownership, and lifecycle. Use Kafka and streaming where event-driven materialization helps, but do not mistake transport for governance. Reconcile where risk demands. Migrate progressively with a strangler approach. And retire temporary duplication before it becomes institutional folklore.
A healthy data mesh does not pretend there is only one version of reality. It accepts that different domains see the world differently.
But it insists they say so clearly.
Frequently Asked Questions
What is a data mesh?
A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.
What is a data product in architecture terms?
A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.
How does data mesh relate to enterprise architecture?
Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.