Data Platform Migration Requires Reconciliation


Most data platform migrations fail for a mundane reason: not because the technology is weak, not because the cloud bill explodes, and not because Kafka was configured badly on a Friday night. They fail because teams treat migration as motion instead of judgment.

Moving data is easy. Knowing whether it means the same thing after the move is the hard part.

That distinction matters more than architects like to admit. In enterprise programs, the old platform is rarely just “legacy.” It is sediment. Every report, every compliance extract, every customer segmentation rule, every half-forgotten finance adjustment is embedded in it. The platform doesn’t merely store data; it encodes business history, operational exceptions, and institutional compromise. A migration that ignores that reality becomes a very expensive form of self-deception.

This is why reconciliation sits at the center of any serious data platform migration. Not as an afterthought. Not as a QA phase. As the architecture.

If you are replacing a warehouse, introducing a lakehouse, moving from batch ETL to streaming, or splitting a monolithic integration hub into domain-aligned pipelines, the essential problem remains the same: for some period, two worlds will coexist. The old one pays the bills. The new one promises a better future. Between them lies a dangerous stretch of water where the same customer, order, claim, or product may appear twice, differently, or not at all.

A good migration architecture acknowledges this. A bad one waves it away with “parallel run” slides and optimistic color coding.

The right mental model is not “copy and cut over.” It is “progressive strangler migration with explicit reconciliation of business semantics.” Domain-driven design helps here because it forces the question many data programs avoid: what does this data actually mean inside a business domain, and who is allowed to define that meaning? Once you ask that question, the migration becomes less about tables and pipelines and more about preserving bounded contexts while changing the machinery underneath.

That is the real work.

Context

Most enterprises inherit a layered data estate that was never designed as a whole. There is usually a central warehouse, one or more operational data stores, an ETL tool everyone complains about but nobody dares to remove, a reporting landscape full of edge cases, and increasingly a newer stack: Kafka, cloud object storage, a lakehouse engine, dbt-style transformations, and domain-oriented microservices emitting events.

The pressure to migrate comes from several directions at once:

  • legacy warehouse cost or end-of-life
  • inability to support near-real-time use cases
  • business demand for self-service analytics
  • platform consolidation after mergers
  • modernization toward cloud-native services
  • poor developer productivity in tightly coupled ETL estates
  • need for better lineage, governance, and auditability

Yet “modernization” is too blunt a word. Replacing an old warehouse with a new one is not like replacing a web server. Data platforms sit downstream of almost everything. Their behavior is inferred by hundreds of consumers, many undocumented. Every migration touches finance, operations, regulatory reporting, customer management, and executive metrics. If a microservice migration breaks a minor endpoint, a team can route around it. If a data migration alters margin calculations by 0.8%, the CFO notices.

This is why the migration architecture must begin with semantics and trust, not technology choices.

Problem

The usual migration story sounds simple:

  1. replicate source data to the new platform
  2. rebuild transformations
  3. validate outputs
  4. cut over reports and downstream systems
  5. retire legacy

In practice, every one of those steps is ambiguous.

What counts as the same data? Row count parity is not semantic parity. A customer record may have equal counts but differ in survivorship logic. Revenue may align at a daily level and still be wrong by product family because time zone handling changed. Claims may reconcile by total amount but fail materially because status transitions are interpreted differently in an event stream than in a batch snapshot.
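The time zone case is easy to demonstrate. The sketch below is illustrative, not taken from any real system: the same sales events aggregated by UTC day and by a local-time day reconcile perfectly in total, yet disagree bucket by bucket.

```python
from collections import defaultdict
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical sales events: (event timestamp in UTC, amount)
events = [
    (datetime(2024, 3, 1, 23, 30, tzinfo=timezone.utc), 100.0),
    (datetime(2024, 3, 2, 3, 0, tzinfo=timezone.utc), 200.0),   # still Mar 1 in New York
    (datetime(2024, 3, 2, 15, 0, tzinfo=timezone.utc), 50.0),
]

def daily_revenue(events, tz):
    """Bucket revenue by calendar day in the given time zone."""
    totals = defaultdict(float)
    for ts, amount in events:
        totals[ts.astimezone(tz).date().isoformat()] += amount
    return dict(totals)

utc_view = daily_revenue(events, timezone.utc)
local_view = daily_revenue(events, ZoneInfo("America/New_York"))

# Grand totals reconcile perfectly...
assert sum(utc_view.values()) == sum(local_view.values()) == 350.0
# ...but the daily buckets disagree: 2024-03-02 holds 250.0 in UTC, only 50.0 locally.
```

Row counts and grand totals pass; the daily numbers a product manager actually reads do not. That is semantic drift in miniature.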

The hard problems are usually these:

  • source systems are inconsistent and have been “normalized” implicitly by old ETL logic
  • the legacy platform contains undocumented business rules
  • batch pipelines collapse multiple operational events into a single state
  • the new platform introduces event-driven models with different timing characteristics
  • reference data changes arrive in a different order
  • historical backfills cannot reproduce the exact transactional sequence
  • downstream consumers rely on quirks they mistake for business truth

This creates the central migration risk: the old and new platforms may both be internally coherent yet disagree materially.

That is why reconciliation is not just data quality. It is controlled comparison between two interpretations of business reality.

Forces

A good architecture emerges from the forces acting on it. Data platform migration has several, and they pull in different directions.

1. Business continuity versus architectural improvement

The enterprise wants a better platform, but it also wants yesterday’s reports to match today’s reports. Those are not naturally compatible goals. Real modernization changes storage engines, processing models, partitioning strategies, and often data ownership boundaries. The more you improve the architecture, the more you risk drift.

2. Domain autonomy versus enterprise consistency

Domain-driven design argues, correctly, that sales, finance, supply chain, and customer service each have different models. “Customer” in marketing is not the same thing as “customer” in billing. During migration, however, executives often ask for one canonical truth. This tension must be handled honestly. A migration cannot solve enterprise ontology by decree.

3. Streaming freshness versus reconciled correctness

Kafka and event-driven microservices enable low-latency pipelines. But speed reveals ambiguity. Streaming systems surface late events, duplicate events, out-of-order arrivals, and evolving schemas. Batch systems often hid these issues through overnight compaction. The new platform is not necessarily wrong when it differs faster; it may simply be more honest.

4. Progressive delivery versus cutover simplicity

A big-bang migration is easy to describe and hard to survive. A progressive strangler migration is safer but creates a period of dual pipelines, duplicate cost, and operational complexity. Enterprises must be willing to pay that temporary tax.

5. Governance versus team throughput

The migration needs lineage, controls, certification, and audit evidence. It also needs rapid iteration because hidden rules will be discovered continuously. Too much centralized governance slows learning. Too little governance destroys trust.

These forces are why simplistic migration templates age badly. The architecture must hold tension, not pretend it doesn’t exist.

Solution

The core solution is a dual-pipeline migration architecture with explicit reconciliation as a first-class capability, executed through a progressive strangler approach and organized around domain semantics.

In plain language:

  • keep the legacy platform running
  • build the new platform in parallel
  • route the same source data into both worlds, directly or via shared ingestion
  • compare outputs at multiple semantic levels
  • cut over domain by domain, use case by use case
  • preserve a rollback path until confidence is earned, not declared

This is not glamorous. It is disciplined.

The migration architecture should separate four concerns:

  1. Ingestion: capture source data reliably from operational systems, logs, CDC streams, files, and service events.
  2. Domain transformation: convert raw records into domain-aligned models with explicit business meaning.
  3. Reconciliation: compare legacy and new outputs using domain-specific rules, tolerances, and exception workflows.
  4. Consumption and cutover: shift reports, APIs, ML features, and downstream extracts progressively as each domain proves equivalence or justified difference.

The key word is justified. Reconciliation is not always about forcing equality. Sometimes the new platform is intentionally different because the old logic was wrong, inconsistent, or impossible to support in real time. In those cases, reconciliation must produce an accountable explanation: what changed, why it changed, who approved it, and which consumers are affected.

That turns migration from a technical rewrite into a governed business change.

Architecture

A pragmatic migration architecture usually looks like this:

[Diagram: dual-pipeline migration architecture, with legacy and new paths running in parallel from shared ingestion through reconciliation to cutover]

The picture matters because it makes one truth unavoidable: for a while, you will pay for both paths.

That is not waste. It is insurance.

Ingestion layer

The ingestion layer should be designed to minimize divergence caused by the migration itself. Where possible, both legacy and new platforms should read from the same extracted change streams or source snapshots. If the old platform uses one extraction logic and the new platform uses another, teams often end up reconciling ingestion differences instead of business transformations.

CDC is especially valuable here for transactional systems. It gives a more faithful account of changes than daily full extracts, and it aligns well with Kafka-based distribution. But CDC brings its own failure modes: reordering, transaction boundary handling, schema drift, and deletes modeled inconsistently. None of these are reasons to avoid it. They are reasons to model it carefully.
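Modeling it carefully mostly means making replay deterministic. The sketch below shows one way to apply CDC events into current state while tolerating duplicates, out-of-order arrival, and explicit deletes; the event shape (`pk`, `lsn`, `op`, `row`) is an assumption for illustration, not any specific connector's format.

```python
# Assumed CDC event shape: {"pk": ..., "lsn": ..., "op": ..., "row": ...}
def apply_cdc(events):
    """Replay CDC events into current state, keeping only the highest
    log sequence number (LSN) seen per key so duplicates and stale
    out-of-order events cannot corrupt the result."""
    state, last_lsn = {}, {}
    for ev in events:
        pk, lsn, op = ev["pk"], ev["lsn"], ev["op"]
        if lsn <= last_lsn.get(pk, -1):
            continue  # duplicate or stale event for this key: skip
        last_lsn[pk] = lsn
        if op == "delete":
            state.pop(pk, None)  # model deletes explicitly, not as null rows
        else:                    # "insert" or "update"
            state[pk] = ev["row"]
    return state

events = [
    {"pk": 1, "lsn": 10, "op": "insert", "row": {"status": "open"}},
    {"pk": 1, "lsn": 12, "op": "update", "row": {"status": "closed"}},
    {"pk": 1, "lsn": 11, "op": "update", "row": {"status": "pending"}},  # late: ignored
    {"pk": 1, "lsn": 12, "op": "update", "row": {"status": "closed"}},   # duplicate: ignored
    {"pk": 2, "lsn": 13, "op": "insert", "row": {"status": "open"}},
    {"pk": 2, "lsn": 14, "op": "delete", "row": None},
]

assert apply_cdc(events) == {1: {"status": "closed"}}
```

Real connectors add transaction boundaries and schema evolution on top, but the discipline is the same: decide the ordering rule and the delete semantics before both platforms start consuming the stream.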

Domain transformation layer

This is where domain-driven design earns its keep. Instead of rebuilding the entire warehouse as one giant transformation graph, organize the new platform around bounded contexts: Orders, Claims, Payments, Inventory, Customer Support, and so on. Each domain owns the transformation from technical source data into business-relevant data products.

That ownership matters because reconciliation can then happen against domain semantics rather than anonymous tables. The question becomes “Does recognized revenue in Finance match approved migration tolerances?” not “Do 14 columns in fact_sales look similar?”

A migration that lacks bounded contexts tends to centralize every semantic dispute into one overworked platform team. That team becomes a bottleneck and, worse, an accidental owner of business meaning. Architects should resist that pattern.

Reconciliation layer

The reconciliation capability is the heart of the architecture. It should compare data across several levels:

  • technical parity: counts, schema conformity, null rates, duplicates
  • entity parity: customer, order, invoice, claim, shipment identity and state
  • aggregate parity: daily revenue, inventory position, aged receivables, open tickets
  • semantic parity: domain rules such as “a claim is open if last approved status precedes settlement”
  • timing parity: expected lag windows, late-arriving tolerance, event-time versus processing-time differences

A proper reconciliation engine is not one SQL script. It is a managed capability with rule definitions, thresholds, lineage, evidence, and exception handling.
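To make that concrete, here is a minimal sketch of rule-driven comparison; the rule and result shapes are invented for the example, not a real product's API. Each rule names its parity level and tolerance, and every evaluation produces an evidence row rather than a bare pass/fail.

```python
def reconcile(legacy, target, rules):
    """Evaluate each rule against both datasets and return evidence rows."""
    results = []
    for rule in rules:
        old, new = rule["metric"](legacy), rule["metric"](target)
        diff = abs(old - new)
        results.append({
            "rule": rule["name"],
            "level": rule["level"],       # technical, entity, aggregate, semantic, timing
            "legacy": old,
            "target": new,
            "diff": diff,
            "passed": diff <= rule["tolerance"],
        })
    return results

legacy = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 250.5}]
target = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 250.0}]

rules = [
    {"name": "row_count", "level": "technical",
     "metric": len, "tolerance": 0},
    {"name": "total_amount", "level": "aggregate",
     "metric": lambda rows: sum(r["amount"] for r in rows), "tolerance": 1.0},
]

report = reconcile(legacy, target, rules)
assert all(r["passed"] for r in report)  # counts match exactly; totals within tolerance
```

The point is the shape, not the code: rules are data, tolerances are explicit, and every run leaves evidence that can be shown to audit later.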

[Diagram 2: reconciliation layer comparing legacy and new outputs across parity levels, with exception routing]

This is one of those enterprise capabilities that sounds bureaucratic until you need to explain to audit why month-end reports changed after migration. Then it becomes very interesting very quickly.

Consumption and cutover layer

Consumers should move progressively. Some can be switched early, especially exploratory analytics and internal data science workloads. Others, especially regulatory reporting and financial close processes, should move late and with evidence.

A useful tactic is to front some consumers with a compatibility contract: same business interface, switchable underlying source. This is the strangler pattern applied to data consumption. It reduces consumer rewrites and lets the migration team manage cutover centrally.
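A minimal sketch of such a compatibility contract, with hypothetical names: consumers call one stable business interface, and the migration team flips the backing source centrally at cutover.

```python
class RevenueReport:
    """Stable business interface; the backing platform is switchable.
    Consumers never learn which platform answered."""
    def __init__(self, legacy_reader, new_reader, use_new=False):
        self._legacy, self._new = legacy_reader, new_reader
        self.use_new = use_new  # flipped centrally at cutover, not by consumers

    def daily_revenue(self, day):
        reader = self._new if self.use_new else self._legacy
        return reader(day)

# Stand-ins for a warehouse query and a lakehouse query.
legacy_reader = lambda day: 1000.0
new_reader = lambda day: 1000.0

report = RevenueReport(legacy_reader, new_reader)
assert report.daily_revenue("2024-03-01") == 1000.0  # served by legacy
report.use_new = True                                # cutover: no consumer change
assert report.daily_revenue("2024-03-01") == 1000.0  # now served by the new platform
```

In practice the switch lives behind a view, an API gateway, or a semantic layer rather than a flag on a class, but the property is the same: cutover becomes a configuration change, not a consumer rewrite.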

Migration Strategy

The migration strategy should be progressive, domain-led, and explicitly reversible.

A good sequence often looks like this:

[Diagram: progressive migration sequence, from shared ingestion through domain-by-domain dual run to consumer cutover and legacy retirement]

Step 1: establish shared ingestion

Before rebuilding dozens of transformations, stabilize extraction and delivery. If Kafka is part of the target architecture, use it as a distribution backbone, not as a religion. Not every source needs event-native modeling immediately. Some domains are better served initially by CDC topics plus batch backfills.

The point is to make source movement observable and repeatable.

Step 2: choose a bounded context, not a platform slice

Do not migrate “all bronze tables” or “all ETL jobs from tool X.” Those are implementation-centered slices. Choose a domain with clear business sponsorship and manageable semantics, such as Customer Support interactions or Product Catalog. You want a domain where differences can be explained by people who understand the business.

Step 3: define reconciliation contracts up front

Teams often make the mistake of building the new pipeline first and discussing reconciliation later. That is backwards. The reconciliation contract should state:

  • which entities and aggregates matter
  • which tolerances are acceptable
  • what timing windows are expected
  • which differences are defects versus approved improvements
  • who signs off

This avoids endless debates after the first mismatches appear.
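One way to capture such a contract before any pipeline exists is simply as data. The field names below are illustrative, not a standard schema, but they force the right conversations: tolerances, timing windows, approved differences, and who signs off.

```python
# A hypothetical reconciliation contract for the Claims domain, agreed
# before the new pipeline is built.
claims_contract = {
    "domain": "Claims",
    "entities": ["claim"],
    "aggregates": ["open_claim_count", "reserve_amount_by_month"],
    "tolerances": {
        "open_claim_count": {"type": "absolute", "value": 0},
        "reserve_amount_by_month": {"type": "relative", "value": 0.001},  # 0.1%
    },
    "timing": {"expected_lag": "PT15M", "late_data_window": "P2D"},
    "approved_differences": [
        {"id": "TZ-001",
         "description": "UTC normalization shifts some regional daily buckets",
         "approved_by": "finance-data-steward"},
    ],
    "sign_off": ["claims-operations", "finance-controller"],
}

def within_tolerance(contract, aggregate, legacy, target):
    """Check one aggregate against the contracted tolerance."""
    tol = contract["tolerances"][aggregate]
    diff = abs(legacy - target)
    limit = tol["value"] if tol["type"] == "absolute" else tol["value"] * abs(legacy)
    return diff <= limit

assert within_tolerance(claims_contract, "open_claim_count", 1042, 1042)
assert within_tolerance(claims_contract, "reserve_amount_by_month", 1_000_000, 1_000_900)
```

Whether this lives in YAML, a governance tool, or a wiki matters far less than the fact that it exists, is versioned, and is signed before the first mismatch appears.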

Step 4: run dual pipelines long enough to learn

Parallel run should not be ceremonial. It should be long enough to capture end-of-month, quarter-close, seasonal peaks, and known exception patterns. If your business has promotions, renewals, claim cycles, or inventory counts, your dual run must cross those boundaries.

Short parallel runs mostly prove that happy-path data exists.

Step 5: cut over by consumer risk

Low-risk consumers first. Executive scorecards and regulatory outputs last. There is no heroism in switching critical finance reports early just to satisfy a milestone chart.

Step 6: retire old logic only after semantic evidence

A legacy transformation should be retired when:

  • its downstream consumers have moved
  • reconciliation has passed over sufficient operating cycles
  • exceptions are understood
  • rollback cost is acceptable
  • lineage and controls exist in the new platform

Retirement should be a governance event, not just a deletion.

Enterprise Example

Consider a global insurer migrating from an on-premises enterprise data warehouse to a cloud lakehouse with Kafka-based event ingestion and domain-oriented data products.

The legacy warehouse had grown over fifteen years. Claims, policies, billing, agent management, and finance all fed nightly ETL jobs. The warehouse produced regulatory reports, reserve analysis, fraud features, and executive dashboards. The modernization goal was to support near-real-time claims visibility and reduce dependence on brittle ETL tooling.

On paper, the answer looked obvious: stream operational events into Kafka, land raw data in object storage, transform with scalable compute, and retire the old warehouse. In the steering committee deck, it was five boxes and three arrows.

Reality was less polite.

The claims domain exposed the first semantic crack. In the policy administration system, claim status changes arrived as events. In the old warehouse, nightly ETL condensed these into a single “current status” based on a sequence of rules nobody had documented well. Some reversals were ignored. Some reopenings were folded into the original claim lifecycle. Some regional systems wrote timestamps in local time and others in UTC. The finance team had adapted to these quirks over a decade. They were not visible in source schemas, only in outcomes.

If the new streaming platform simply replayed events into a current-state table, open-claim counts differed from the warehouse. Not wildly. Just enough to trigger alarm.
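The disagreement is easy to reproduce in miniature. The status vocabulary and flattening rules below are invented for the example, but they show how naive last-event replay and legacy-style batch flattening can both be internally coherent and still produce different "current" states.

```python
# One claim's event history, oldest first.
events = ["opened", "approved", "reversed", "approved", "settled", "reopened"]

def naive_current_status(events):
    """Streaming replay: the last event wins."""
    return events[-1]

def batch_current_status(events):
    """Legacy-style flattening: reversals were never surfaced downstream,
    and a reopening was folded back into the lifecycle as 'open'."""
    status = None
    for ev in events:
        if ev == "reversed":
            continue  # quirk: reversals silently ignored by the old ETL
        status = "open" if ev == "reopened" else ev
    return status

# Both are internally coherent -- and they disagree.
assert naive_current_status(events) == "reopened"
assert batch_current_status(events) == "open"
```

Neither function is "wrong" in isolation. Which one the business means is exactly the question reconciliation exists to answer.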

A naive team would have declared the old logic broken and forced adoption of the new number. That would have been technically defensible and organizationally disastrous.

Instead, the insurer created domain reconciliation for Claims. Business stewards, actuaries, claims operations, and data engineers defined several comparison layers:

  • claim identity and survivorship
  • open/closed/reopened state semantics
  • reserve amount by event date and accounting date
  • regional timing tolerances
  • approved exceptions where the new model intentionally corrected old defects

For six months, both pipelines ran. Kafka carried claim events from source systems into the new platform. The legacy ETL continued nightly. Reconciliation reports showed where counts diverged, by region and product line. Many gaps were traced to time zone normalization, duplicate event suppression, and state transitions previously flattened by batch logic. Some differences were accepted as improvements. Others drove changes in the new transformations to preserve finance meaning.

The final migration did not cut over “the claims warehouse table.” It cut over specific consumers in sequence:

  1. operational dashboards for claims managers
  2. fraud feature pipelines
  3. reserve analysis workbench
  4. finance close reports
  5. regulatory extracts

That order mattered. The architecture reduced enterprise risk by respecting domain semantics and consumer criticality.

This is what mature migration looks like. Less revolution. More controlled replacement of the ship while still at sea.

Operational Considerations

A dual-pipeline migration is operationally heavy. Anyone saying otherwise is either inexperienced or selling software.

Observability

You need more than pipeline success metrics. Watch:

  • source lag
  • CDC transaction gaps
  • Kafka consumer lag
  • schema evolution events
  • watermark progression
  • row and entity-level reconciliation rates
  • late data patterns
  • drift trends by domain and consumer
  • defect aging for unresolved mismatches

Make reconciliation visible. If divergence only appears in an analyst’s spreadsheet three weeks later, the architecture is already failing.

Metadata and lineage

Lineage is not decoration during migration. It is the explanation engine. When revenue differs between platforms, the first question is not “which number is right?” It is “which logic produced each number?” Without lineage, every difference becomes an argument from authority.

Exception workflow

Some reconciliation failures deserve engineering fixes. Others need business decisions. Treating all mismatches as technical incidents overwhelms the platform team and hides semantic disagreements. A proper operating model routes exceptions to domain stewards where needed.

Data contracts and schema change management

Kafka and microservices bring healthy decentralization, but schema changes can become migration landmines. Contracts should define compatibility expectations, deprecation windows, and ownership. A consumer-driven free-for-all is a bad match for high-stakes data migration.
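A minimal sketch of what enforcing such a contract can look like; the schema representation here is a simplified assumption (real registries compare full serialization schemas, not name-to-type maps). The rule shown is backward compatibility: a new version may add fields but must not drop or retype existing ones.

```python
def backward_compatible(old_schema, new_schema):
    """Every field the old schema promised must still exist with the
    same type in the new schema; additive changes are allowed."""
    return all(
        new_schema.get(field) == ftype
        for field, ftype in old_schema.items()
    )

v1 = {"claim_id": "string", "amount": "decimal"}
v2 = {"claim_id": "string", "amount": "decimal", "region": "string"}  # additive: fine
v3 = {"claim_id": "string"}                                           # drops a field: breaking

assert backward_compatible(v1, v2)
assert not backward_compatible(v1, v3)
```

Wiring a check like this into CI for every topic schema turns "please don't break consumers" from a plea into a gate.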

Historical backfill

Backfill is where many projects lose months. Historical data often lacks the event fidelity of current streams. You may need separate logic for history and forward processing, with reconciliation proving the stitched result. Be careful: this can create one-off logic that survives forever if nobody plans its retirement.
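The stitching itself can be kept brutally simple, which makes the seam easy to reconcile and easy to retire later. The record shape and cutover rule below are assumptions for illustration: history owns everything before an explicit watermark, the stream owns everything after.

```python
CUTOVER = "2024-01-01"  # explicit watermark; ISO dates compare lexicographically

history = [  # reconstructed from legacy snapshots, coarse fidelity
    {"id": 1, "ts": "2023-11-02", "status": "closed"},
    {"id": 2, "ts": "2023-12-15", "status": "open"},
]
stream = [  # event-native records, including a pre-cutover duplicate
    {"id": 2, "ts": "2023-12-15", "status": "open"},
    {"id": 2, "ts": "2024-02-01", "status": "closed"},
    {"id": 3, "ts": "2024-03-10", "status": "open"},
]

def stitch(history, stream, cutover):
    """History owns everything before the cutover; the stream owns the rest.
    Pre-cutover stream records are dropped so the two sources never overlap."""
    before = [r for r in history if r["ts"] < cutover]
    after = [r for r in stream if r["ts"] >= cutover]
    return before + after

stitched = stitch(history, stream, CUTOVER)
assert len(stitched) == 4                      # duplicate pre-cutover record excluded
assert all(r["ts"] < CUTOVER for r in stitched[:2])
```

The value of an explicit watermark is that reconciliation can target the seam directly, and the one-off history logic has an obvious retirement condition: once no consumer reads across the cutover, it can go.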

Tradeoffs

The architecture is effective, but it is not free.

Cost

Running two pipelines, two storage models, and a reconciliation layer increases infrastructure and labor cost. This is the price of reducing business risk. If the enterprise cannot tolerate the temporary cost, it probably cannot tolerate the migration risk either.

Speed

Reconciliation slows cutover. That is its job. Teams wanting a rapid migration often frame it as bureaucracy. Usually what they really want is permission not to look too closely.

Complexity

A dual-run architecture is inherently more complex than a big-bang replacement. There are more moving parts, more dashboards, more rules, and more exceptions. Complexity is acceptable when it is temporary and purposeful. It becomes dangerous when organizations fail to retire old paths.

Autonomy versus alignment

Domain teams owning semantics improves clarity, but enterprise aggregates still need shared definitions where required. This balance is hard. Too much autonomy and the platform fragments. Too much central control and every domain waits for one canonical model that never quite arrives.

Failure Modes

The common failure modes are depressingly consistent.

Reconciliation reduced to row counts

This is the classic error. Counts match, migration declared successful, executive metrics later drift. Technical parity is necessary and insufficient.

No explicit semantic owner

When a mismatch appears, nobody can decide whether it is a defect or a valid change. The migration stalls in endless meetings. If a domain has no semantic owner, it is not ready to migrate.

Big-bang cutover after superficial dual run

A short side-by-side test rarely covers seasonal, financial, and exception-driven behavior. The first real reconciliation happens in production under pressure. That is not validation. That is gambling.

Event model assumed to equal business truth

Kafka gives you transport and ordering properties within limits. It does not magically define domain meaning. Teams often confuse “we have all the events” with “we have the right business state.”

Legacy quirks copied blindly

The opposite problem also appears. Teams preserve every historical oddity in the new platform because reconciliation says numbers must match. This creates modern infrastructure carrying ancient mistakes. Not every difference should be eliminated; some should be governed and embraced.

Never-ending dual pipelines

Temporary architectures have a terrible habit of becoming permanent. If retirement criteria are vague, the enterprise pays double forever. Every dual-run program needs explicit exit conditions.

When Not To Use

This approach is powerful, but it is not universal.

Do not use full-blown dual-pipeline reconciliation if:

  • the data domain is low criticality and easy to reconstruct
  • the existing platform has little business logic and mostly stores raw replicated data
  • the migration scope is a contained workload rather than an enterprise reporting foundation
  • the old system is already untrusted and there is no value in proving parity against it
  • timelines are so constrained that partial rebuild with acceptance of changed outputs is the actual business decision

In those cases, simpler migration tactics may be better: direct rebuild, consumer-by-consumer rewrite, or bounded historical reload without prolonged dual operation.

Similarly, do not force event streaming into domains that do not need it yet. If a batch-oriented finance process closes daily and source systems are not event-mature, a well-governed batch migration may be more sensible than performative streaming.

Architecture should fit the forces. Not the fashion.

Related Patterns

Several patterns complement this approach.

Strangler Fig Pattern

The obvious companion. Replace the old platform incrementally, consumer by consumer and domain by domain, rather than by wholesale switch. For data, this means coexistence and managed redirection.

Anti-Corruption Layer

Useful when the legacy platform encodes concepts that should not leak into the new domain model. The anti-corruption layer translates old semantics without infecting the target architecture.

Data Mesh, carefully applied

Domain-owned data products align well with migration by bounded context. But during migration, federated ownership must still sit inside enterprise controls for reconciliation, lineage, and certification. Data mesh without migration governance becomes semantic drift at scale.

Event Sourcing, selectively

For some domains, especially where state transitions matter deeply, an event log can improve traceability and replay. But event sourcing is not a universal migration answer. Reconstructing historical truth from incomplete or inconsistent events can be harder than teams expect.

Change Data Capture

Often the most practical bridge from legacy operational systems into both old and new platforms. CDC supports progressive strangling well, provided ordering, deletes, and schema changes are handled rigorously.

Summary

Data platform migration is not a data movement exercise. It is a truth-preservation exercise conducted under changing machinery.

That is why reconciliation is indispensable. Not because enterprises enjoy extra controls, but because migration creates two simultaneous interpretations of the business, and somebody must decide where they differ, why they differ, and whether the new one is good enough to trust.

The architecture that works is usually unfashionable in the best possible way: dual pipelines, explicit reconciliation, domain ownership of semantics, progressive strangler cutover, and disciplined retirement of legacy logic. Kafka can help. Microservices can help. Cloud lakehouses can help. None of them remove the need to confront meaning.

The most successful migrations understand one simple thing: data platforms do not fail when bytes go missing. They fail when business meaning becomes negotiable.

So build the migration around that truth. Compare relentlessly. Cut over carefully. Let domains define semantics. Preserve rollback until evidence exists. And when the old and new platforms disagree, do not ask first which table is wrong.

Ask what the business meant all along.

Frequently Asked Questions

What is a data mesh?

A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.

What is a data product in architecture terms?

A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.

How does data mesh relate to enterprise architecture?

Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.