Data Platform Modernization Requires Reconciliation


Modernizing a data platform is usually sold as a technology refresh. New lakehouse. New streaming backbone. Better query engine. Cloud elasticity. AI readiness. The slides are always clean.

Reality is not.

A serious modernization is not a database swap. It is a confrontation with the organization’s own semantics. It is a forced audit of every quiet compromise encoded over ten years in ETL jobs, batch windows, spreadsheet fixes, lookup tables, late-night stored procedures, and “temporary” exception handling that became policy. The old platform may be slow, expensive, and unloved, but it knows the business in ways the architects often do not. That is the trap. People think they are replacing infrastructure. They are really rediscovering the enterprise.

And that is why data platform modernization requires reconciliation.

Not as a nice-to-have. Not as a reporting utility on the side. As a first-class architectural capability.

If you run the new platform beside the old one, and you should, then you need a disciplined way to compare outputs, explain differences, classify defects, and build trust. Without reconciliation, dual-run becomes theater. With it, dual-run becomes an engineering instrument. One tells you “we launched.” The other tells you whether you were right to.

This article argues for a reconciliation-centric modernization approach: progressive migration, domain-aware validation, and dual-run architecture where the old and new worlds coexist long enough for the business to believe the answer. I will focus on enterprise settings where Kafka, microservices, event streams, and mixed batch/stream processing matter. I will also be blunt about tradeoffs, failure modes, and when this pattern is simply too expensive or too complicated to justify.

Context

Large enterprises rarely have a single “data platform.” They have layers of historical intent.

There is usually a core operational estate: ERP, CRM, policy systems, order management, finance ledgers, warehouse systems, digital channels, and a menagerie of acquired applications. Around that sits an integration estate: MQ, ESB, FTP, change data capture, APIs, hand-built connectors, perhaps Kafka if the organization got serious in the last decade. Then the analytical estate: enterprise warehouse, reporting marts, notebooks, data science stores, semantic layers, dashboard tooling, regulatory extracts. None of this was designed in one sitting. It accreted.

Modernization arrives because the accretion starts to hurt. Costs rise. Release cycles slow. Data latency becomes unacceptable. Schema changes are frightening. Analysts distrust the numbers. Business teams create shadow pipelines. Regulators ask harder questions. The current platform becomes both critical and brittle, which is the worst possible combination.

At that point, leaders often ask for a target architecture. That is reasonable, but insufficient. A target architecture without a migration architecture is just a fantasy with boxes.

The central challenge is this: the current platform encodes business meaning, but imperfectly and opaquely. The new platform aims to encode that meaning more cleanly, with clearer ownership, better lineage, and modern operating characteristics. Yet the moment you change representations, timings, grain, deduplication rules, event models, keys, or adjustment logic, numbers move. Some movement is a bug. Some is a correction. Some is the exposure of ambiguity that was always there.

You cannot manage that by intuition. You need reconciliation.

Problem

The hard part of modernization is not moving bytes. It is preserving, clarifying, and sometimes intentionally changing business semantics without losing organizational confidence.

Consider a simple metric: “active customer.” On one platform it may mean at least one billable transaction in the last 90 days, excluding internal entities and dormant household records, computed nightly from a curated warehouse table. On a new event-driven platform, a product team may define it as a customer with a recent interaction event and a non-closed account state, materialized near real-time from Kafka streams. Both sound plausible. Both can be defended in meetings. Both can be disastrously wrong if used interchangeably.
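The divergence is easy to see in code. The sketch below encodes the two "active customer" rules from the paragraph above against the same records; all field names, dates, and thresholds are illustrative assumptions, not from any real system.

```python
from datetime import date, timedelta

# Hypothetical customer records; field names are illustrative.
customers = [
    {"id": "C1", "last_billable_txn": date(2024, 5, 1), "last_interaction": date(2024, 5, 20),
     "account_state": "open", "internal": False},
    {"id": "C2", "last_billable_txn": date(2024, 1, 3), "last_interaction": date(2024, 5, 18),
     "account_state": "open", "internal": False},
]

AS_OF = date(2024, 6, 1)

def active_legacy(c):
    # Warehouse rule: billable transaction in the last 90 days, internal entities excluded.
    return not c["internal"] and (AS_OF - c["last_billable_txn"]) <= timedelta(days=90)

def active_modern(c):
    # Stream rule: recent interaction event and a non-closed account state.
    return c["account_state"] != "closed" and (AS_OF - c["last_interaction"]) <= timedelta(days=30)

# C2 interacted recently but has not billed in 90 days: the two platforms disagree.
for c in customers:
    print(c["id"], active_legacy(c), active_modern(c))
```

Both rules are internally consistent; they simply answer different questions. That is exactly the kind of disagreement reconciliation must surface before anyone uses the metrics interchangeably.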

This is the modernizer’s burden: legacy systems often hide semantics inside code paths no one owns explicitly. New platforms surface the chance to model the domain properly, but they also surface disagreement.

Dual-run is the sensible response. Keep old and new platforms running in parallel long enough to compare outputs and reduce risk. But parallel operation by itself creates another problem: two versions of the truth. If you cannot systematically reconcile them, stakeholders lose patience. Finance says one thing, operations says another, product says “the new one looks better,” and executives stop listening.

A dual-run without reconciliation tends to fail in one of three ways:

  1. The comparison is too shallow. Teams compare a handful of top-line aggregates and declare victory, while entity-level mismatches pile up underneath.

  2. The comparison is too technical. Architects validate row counts, latency, and schema parity, while the business asks why refunds shifted by 2.3% in a regulatory report.

  3. The comparison is too late. Reconciliation is treated as final acceptance rather than a continuous migration discipline, so defects emerge after cutover pressure peaks.

The deeper issue is that data modernization crosses bounded contexts. Sales, billing, fulfillment, claims, finance, risk, customer service, and digital channels all use the same nouns differently. A customer is not the same thing everywhere. Neither is an order, shipment, claim, contract, exposure, or account. Domain-driven design matters here, not as ceremony, but as survival.

Forces

Several forces pull against each other in a modernization program.

1. Continuity versus correction

The business wants continuity. “Make the new platform match the old one.” But modernization is often motivated by defects in the old one. You may discover duplicate joins, broken slowly changing dimensions, timing distortions, or hidden manual adjustments. If you match the old platform perfectly, you preserve its mistakes. If you correct too much too early, the migration looks like failure.

This is where reconciliation must distinguish equivalence from explainable variance. They are not the same.

2. Domain ownership versus central platform control

A central data team can build shared ingestion, storage, catalog, governance, and observability. It cannot safely invent the meaning of policy lapse, net revenue, stock on hand, or patient episode. Those belong in domain contexts.

Yet if every domain team defines everything independently, the enterprise loses coherence. The answer is not centralization or total federation. It is explicit contracts around data products and semantics, with reconciliation acting as the referee during migration.

3. Batch heritage versus event-driven ambition

Most legacy platforms are batch-first. Many modern targets are event-first, often built on Kafka, CDC, stream processing, and microservices publishing domain events. This introduces timing differences, out-of-order events, replay behavior, late-arriving facts, and eventual consistency. The old nightly report may “agree” with itself because all timing has been flattened. The new platform may be more correct and less synchronized.

That does not make migration easier. It makes reconciliation more subtle.

4. Executive urgency versus semantic patience

Leadership wants a cutover date. Semantics do not care about deadlines.

The organization may tolerate a quarter of system instability. It will not tolerate a quarter of unexplained finance discrepancies. Trust evaporates faster than platform debt accumulates.

5. Cost versus confidence

Dual-run is expensive. Reconciliation is more expensive. You process data twice, store it twice, observe it twice, and argue about it three times. But confidence has a price too. Enterprises that skip reconciliation usually pay through audit findings, broken incentives, lost adoption, and emergency rollback plans.

Solution

The core pattern is straightforward:

Build the new platform alongside the old one. Feed both from shared or equivalent sources. Reconcile outputs continuously at multiple semantic levels. Migrate domain by domain using a progressive strangler approach. Cut over only when discrepancies are understood, bounded, and accepted.

This sounds obvious. It is not common enough.

A good reconciliation-centric modernization has five properties.

1. It is domain-aware

Reconciliation is not just technical comparison. It aligns domain concepts across bounded contexts. For each important business object and metric, define:

  • canonical business meaning
  • source of authority
  • transformation rules
  • timing rules
  • tolerance thresholds
  • known legacy defects
  • acceptable differences during migration

That is DDD applied to data migration. Ubiquitous language becomes executable architecture.
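One way to make such definitions executable is to capture them as structured records the reconciliation engine can read. Here is a minimal sketch, assuming a hypothetical `SemanticContract` shape whose fields mirror the checklist above; nothing here is a standard schema.

```python
from dataclasses import dataclass, field

# A sketch of an executable semantic definition; field names are illustrative.
@dataclass
class SemanticContract:
    concept: str
    meaning: str                    # canonical business meaning
    authority: str                  # system or domain that owns the truth
    transformation: str             # where and how the value is derived
    timing: str                     # settlement / cutoff rules
    tolerance_pct: float            # acceptable variance during migration
    known_legacy_defects: list = field(default_factory=list)
    accepted_differences: list = field(default_factory=list)

active_customer = SemanticContract(
    concept="active_customer",
    meaning="Customer with >=1 billable transaction in the trailing 90 days",
    authority="billing domain",
    transformation="nightly batch over curated billing table",
    timing="compare as of daily close, not intraday",
    tolerance_pct=0.5,
    known_legacy_defects=["dormant household records double-counted pre-2019"],
)
```

The point is not the dataclass; it is that the ubiquitous language now has a machine-readable home that both platforms, and the reconciliation engine, can reference.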

2. It compares at multiple levels

You need at least three layers of validation:

  • Pipeline integrity: ingestion success, completeness, schema drift, freshness, event lag
  • Entity reconciliation: record-by-record or key-by-key comparisons for orders, claims, accounts, shipments, invoices
  • Business outcome reconciliation: aggregates, KPIs, financial balances, operational counts, regulatory reports

If you only do the first, you prove the pipes work. If you only do the last, you miss where they broke.

3. It treats differences as structured work

Every discrepancy needs classification:

  • expected due to design change
  • expected due to timing
  • legacy defect exposed
  • new platform defect
  • source data issue
  • unresolved semantic mismatch

This sounds bureaucratic until the third steering committee where everyone asks why “the variance moved again.”
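A triage rule built on that classification can be very small. The sketch below is one possible policy, under the assumption that design-change and timing variances are pre-approved while everything else outside tolerance gets investigated; the enum values come straight from the list above.

```python
from enum import Enum

class VarianceClass(Enum):
    DESIGN_CHANGE = "expected: design change"
    TIMING = "expected: timing"
    LEGACY_DEFECT = "legacy defect exposed"
    NEW_DEFECT = "new platform defect"
    SOURCE_ISSUE = "source data issue"
    SEMANTIC_MISMATCH = "unresolved semantic mismatch"

def triage(variance_pct, tolerance_pct, classification):
    """Accept a discrepancy only if it is within tolerance or explicitly expected."""
    within = abs(variance_pct) <= tolerance_pct
    expected = classification in (VarianceClass.DESIGN_CHANGE, VarianceClass.TIMING)
    return "accept" if (within or expected) else "investigate"
```

Whatever the exact policy, the key property is that every variance carries a class and an outcome, so the steering committee sees a ledger rather than a mystery.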

4. It supports progressive strangler migration

Do not replace the whole estate in one cut. Strangle by domain, capability, or consumption path. Start with low-regret use cases. Keep producing comparable outputs. Move readers before writers where possible. Retire legacy components only when their semantic obligations are discharged.

5. It is observable

A modernization without observability is just hope with cloud spend.

You need lineage, data quality signals, schema versioning, replay capability, auditability, and dashboards showing reconciliation status by domain. The program should know, every day, what agrees, what does not, and whether the variance is shrinking.

Architecture

A typical reconciliation-centric target architecture has four planes: source capture, processing, reconciliation, and consumption.

The architecture matters less than the responsibilities.

Source capture

Use CDC, event publishing, APIs, and batch extraction pragmatically. Enterprises rarely have the luxury of pure event sourcing. If core systems can publish reliable domain events to Kafka, excellent. If not, use CDC from transactional databases and treat it honestly: it is a change stream, not always a business event stream.

This distinction matters. A row update saying status = SHIPPED is not the same as a domain event that says “shipment confirmed” with a business timestamp and causal context. Reconciliation often fails because architects blur those two.
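The gap between the two can be made concrete. Below is a sketch of deriving a synthetic domain event from a CDC change record; the payload shape is illustrative (real CDC formats vary by tool), and the derivation deliberately marks what CDC cannot tell you.

```python
# A hypothetical CDC row change; field names are illustrative, not a real CDC format.
cdc_change = {
    "table": "orders",
    "op": "UPDATE",
    "before": {"order_id": "O-42", "status": "PACKED"},
    "after":  {"order_id": "O-42", "status": "SHIPPED"},
    "commit_ts": "2024-06-01T03:14:07Z",   # database commit time, not business time
}

def derive_event(change):
    """Synthesize a domain event from a status transition.

    This is a lossy inference: the commit timestamp is not the business
    timestamp, and causal context (who shipped it, from where) is absent.
    """
    if (change["table"] == "orders" and change["op"] == "UPDATE"
            and change["before"]["status"] != "SHIPPED"
            and change["after"]["status"] == "SHIPPED"):
        return {
            "event_type": "ShipmentConfirmed",
            "order_id": change["after"]["order_id"],
            "observed_at": change["commit_ts"],   # best available, flagged as observed
            "business_time": None,                # unknowable from CDC alone
        }
    return None
```

Treating that `None` business time honestly, rather than quietly substituting the commit timestamp, is precisely the semantic humility the text calls for.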

Processing and domain data products

The modern platform should organize around domains, not just technical zones. Raw, refined, and curated layers are fine, but they are not a business architecture. For each domain, produce data products with explicit contracts: keys, grain, history policy, quality expectations, SLA/SLO, ownership, and semantic definitions.

In a Kafka-heavy environment, stream processing can materialize operational views and near-real-time aggregates. Batch still has a place for heavy historical transformations, restatements, and large-scale reconciliation windows. Mature enterprises end up with both.

Reconciliation engine

This is the star of the show. It should not be an ad hoc SQL folder.

A proper reconciliation capability usually includes:

  • comparison rules by dataset and metric
  • key mapping and survivorship logic
  • windowing rules for late or out-of-order data
  • tolerance handling
  • variance classification
  • drill-down from aggregate to entity
  • immutable audit trail
  • workflow integration for triage and approval

Think of it as a test harness for business truth.
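The core comparison loop of such a harness is simple; the discipline is in everything around it. This sketch shows aggregate comparison with tolerance handling plus drill-down from aggregate to entity, two of the capabilities listed above. Dataset shapes and the tolerance value are assumptions for illustration.

```python
def reconcile(old_rows, new_rows, key, value, tolerance_pct=0.5):
    """Compare two datasets at aggregate level, then drill down to keys.

    A sketch: a real engine adds windowing, survivorship, classification,
    and an immutable audit trail.
    """
    old_by_key = {r[key]: r[value] for r in old_rows}
    new_by_key = {r[key]: r[value] for r in new_rows}

    old_total, new_total = sum(old_by_key.values()), sum(new_by_key.values())
    variance_pct = 100.0 * (new_total - old_total) / old_total if old_total else 0.0

    # Drill-down: which entities explain the aggregate difference?
    mismatches = []
    for k in old_by_key.keys() | new_by_key.keys():
        o, n = old_by_key.get(k), new_by_key.get(k)
        if o != n:
            mismatches.append({"key": k, "old": o, "new": n})

    return {
        "variance_pct": round(variance_pct, 3),
        "within_tolerance": abs(variance_pct) <= tolerance_pct,
        "mismatches": mismatches,
    }

old = [{"inv": "A", "amt": 100.0}, {"inv": "B", "amt": 50.0}]
new = [{"inv": "A", "amt": 100.0}, {"inv": "B", "amt": 51.0}]
result = reconcile(old, new, key="inv", value="amt")
# The aggregate variance is small, but the drill-down still names invoice B.
```

Even this toy version demonstrates why aggregate-only comparison is dangerous: a near-zero top-line variance can hide offsetting entity-level errors that only the drill-down exposes.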

Consumption

For a period, both legacy and modern consumers coexist. Some reports continue on the old platform while data science, operational APIs, or selected BI use the new one. This is healthy. It reduces blast radius. But only if there is transparency over which outputs are authoritative at each stage.

That designation of authority is part governance, part architecture, part political courage.

Dual-run architecture diagram

A useful way to think about dual-run is that the old and new platforms are less like primary and backup, and more like two clocks in a train station. The question is not whether they both tick. The question is whether they tell the same time, and if not, who can explain the drift.

Notice what is being reconciled: not just tables, but outputs. The enterprise does not consume parquet files. It consumes decisions, dashboards, invoices, reserve calculations, and customer experiences.

Migration Strategy

This kind of modernization should follow a progressive strangler pattern.

Start from the edge. Migrate capabilities incrementally. Build confidence through repeated reconciliation cycles. Retire legacy elements only after they are behaviorally obsolete, not merely technologically outdated.

Step 1: Establish semantic inventory

Before building too much, identify critical business objects, metrics, and reports. Map where their semantics live today. You will find logic in ETL, views, report calculations, application code, and sometimes user procedures. This is painful, but there is no shortcut.

Create bounded-context maps. A policy in underwriting, billing, and claims may share an identifier but not the same lifecycle meaning. An order in sales and fulfillment may diverge after allocation. Reconciliation requires knowing these seams.

Step 2: Build common ingestion and observability

Create the capture layer and baseline observability first. If Kafka is in scope, define event contracts, schema evolution rules, replay strategy, and retention policies. If CDC is the main source, define table-level semantics and synthetic event derivation carefully.

Do not wait for “full target architecture” completion. You need the new pipeline to begin receiving production-shaped data early.

Step 3: Select migration slices

Choose slices that are meaningful but bounded. Good slices tend to have:

  • clear source authority
  • manageable downstream dependencies
  • measurable outcomes
  • domain ownership available
  • low regulatory exposure for first attempts

For example, migrate customer interaction analytics before statutory finance, or shipment visibility before general ledger reconciliation.

Step 4: Run dual outputs

For each slice, produce the same or equivalent outputs in old and new platforms. Run them in parallel long enough to see normal variation: period-end, weekend behavior, retry storms, delayed feeds, corrections, and seasonal spikes.

A single “green week” proves almost nothing.

Step 5: Reconcile continuously

Automate comparison. Triage variances. Feed defects back into transformation logic, source mappings, or semantic definitions. Maintain an explicit ledger of accepted differences. This ledger becomes invaluable during audit and executive review.
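The accepted-difference ledger can itself be a small, auditable artifact. The sketch below is one possible entry shape, with a content hash to make entries tamper-evident for audit review; all field names and the example figures are illustrative assumptions.

```python
import hashlib
import json

# A minimal accepted-difference ledger entry; fields are illustrative.
def ledger_entry(dataset, description, variance_pct, approver, approved_on):
    entry = {
        "dataset": dataset,
        "description": description,
        "variance_pct": variance_pct,
        "approver": approver,
        "approved_on": approved_on,
    }
    # Content hash over the canonical JSON makes the entry tamper-evident.
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()[:16]
    return entry

ledger = [
    ledger_entry(
        dataset="net_sales_daily",
        description="Same-day returns netted in legacy, separate events on the new platform",
        variance_pct=0.7,
        approver="finance data owner",
        approved_on="2024-06-01",
    )
]
```

Once a difference is in the ledger with an approver attached, it stops resurfacing as a "new" issue in every review cycle.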

Step 6: Migrate consumers gradually

Move consumers in an order that reduces operational risk:

  • internal analysts
  • selected dashboards
  • downstream APIs
  • operational decisions
  • regulated and financial outputs

Writers and transactional systems usually move later, if at all. Many modernizations leave operational systems in place while replacing the analytical and integration substrate around them.

Step 7: Decommission deliberately

Legacy decommissioning is where many enterprises become careless. Do not turn off the old platform just because the new one has run for 90 days. Turn it off when:

  • semantic parity or accepted variance is documented
  • downstream dependencies are known and migrated
  • replay and audit capabilities exist on the new side
  • support teams are trained
  • rollback options are realistic, not fictional

Enterprise Example

Consider a global retailer modernizing from a legacy enterprise data warehouse into a cloud lakehouse with Kafka-based event ingestion.

The old world had nightly batch feeds from POS, e-commerce, inventory, pricing, and ERP. Revenue reporting was “stable” in the way old industrial machinery is stable: noisy, expensive, and dangerous to touch. Shrinkage calculations, return handling, and promotion attribution were spread across ETL layers built over twelve years. Inventory accuracy was poor, but everyone had learned which reports not to trust on Mondays.

The new target introduced domain-oriented data products for sales, inventory, fulfillment, and customer. Store and digital events were published into Kafka where possible; older systems were captured through CDC and file drops. Stream processing created near-real-time stock and order views, while batch jobs handled daily financial close and historical restatements.

The team made one smart decision early: they treated reconciliation as a product.

They built a reconciliation service that compared:

  • sales transactions by channel and store
  • net revenue by promotion type
  • inventory position by SKU-location-day
  • return rates by reason code
  • financial postings against ERP close numbers

At first, variances were ugly. Not because the new platform was broken, though parts were. The deeper issue was semantic drift.

A return processed on the same day as the sale was netted in the warehouse but represented as separate events in Kafka streams. Promotional bundles were allocated differently between product and finance teams. Inventory transfers were timestamped at dispatch in one system and at receipt in another. The old warehouse silently dropped some duplicate barcode scans; the new platform preserved them until deduplication rules were made explicit.

Without reconciliation, the program would have descended into tribal warfare. With it, they could say: this 1.8% variance in net sales is composed of 0.7% timing difference, 0.5% legacy defect, 0.4% new transformation bug, and 0.2% unresolved policy on bundle allocation. That is an adult conversation.
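The arithmetic of that "adult conversation" is worth writing down: the decomposition only works because the components are required to sum to the total, leaving no unowned residue. Using the figures from the example above:

```python
# Decomposing the retailer's 1.8% net-sales variance into explained components.
components = {
    "timing difference": 0.7,
    "legacy defect": 0.5,
    "new transformation bug": 0.4,
    "unresolved bundle-allocation policy": 0.2,
}

total_variance = 1.8
explained = sum(components.values())
unexplained = round(total_variance - explained, 3)

# Fully decomposed: every part of the variance has an owner and a class.
assert unexplained == 0.0
```

When the unexplained remainder is not zero, that remainder becomes its own named work item rather than a vague sense that "the numbers are off."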

They migrated store operations dashboards first, then merchandising analytics, then replenishment optimization feeds, and only later board-level financial reporting. Cutover happened by domain and use case, not by platform slogan. The old warehouse was decommissioned over eighteen months. Not fast. But the business believed the new numbers.

That belief was the real deliverable.

Operational Considerations

A reconciliation-centric architecture creates new operational burdens. Ignore them and the design collapses under its own virtue.

Data contracts and schema evolution

Kafka and microservices make data movement easier and semantic drift faster. Event schemas evolve. Optional fields appear. Meanings change while field names remain. You need schema registries, compatibility policies, and version-aware consumers. More importantly, you need business review for semantic changes, not just technical review.

Late and out-of-order data

Streaming systems lie to anyone who expects tidy chronology. Reconciliation must account for late arrivals, retries, duplicates, and replay. Windowing rules matter. Cutoff policies matter. Business users need to know whether a number is preliminary, settled, or restatable.
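A concrete cutoff policy might look like the sketch below: events belong to a business day by business timestamp, but are admitted until an agreed grace period after the cutoff. Timestamps, the grace window, and the event shape are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Comparing a nightly batch figure against a stream-built figure only makes
# sense inside an agreed settlement window.
CUTOFF = datetime(2024, 6, 1, 23, 59, 59)   # business-day cutoff
GRACE = timedelta(hours=6)                  # allowance for late-arriving events

events = [
    {"id": "E1", "business_ts": datetime(2024, 6, 1, 22, 30), "arrived": datetime(2024, 6, 1, 22, 31)},
    {"id": "E2", "business_ts": datetime(2024, 6, 1, 23, 50), "arrived": datetime(2024, 6, 2, 3, 0)},   # late
    {"id": "E3", "business_ts": datetime(2024, 6, 2, 0, 10),  "arrived": datetime(2024, 6, 2, 0, 11)},  # next day
]

def settled_for_day(events, cutoff, grace):
    """Events belonging to the business day, admitted until cutoff + grace."""
    return [e for e in events
            if e["business_ts"] <= cutoff and e["arrived"] <= cutoff + grace]

# E2 arrived late but within grace, so it settles into June 1; E3 does not belong.
day_events = settled_for_day(events, CUTOFF, GRACE)
```

Until the grace period closes, the day's number is preliminary; after it closes, the number is settled and any further arrivals become restatements. Making those three states explicit is what lets business users trust a figure that moved overnight.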

Identity resolution

Key mismatches are a classic source of false variance. MDM, survivorship rules, crosswalks, and identifier lifecycle management must be treated as architecture, not cleanup. If customer, policy, product, or location identities drift, reconciliation devolves into noise.

Audit and traceability

Especially in finance, healthcare, insurance, and regulated industries, reconciliation results are part of evidence. You need traceability from source event to transformed record to report line. “The pipeline said green” is not audit evidence.

Cost control

Dual-run doubles more than compute. It doubles storage, observability, support burden, and organizational attention. Set time-boxed goals for each migration slice. If a domain remains in indefinite dual-run, the architecture has failed to force a decision.

Tradeoffs

This pattern is powerful, but let us not pretend it is free.

Pros

  • reduces cutover risk
  • surfaces hidden semantics
  • builds stakeholder trust
  • supports incremental migration
  • provides auditability and evidence
  • helps distinguish defects from intentional changes

Cons

  • high operational and engineering cost
  • slower apparent delivery
  • potential for prolonged ambiguity
  • heavy demand on domain experts
  • increased platform complexity during transition

The biggest tradeoff is simple: reconciliation buys confidence by spending time.

In some enterprises, that is wise. In others, especially where the old platform is lightly used or the business semantics are genuinely simple, it may be overkill.

Failure Modes

The pattern itself can fail. Common failure modes include:

1. Technical reconciliation with no domain context

Teams compare schemas, row counts, and checksums while business metrics quietly diverge. This creates false confidence.

2. Infinite dual-run

No one defines acceptance criteria, so both platforms run forever. Costs mount. Teams support two estates. Trust declines rather than rises.

3. Over-centralized semantics

A central architecture team defines “canonical” models detached from real domain language. The platform becomes elegant and wrong.

4. Under-governed event streams

Kafka topics proliferate without clear ownership or contract discipline. Reconciliation chases moving targets.

5. Ignoring timing semantics

A nightly warehouse and a real-time platform will differ during the day. If you do not define comparison windows and settlement rules, every variance looks like a defect.

6. No explicit accepted-difference ledger

Some differences are intentional. If they are not documented and approved, they keep resurfacing as “new” issues.

When Not To Use

Do not use this pattern blindly.

You may not need full reconciliation-centric dual-run if:

  • the platform is small and low criticality
  • there are few downstream consumers
  • the domain semantics are simple and well documented
  • a one-time migration with short validation is enough
  • the old platform is so broken that matching it has little value
  • the business can tolerate temporary reporting discontinuity

It is also a poor fit where source systems are wildly unstable and no baseline truth exists. Reconciliation assumes there is enough consistency to compare. If the enterprise is still changing core business rules weekly, first stabilize the domain contracts. Otherwise you are measuring a moving coastline with a ruler.

Related Patterns

Several adjacent patterns are relevant.

Strangler Fig Migration

Incrementally replace legacy capabilities rather than big-bang rewrites. Essential for reducing risk and proving value slice by slice.

Change Data Capture

Useful where source applications cannot emit rich domain events. But CDC should be handled with semantic humility.

Event-Driven Architecture

Helpful for timely, decoupled propagation of business changes. Introduces replay, ordering, and consistency concerns that reconciliation must address.

Data Products and Data Mesh

Useful when domain ownership is real. Dangerous when used as branding for unmanaged fragmentation. Reconciliation provides a cross-domain confidence mechanism.

Data Contracts

Critical for stable evolution of schemas and meanings across producers and consumers.

Observability and Lineage

Necessary to explain discrepancies and support auditability.

Summary

A data platform modernization is not won by standing up a new stack. It is won when the enterprise trusts the new answers.

That trust does not come from architecture diagrams alone, though diagrams help. It comes from the disciplined ability to run old and new side by side, compare them honestly, explain the differences, and migrate in a way that respects domain semantics instead of bulldozing them.

Reconciliation is the mechanism that turns dual-run from an expensive comfort blanket into a migration strategy.

Use domain-driven thinking to define what the business objects and metrics actually mean. Use progressive strangler migration to move capability by capability, not in one heroic leap. Use Kafka, CDC, microservices, and modern data tooling where they fit, but do not confuse transport with truth. Build reconciliation at the levels that matter: pipeline, entity, and business outcome. Make discrepancies visible. Classify them. Learn from them. Force decisions.

Because in the end, the old platform is not merely a system to retire. It is a memory palace of business decisions, mistakes, patches, and semantics. The new platform must earn the right to replace it.

And that earning happens in reconciliation.

Frequently Asked Questions

What is a data mesh?

A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.

What is a data product in architecture terms?

A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.

How does data mesh relate to enterprise architecture?

Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.