There are few things more dangerous in enterprise architecture than a number everyone believes and nobody can explain.
A balance is off by 0.7%. A customer record exists in three systems with three different “truths.” Yesterday’s settlement file says one thing, the event log says another, and the dashboard—always eager, never humble—claims everything is green. This is the quiet mess behind modern data estates. Not the glamorous side of architecture. Not the conference-slide side. The side where finance waits, operations escalates, and architects discover that “near real time” and “correct” are very different promises.
That is where reconciliation lives.
And once you start talking about reconciliation, you immediately hit a hard architectural question: do you reconcile in a batch lane, where records are collected, compared, corrected, and published on a schedule? Or do you reconcile in a streaming lane, where discrepancies are detected and handled continuously as events move through the estate?
This is not a technology choice disguised as architecture. It is a domain choice. It is about how the business experiences truth, lateness, finality, and correction. It is about whether a transaction is considered “done” when it is emitted, when it settles, when a downstream system accepts it, or when the monthly close passes without argument.
Good architecture starts there. Not with Kafka. Not with Spark. Not with whether your cloud vendor has a shiny managed service this quarter. Start with the semantics of the business, because reconciliation is really the discipline of making domain promises explicit.
Context
Most enterprises do not have one data platform. They have layers of history pretending to be a platform.
There is the operational core: order management, payments, policy administration, claims, billing, ERP, CRM. There are microservices around the edges, each with its own database and a healthy sense of autonomy. There are SaaS platforms with export APIs. There are warehouses and lakehouses collecting facts after the fact. And somewhere in this landscape, business users assume there is a coherent answer to simple questions:
- How many orders shipped yesterday?
- Which payments are authorized but not settled?
- Which customer addresses are canonical?
- Which invoices are missing from the ledger?
- What is the exposure at this moment, not last night?
Reconciliation exists because distributed systems create multiple valid but incomplete perspectives. The order service may say “completed,” the payment service may say “captured,” the ledger may not yet reflect the posting, and the fulfillment platform may have retried the same event twice. Each system is locally sensible. The enterprise view is not.
Historically, batch reconciliation was the answer. End-of-day files. Nightly ETL. Control totals. Compare counts and sums. Produce exception reports. This approach survives because it matches how many enterprises actually work: periodic processes, governed checkpoints, and human review.
Streaming changed the ambition. Instead of waiting until the end of the day to discover mismatches, teams now want continuous visibility. Events flow through Kafka, services react in real time, stateful processors detect missing joins or amount mismatches, and reconciliation becomes an always-on capability rather than a nightly ritual.
But ambition is cheap. Operations are expensive. A streaming reconciliation architecture can reduce detection latency dramatically, but it also introduces new questions around ordering, event completeness, late arrivals, duplicate handling, replay, semantic versioning, and the boundary between “provisional truth” and “final truth.”
That is why the comparison between batch lane and streaming lane matters. They are not merely different pipelines. They are different operating models for truth.
Problem
The core problem is deceptively simple: how do we ensure that business facts represented across multiple systems remain consistent enough for the enterprise to operate safely?
In practice, that breaks into several harder questions:
- How do we know two records represent the same business event?
- When is it valid to compare them?
- What counts as a mismatch: missing record, value difference, timing difference, or semantic disagreement?
- Can mismatches be corrected automatically, or do they require investigation?
- How do we preserve an audit trail of what was known, when, and why it changed?
This is where many implementations go wrong. They treat reconciliation as a low-level data comparison exercise—row counts, checksums, and field-by-field equality. Useful, yes. Sufficient, no.
Reconciliation is about domain semantics, not just record mechanics.
A payment authorization and a settlement are not expected to match at the same moment. An order total may differ from the invoice total for valid reasons such as tax recalculation or partial fulfillment. A customer record may diverge temporarily across systems because one domain owns identity, another owns communication preferences, and a third owns credit risk. If you compare without understanding ownership and lifecycle, you produce noise. Enterprises drown in noise.
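To make "expected divergence" concrete, here is a minimal sketch of a tolerance-aware comparison. The 2% tax-adjustment tolerance is purely illustrative; the right figure must come from the domain that owns the lifecycle, not from the plumbing team.

```python
from decimal import Decimal

def amounts_match(order_total: Decimal, invoice_total: Decimal,
                  tax_tolerance: Decimal = Decimal("0.02")) -> bool:
    """Compare two amounts, treating small tax-recalculation drift as a match.

    The 2% tolerance is an illustrative placeholder, not a recommendation.
    """
    if order_total == invoice_total:
        return True
    # Make the tolerance relative to the larger amount so the check is symmetric.
    reference = max(abs(order_total), abs(invoice_total))
    return abs(order_total - invoice_total) <= reference * tax_tolerance
```

A comparison like this turns a raw field-equality check into a domain rule: a 1% tax recalculation is expected divergence, while a 10% gap opens a case.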
The true architecture problem is to build a reconciliation capability that can distinguish:
- Expected divergence from actual defects
- Temporal inconsistency from permanent mismatch
- Domain correction from technical failure
- Authoritative source disagreement from data transport issues
That is why reconciliation belongs in enterprise architecture and domain-driven design, not just in the plumbing team.
Forces
Several forces shape the choice between batch and stream reconciliation.
1. Business tolerance for latency
If the business can accept discrepancies being discovered tomorrow morning, batch remains a strong option. If fraud detection, inventory exposure, trading risk, or customer entitlements require immediate confidence, streaming becomes attractive.
Latency is not free, but neither is immediacy.
2. Finality versus provisional state
Some domains are naturally eventful but not final in real time. Payments, logistics, healthcare claims, and insurance are full of provisional statuses. A stream may tell you what is happening now, but the batch may still be the mechanism by which the organization asserts what is final.
This distinction matters. Many teams build streaming reconciliation for domains whose truth is only settled in periodic cycles. They end up building a very expensive early-warning system and still relying on batch for financial or regulatory closure.
3. Volume and cardinality
Batch handles massive historical comparison well, especially when full recomputation is acceptable. Streaming shines when continuously matching records with manageable state windows or keyed joins.
If you need to reconcile billions of records with long-tail arrival patterns over 90 days, a naive streaming design becomes a state management nightmare.
4. Data quality and source behavior
Streams assume relatively disciplined event production: identifiers, ordering strategy, schemas, idempotency, and ownership. Enterprises often have the opposite. Mainframes emit files, SaaS platforms emit snapshots, and legacy systems mutate records without meaningful change history.
When sources are poor event citizens, batch may be the honest answer.
5. Operating model maturity
Batch can be run by a smaller, more traditional data operations model. Streaming requires stronger SRE practices, schema governance, replay discipline, event contract management, and robust observability.
Streaming is not just a pipeline style. It is a commitment.
6. Audit and explainability
Reconciliation is often tied to controls, compliance, and external audit. Batch offers natural checkpoints and reproducible runs. Streaming can also be auditable, especially with immutable event logs, but only if the enterprise is disciplined about retention, replayability, and deterministic processing.
Solution
The best solution in most enterprises is not batch or stream. It is a dual-lane reconciliation architecture with explicit semantics for each lane.
Think of it as two roads serving different purposes:
- The streaming lane is the fast lane. It detects likely discrepancies early, raises operational alerts, enriches downstream actions, and supports near-real-time decision-making.
- The batch lane is the settlement lane. It performs complete, authoritative, replayable reconciliation over defined business periods and produces controlled exceptions, corrections, and audit artifacts.
This is the architecture adults build.
The mistake is to force one lane to do the other’s job. Streaming should not be burdened with proving final financial truth if the domain settles overnight. Batch should not be expected to support customer-facing decisions that depend on immediate discrepancy detection.
A sound design makes their roles explicit:
Streaming lane responsibilities
- Ingest domain events from Kafka or equivalent event backbone
- Correlate events across services using business keys
- Detect missing expected events within time windows
- Flag suspicious amount, status, or sequence mismatches
- Publish discrepancy events to operational workflows
- Provide provisional reconciled views
Batch lane responsibilities
- Ingest complete extracts, snapshots, or durable event history
- Reconcile across full business periods
- Recompute balances, counts, and control totals
- Validate financial and regulatory assertions
- Produce authoritative exceptions and case queues
- Maintain audit evidence and sign-off records
The key phrase is provisional versus authoritative. If you do not name that distinction in your architecture, the business will discover it the hard way.
Architecture
A practical architecture usually combines event-driven processing, domain-aligned reconciliation rules, and a batch backstop.
At the heart of the architecture sits a reconciliation domain, not just a set of jobs. This domain should have its own ubiquitous language:
- reconciliation case
- expected event
- tolerance window
- provisional match
- authoritative match
- exception reason
- business key
- correction action
- control total
- settlement period
This is classic domain-driven design thinking. Reconciliation is often treated as an afterthought spread across ETL scripts, Kafka consumers, and BI SQL. That produces fragmented logic and contradictory results. Instead, model reconciliation as its own bounded context, with clear integration points to source domains like Orders, Payments, Ledger, and Customer.
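The ubiquitous language above can be made executable. The following is a sketch, not a reference implementation: the type and field names follow the list in the text, and the provisional/authoritative split appears as an explicit status rather than an implicit property of whichever pipeline ran last.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

class MatchStatus(Enum):
    PROVISIONAL = "provisional"       # streaming lane, subject to revision
    AUTHORITATIVE = "authoritative"   # batch lane, backed by sign-off
    EXCEPTION = "exception"           # mismatch routed to a case queue

@dataclass
class ExpectedEvent:
    event_type: str                   # e.g. "PaymentAuthorized"
    tolerance_window: timedelta       # how long we wait before raising a case

@dataclass
class ReconciliationCase:
    business_key: str                 # e.g. an orderId shared across systems
    status: MatchStatus
    exception_reason: Optional[str] = None
    opened_at: datetime = field(default_factory=datetime.utcnow)
```

Modeling these terms as types in one bounded context is what stops the same concepts from being re-invented, slightly differently, in every ETL script and consumer.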
Domain semantics first
Suppose an e-commerce company wants to reconcile orders and payments. A simplistic architecture compares order amount with payment amount. A better domain model asks:
- Is reconciliation at order level, payment instruction level, capture level, or settlement level?
- Are split tenders allowed?
- Are partial shipments valid before final invoice?
- Does “paid” mean authorized, captured, or settled?
- Are chargebacks part of the same lifecycle or a separate reconciliation cycle?
Without these semantics, the technical design is theater.
Streaming lane design
In the streaming lane, Kafka is often the right backbone because it preserves ordered partitions, supports replay, and allows multiple consumers. But Kafka does not solve semantics. It merely gives you a durable log.
A streaming reconciliation service typically:
- consumes related domain topics
- normalizes events into reconciliation facts
- joins or correlates by business key
- tracks expected versus observed lifecycle milestones
- emits discrepancy events or provisional reconciled records
For example:
OrderPlaced → PaymentAuthorized → PaymentCaptured → OrderShipped → InvoiceIssued → LedgerPosted
The service maintains state keyed by orderId or paymentId, with configurable windows and tolerances. If OrderPlaced occurs but no PaymentAuthorized appears within 5 minutes, raise an operational exception. If the captured amount differs from the invoiced amount by more than the tolerated tax adjustment, raise a mismatch.
This is not “stream processing” in the abstract. It is lifecycle verification.
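That lifecycle check can be sketched in a few lines. This is a deliberately minimal in-memory model; a production version would keep the same state in a stream processor's state store (Kafka Streams, Flink, or similar), and the 5- and 30-minute lags are illustrative values, mirroring the example above.

```python
from datetime import datetime, timedelta

# Expected lifecycle and how long each step may lag the previous one.
# Real windows come from observed domain lateness, not guesses.
LIFECYCLE = [
    ("OrderPlaced", None),
    ("PaymentAuthorized", timedelta(minutes=5)),
    ("PaymentCaptured", timedelta(minutes=30)),
]

class LifecycleVerifier:
    """Minimal in-memory sketch of the stateful streaming check."""

    def __init__(self):
        self.state = {}  # business_key -> {event_type: timestamp}

    def observe(self, business_key, event_type, at):
        self.state.setdefault(business_key, {})[event_type] = at

    def overdue(self, business_key, now):
        """Return the next lifecycle step that should have arrived by `now`."""
        seen = self.state.get(business_key, {})
        missing = []
        prev_time = None
        for event_type, max_lag in LIFECYCLE:
            if event_type in seen:
                prev_time = seen[event_type]
                continue
            if prev_time is not None and max_lag is not None \
                    and now - prev_time > max_lag:
                missing.append(event_type)
            break  # later steps cannot be judged before this one resolves
        return missing
```

Given an OrderPlaced at noon, `overdue(...)` is empty two minutes later but reports a missing PaymentAuthorized after six minutes, which is exactly the "expected versus observed milestone" posture described above.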
Batch lane design
The batch lane works differently. Here the goal is completeness, repeatability, and control.
A batch reconciliation engine typically:
- loads snapshots, files, or event-history slices for a period
- aligns records to canonical business keys
- applies deterministic reconciliation rules
- calculates control totals and variance reports
- persists exception records with lineage and evidence
- supports reruns for the same accounting or business period
This lane often lands in the warehouse, lakehouse, or a dedicated reconciliation platform. It may consume Kafka-compacted topics or event archives as inputs, but the processing posture is different: stop, compare, explain, certify.
Canonical model or not?
Architects often reach for a canonical enterprise model here. Use restraint.
A thin canonical reconciliation model is useful: business key, event type, amount, effective time, source system, status, correlation identifiers. But a giant enterprise-wide canonical schema usually becomes a bureaucratic museum piece. Prefer semantic normalization at the boundary, while preserving source-specific detail for investigation.
Migration Strategy
No serious enterprise moves from batch-only reconciliation to streaming reconciliation in one jump. That path ends in a war room.
The right move is a progressive strangler migration.
Start by acknowledging what the batch system already does well: it provides trusted controls, finance confidence, and operational muscle memory. Do not rip that out because streaming feels modern. Add a streaming lane alongside it, initially as an observational capability.
A sensible migration path looks like this:
Phase 1: Instrument the domain
Identify the highest-value reconciliation journeys—payments, orders, ledger postings, inventory movements. Capture events from source systems through native publishing, CDC, or integration wrappers. At this stage, do not overpromise real-time correction. Focus on event quality and identifiers.
Phase 2: Build shadow streaming reconciliation
Run stream processors in parallel with the existing batch controls. Produce discrepancy signals, but do not let them drive formal business decisions yet. Measure:
- false positives
- false negatives
- late-arrival behavior
- duplicate event rates
- match-rate confidence by source
This phase is humbling. Good. Architecture should be humbled before production is.
Phase 3: Move operational use cases first
Use streaming reconciliation for early-warning and operational triage:
- missing payment confirmations
- delayed shipment events
- duplicate invoicing signals
- customer entitlement mismatches
These are domains where early detection matters more than legal finality.
Phase 4: Shift narrow authoritative checks
Only after confidence is high should selected authoritative controls move from batch to streaming or micro-batch, and even then, only where the domain supports real-time finality.
Phase 5: Retain batch where it belongs
Do not force full retirement. In many enterprises, batch remains the right architecture for monthly close, regulated reporting, and historical recomputation. Mature architecture is selective, not ideological.
Enterprise Example
Consider a global retail bank modernizing card payment reconciliation.
The bank had:
- a core card authorization platform
- a settlement platform
- a ledger posting system
- fraud services
- customer channels
- a nightly reconciliation engine fed by flat files
The nightly process compared authorization counts, settlement totals, chargebacks, and ledger postings. It was trusted, but painfully slow. Fraud teams could not act quickly enough on certain anomalies. Operations discovered missing postings the next day. Customers saw pending transactions that did not align with account views.
A fashionable answer would have been “put everything on Kafka and reconcile in real time.” That would have been reckless.
The bank instead modeled three distinct reconciliation domains:
- Operational payment lifecycle reconciliation
Detect missing or duplicate authorization, capture, and reversal events within minutes.
- Customer-view consistency reconciliation
Ensure digital channel balances and transaction views reflect the latest valid status, even if settlement is not complete.
- Financial settlement reconciliation
Reconcile settlement files, fee calculations, chargebacks, and ledger entries for accounting finality.
The first two moved to a streaming lane. Events from card services, fraud microservices, and channel APIs flowed into Kafka. Stream processors correlated by payment reference, card token, and merchant sequence numbers. Cases were raised within minutes for duplicate captures, missing reversals, and stale customer-channel states.
The third remained batch-oriented. Settlement files arrived from networks on periodic cycles. The ledger was reconciled by business date, not event time. Batch jobs computed authoritative balances and generated audit evidence for finance.
This hybrid design produced real value:
- operational discrepancy detection dropped from hours to minutes
- customer-facing transaction accuracy improved substantially
- finance retained trusted settlement controls
- the architecture team avoided trying to pretend settlement finality existed before settlement actually occurred
That is the point. The architecture matched the domain.
Operational Considerations
This is where good designs often die.
State management
Streaming reconciliation maintains state, and state is where optimism goes to suffer. Long windows increase completeness but consume more resources and complicate recovery. Short windows reduce cost but increase false discrepancies due to late arrivals.
Choose windows based on actual domain lateness, not guesswork.
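One way to ground that choice, sketched under the assumption that the shadow phase recorded how late each event actually arrived: pick the window from a high quantile of observed lateness rather than a round number someone liked.

```python
def suggest_window_seconds(observed_lags, coverage=0.99):
    """Pick a time window from measured event lateness.

    `observed_lags` are seconds between when an event was expected and
    when it actually arrived, collected during the shadow phase. A high
    coverage target trades a little extra state for far fewer false
    "missing event" cases.
    """
    lags = sorted(observed_lags)
    index = min(len(lags) - 1, int(coverage * len(lags)))
    return lags[index]
```

With lags of 1 to 100 seconds, 99% coverage suggests a 100-second window while 50% coverage suggests roughly half that; the gap between the two is exactly the false-discrepancy load a short window would create.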
Idempotency and duplicates
Many reconciliation defects are self-inflicted by duplicate consumption, replay side effects, or producer retries. Every reconciliation service should be idempotent. Every event should carry stable business identifiers and source event identifiers. If your ecosystem cannot provide that, expect pain.
Ordering
Kafka gives order within a partition, not across the enterprise. If events that must be compared are partitioned inconsistently, your stream logic will become awkward or wrong. Partitioning strategy is architectural, not incidental.
Schema evolution
Reconciliation logic is fragile when event contracts drift. Use schema registries, compatibility policies, and explicit version handling. A renamed status code can quietly invalidate business rules and flood case queues.
Observability
You need more than CPU and lag dashboards. Reconciliation observability should include:
- match rates by domain and source
- discrepancy rates by reason code
- late-arrival distributions
- duplicate rates
- replay counts
- batch-versus-stream variance
If you cannot explain why discrepancy volume changed this week, you are not operating the system; you are merely hosting it.
Human workflow
Some discrepancies can be auto-corrected. Many should not be. Build case management, assignment rules, evidence capture, and feedback loops into the architecture. Reconciliation is a socio-technical system. The exception queue is part of the design.
Tradeoffs
Batch lane and streaming lane are both compromises. The question is which compromise fits the domain.
Batch advantages
- simpler operational model
- natural fit for periodic business controls
- easier reproducibility and audit checkpoints
- good for large historical recomputation
- tolerant of weaker source event quality
Batch disadvantages
- slow discrepancy detection
- poor support for customer-facing immediacy
- delayed remediation
- often dependent on brittle file schedules
- tends to hide temporal dynamics until too late
Streaming advantages
- fast detection and response
- better support for event-driven microservices
- aligns with operational workflows
- can provide continuous visibility
- supports reactive automation
Streaming disadvantages
- higher complexity in state, replay, and correctness
- more sensitive to event quality and identifiers
- difficult with long settlement cycles
- can create false confidence in provisional truth
- operationally demanding
The sharpest tradeoff is this: streaming improves timeliness; batch improves completeness. You can mitigate that tension, but you do not erase it.
Failure Modes
Architects should talk more about failure modes. Systems fail in characteristic ways, and reconciliation systems fail in particularly embarrassing ones.
1. Semantic mismatch dressed as technical defect
Teams compare statuses or amounts without understanding lifecycle meaning. The system raises thousands of “errors” that are actually valid temporal states.
2. Missing business keys
No stable correlation identifier exists across systems. Reconciliation devolves into fuzzy matching, heuristics, and arguments. This is often a domain modeling failure upstream.
3. Late data floods exception queues
A stream processor with aggressive time windows marks events as missing, only for them to arrive later. Operations stop trusting the alerts.
4. Replay creates phantom discrepancies
Historical event replay is treated as fresh business activity because consumers are not replay-aware or idempotent. Suddenly the enterprise appears to have doubled its exceptions.
5. Batch and stream disagree with no arbitration model
Both lanes exist, but there is no rule for which result is authoritative in which context. The architecture produces two truths and calls it resilience.
6. Reconciliation logic is scattered
Some rules live in SQL jobs, others in Kafka consumers, others in service code, and still more in analysts’ notebooks. Eventually nobody knows which discrepancy count is official.
7. Over-centralized canonicalization
A giant central model strips away source nuance needed for investigation. The system can tell that a mismatch happened, but not why.
Good architecture anticipates these failure modes and designs governance, lineage, and ownership around them.
When Not To Use
Not every problem deserves a streaming reconciliation architecture.
Do not use streaming reconciliation when:
- the domain settles only in periodic cycles and there is little value in early signals
- sources cannot emit reliable identifiers or events
- the organization lacks schema governance and SRE capability
- discrepancy handling is entirely manual and low-frequency
- the cost of false positives exceeds the value of low-latency detection
- a simple batch control satisfies business, audit, and operational needs
Likewise, do not cling to batch-only reconciliation when:
- customer harm occurs before nightly detection
- fraud, inventory, or exposure risks accumulate rapidly
- microservices produce rich events that are wasted
- operations need immediate triage for missing lifecycle steps
Architecture should serve the economics of the domain. Not fashion. Not trauma from the last platform rewrite.
Related Patterns
Several adjacent patterns show up repeatedly around reconciliation.
Event sourcing
Useful when the domain naturally benefits from immutable fact history and replay. But event sourcing does not eliminate reconciliation. It often sharpens it, because external systems still maintain their own state and timing.
CDC (change data capture)
A practical bridge for legacy systems that cannot emit domain events. CDC can feed both batch and stream lanes, though it captures state changes rather than business intent. Treat it as a migration aid, not a semantic substitute.
Outbox pattern
Helpful for making microservice event publication reliable. If reconciliation depends on service events, the outbox pattern reduces “database committed but event missing” failures.
Saga orchestration and compensation
Relevant when discrepancies trigger business corrections across services. Reconciliation may detect the issue; a saga or compensating workflow may resolve it.
Data quality rules engines
Useful, but narrower. Data quality checks validate structural or value constraints. Reconciliation validates consistency across systems and domain lifecycles.
Lakehouse medallion layers
Common in batch architectures. Bronze and silver layers can feed reconciliation, but the medallion model by itself does not define reconciliation semantics. It is storage discipline, not business truth.
Summary
Batch versus stream reconciliation is the wrong argument if taken literally. The real question is: what kind of truth does your business need, and when does it need it?
If you need early operational awareness, customer-facing consistency, and event-driven responsiveness, a streaming lane is powerful. Kafka, stateful processing, and microservices can make reconciliation part of the living fabric of the enterprise.
If you need completeness, certification, replayable controls, and audit-grade finality, the batch lane remains indispensable. It is not old-fashioned. It is often the mechanism by which the enterprise closes the books and keeps regulators calm.
The mature answer is usually both, designed intentionally.
Use domain-driven design to define the semantics of matching, lateness, ownership, and finality. Build a reconciliation bounded context instead of scattering logic across jobs and services. Migrate with a progressive strangler approach: instrument, shadow, compare, operationalize, then selectively shift authority. Keep the batch backstop where the domain requires it. Let the streaming lane earn trust before it carries consequence.
Most of all, remember this: reconciliation is not housekeeping. It is how an enterprise learns whether its distributed promises still add up.
And in data architecture, that is about as real as it gets.