There are few things more dangerous in enterprise architecture than a number everyone believes and nobody can explain.
A balance is off by 0.7%. A customer record exists in three systems with three different “truths.” Yesterday’s settlement file says one thing, the event log says another, and the dashboard—always eager, never humble—claims everything is green. This is the quiet mess behind modern data estates. Not the glamorous side of architecture. Not the conference-slide side. The side where finance waits, operations escalates, and architects discover that “near real time” and “correct” are very different promises.
That is where reconciliation lives.
And once you start talking about reconciliation, you immediately hit a hard architectural question: do you reconcile in a batch lane, where records are collected, compared, corrected, and published on a schedule? Or do you reconcile in a streaming lane, where discrepancies are detected and handled continuously as events move through the estate?
This is not a technology choice disguised as architecture. It is a domain choice. It is about how the business experiences truth, lateness, finality, and correction. It is about whether a transaction is considered “done” when it is emitted, when it settles, when a downstream system accepts it, or when the monthly close passes without argument.
Good architecture starts there. Not with Kafka. Not with Spark. Not with whether your cloud vendor has a shiny managed service this quarter. Start with the semantics of the business, because reconciliation is really the discipline of making domain promises explicit.
Context
Most enterprises do not have one data platform. They have layers of history pretending to be a platform.
There is the operational core: order management, payments, policy administration, claims, billing, ERP, CRM. There are microservices around the edges, each with its own database and a healthy sense of autonomy. There are SaaS platforms with export APIs. There are warehouses and lakehouses collecting facts after the fact. And somewhere in this landscape, business users assume there is a coherent answer to simple questions:
- How many orders shipped yesterday?
- Which payments are authorized but not settled?
- Which customer addresses are canonical?
- Which invoices are missing from the ledger?
- What is the exposure at this moment, not last night?
Reconciliation exists because distributed systems create multiple valid but incomplete perspectives. The order service may say “completed,” the payment service may say “captured,” the ledger may not yet reflect the posting, and the fulfillment platform may have retried the same event twice. Each system is locally sensible. The enterprise view is not.
Historically, batch reconciliation was the answer. End-of-day files. Nightly ETL. Control totals. Compare counts and sums. Produce exception reports. This approach survives because it matches how many enterprises actually work: periodic processes, governed checkpoints, and human review.
Streaming changed the ambition. Instead of waiting until the end of the day to discover mismatches, teams now want continuous visibility. Events flow through Kafka, services react in real time, stateful processors detect missing joins or amount mismatches, and reconciliation becomes an always-on capability rather than a nightly ritual.
But ambition is cheap. Operations are expensive. A streaming reconciliation architecture can reduce detection latency dramatically, but it also introduces new questions around ordering, event completeness, late arrivals, duplicate handling, replay, semantic versioning, and the boundary between “provisional truth” and “final truth.”
That is why the comparison between batch lane and streaming lane matters. They are not merely different pipelines. They are different operating models for truth.
Problem
The core problem is deceptively simple: how do we ensure that business facts represented across multiple systems remain consistent enough for the enterprise to operate safely?
In practice, that breaks into several harder questions:
- How do we know two records represent the same business event?
- When is it valid to compare them?
- What counts as a mismatch: missing record, value difference, timing difference, or semantic disagreement?
- Can mismatches be corrected automatically, or do they require investigation?
- How do we preserve an audit trail of what was known, when, and why it changed?
This is where many implementations go wrong. They treat reconciliation as a low-level data comparison exercise—row counts, checksums, and field-by-field equality. Useful, yes. Sufficient, no.
Reconciliation is about domain semantics, not just record mechanics.
A payment authorization and a settlement are not expected to match at the same moment. An order total may differ from the invoice total for valid reasons such as tax recalculation or partial fulfillment. A customer record may diverge temporarily across systems because one domain owns identity, another owns communication preferences, and a third owns credit risk. If you compare without understanding ownership and lifecycle, you produce noise. Enterprises drown in noise.
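To make "expected divergence" concrete, here is a minimal sketch of a tolerance-aware comparison. The 2% tax-adjustment tolerance is purely illustrative; the right figure must come from the domain that owns the lifecycle, not from the plumbing team.

```python
from decimal import Decimal

def amounts_match(order_total: Decimal, invoice_total: Decimal,
                  tax_tolerance: Decimal = Decimal("0.02")) -> bool:
    """Compare two amounts, treating small tax-recalculation drift as a match.

    The 2% tolerance is an illustrative placeholder, not a recommendation.
    """
    if order_total == invoice_total:
        return True
    # Make the tolerance relative to the larger amount so the check is symmetric.
    reference = max(abs(order_total), abs(invoice_total))
    return abs(order_total - invoice_total) <= reference * tax_tolerance
```

A comparison like this turns a raw field-equality check into a domain rule: a 1% tax recalculation is expected divergence, while a 10% gap opens a case.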
The true architecture problem is to build a reconciliation capability that can distinguish:
- Expected divergence from actual defects
- Temporal inconsistency from permanent mismatch
- Domain correction from technical failure
- Authoritative source disagreement from data transport issues
That is why reconciliation belongs in enterprise architecture and domain-driven design, not just in the plumbing team.
Forces
Several forces shape the choice between batch and stream reconciliation.
1. Business tolerance for latency
If the business can accept discrepancies being discovered tomorrow morning, batch remains a strong option. If fraud detection, inventory exposure, trading risk, or customer entitlements require immediate confidence, streaming becomes attractive.
Latency is not free, but neither is immediacy.
2. Finality versus provisional state
Some domains are naturally eventful but not final in real time. Payments, logistics, healthcare claims, and insurance are full of provisional statuses. A stream may tell you what is happening now, but the batch may still be the mechanism by which the organization asserts what is final.
This distinction matters. Many teams build streaming reconciliation for domains whose truth is only settled in periodic cycles. They end up building a very expensive early-warning system and still relying on batch for financial or regulatory closure.
3. Volume and cardinality
Batch handles massive historical comparison well, especially when full recomputation is acceptable. Streaming shines when continuously matching records with manageable state windows or keyed joins.
If you need to reconcile billions of records with long-tail arrival patterns over 90 days, a naive streaming design becomes a state management nightmare.
4. Data quality and source behavior
Streams assume relatively disciplined event production: identifiers, ordering strategy, schemas, idempotency, and ownership. Enterprises often have the opposite. Mainframes emit files, SaaS platforms emit snapshots, and legacy systems mutate records without meaningful change history.
When sources are poor event citizens, batch may be the honest answer.
5. Operating model maturity
Batch can be run by a smaller, more traditional data operations model. Streaming requires stronger SRE practices, schema governance, replay discipline, event contract management, and robust observability.
Streaming is not just a pipeline style. It is a commitment.
6. Audit and explainability
Reconciliation is often tied to controls, compliance, and external audit. Batch offers natural checkpoints and reproducible runs. Streaming can also be auditable, especially with immutable event logs, but only if the enterprise is disciplined about retention, replayability, and deterministic processing.
Solution
The best solution in most enterprises is not batch or stream. It is a dual-lane reconciliation architecture with explicit semantics for each lane.
Think of it as two roads serving different purposes:
- The streaming lane is the fast lane. It detects likely discrepancies early, raises operational alerts, enriches downstream actions, and supports near-real-time decision-making.
- The batch lane is the settlement lane. It performs complete, authoritative, replayable reconciliation over defined business periods and produces controlled exceptions, corrections, and audit artifacts.
This is the architecture adults build.
The mistake is to force one lane to do the other’s job. Streaming should not be burdened with proving final financial truth if the domain settles overnight. Batch should not be expected to support customer-facing decisions that depend on immediate discrepancy detection.
A sound design makes their roles explicit:
Streaming lane responsibilities
- Ingest domain events from Kafka or equivalent event backbone
- Correlate events across services using business keys
- Detect missing expected events within time windows
- Flag suspicious amount, status, or sequence mismatches
- Publish discrepancy events to operational workflows
- Provide provisional reconciled views
Batch lane responsibilities
- Ingest complete extracts, snapshots, or durable event history
- Reconcile across full business periods
- Recompute balances, counts, and control totals
- Validate financial and regulatory assertions
- Produce authoritative exceptions and case queues
- Maintain audit evidence and sign-off records
The key phrase is provisional versus authoritative. If you do not name that distinction in your architecture, the business will discover it the hard way.
Architecture
A practical architecture usually combines event-driven processing, domain-aligned reconciliation rules, and a batch backstop.
At the heart of the architecture sits a reconciliation domain, not just a set of jobs. This domain should have its own ubiquitous language:
- reconciliation case
- expected event
- tolerance window
- provisional match
- authoritative match
- exception reason
- business key
- correction action
- control total
- settlement period
This is classic domain-driven design thinking. Reconciliation is often treated as an afterthought spread across ETL scripts, Kafka consumers, and BI SQL. That produces fragmented logic and contradictory results. Instead, model reconciliation as its own bounded context, with clear integration points to source domains like Orders, Payments, Ledger, and Customer.
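The ubiquitous language above can be made executable. The following is a sketch, not a reference implementation: the type and field names follow the list in the text, and the provisional/authoritative split appears as an explicit status rather than an implicit property of whichever pipeline ran last.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

class MatchStatus(Enum):
    PROVISIONAL = "provisional"       # streaming lane, subject to revision
    AUTHORITATIVE = "authoritative"   # batch lane, backed by sign-off
    EXCEPTION = "exception"           # mismatch routed to a case queue

@dataclass
class ExpectedEvent:
    event_type: str                   # e.g. "PaymentAuthorized"
    tolerance_window: timedelta       # how long we wait before raising a case

@dataclass
class ReconciliationCase:
    business_key: str                 # e.g. an orderId shared across systems
    status: MatchStatus
    exception_reason: Optional[str] = None
    opened_at: datetime = field(default_factory=datetime.utcnow)
```

Modeling these terms as types in one bounded context is what stops the same concepts from being re-invented, slightly differently, in every ETL script and consumer.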
Domain semantics first
Suppose an e-commerce company wants to reconcile orders and payments. A simplistic architecture compares order amount with payment amount. A better domain model asks:
- Is reconciliation at order level, payment instruction level, capture level, or settlement level?
- Are split tenders allowed?
- Are partial shipments valid before final invoice?
- Does “paid” mean authorized, captured, or settled?
- Are chargebacks part of the same lifecycle or a separate reconciliation cycle?
Without these semantics, the technical design is theater.
Streaming lane design
In the streaming lane, Kafka is often the right backbone because it preserves ordered partitions, supports replay, and allows multiple consumers. But Kafka does not solve semantics. It merely gives you a durable log.
A streaming reconciliation service typically:
- consumes related domain topics
- normalizes events into reconciliation facts
- joins or correlates by business key
- tracks expected versus observed lifecycle milestones
- emits discrepancy events or provisional reconciled records
For example:
OrderPlaced → PaymentAuthorized → PaymentCaptured → OrderShipped → InvoiceIssued → LedgerPosted
The service maintains state keyed by orderId or paymentId, with configurable windows and tolerances. If OrderPlaced occurs but no PaymentAuthorized appears within 5 minutes, raise an operational exception. If the captured amount differs from the invoiced amount by more than the tolerated tax adjustment, raise a mismatch.
This is not “stream processing” in the abstract. It is lifecycle verification.
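That lifecycle check can be sketched in a few lines. This is a deliberately minimal in-memory model; a production version would keep the same state in a stream processor's state store (Kafka Streams, Flink, or similar), and the 5- and 30-minute lags are illustrative values, mirroring the example above.

```python
from datetime import datetime, timedelta

# Expected lifecycle and how long each step may lag the previous one.
# Real windows come from observed domain lateness, not guesses.
LIFECYCLE = [
    ("OrderPlaced", None),
    ("PaymentAuthorized", timedelta(minutes=5)),
    ("PaymentCaptured", timedelta(minutes=30)),
]

class LifecycleVerifier:
    """Minimal in-memory sketch of the stateful streaming check."""

    def __init__(self):
        self.state = {}  # business_key -> {event_type: timestamp}

    def observe(self, business_key, event_type, at):
        self.state.setdefault(business_key, {})[event_type] = at

    def overdue(self, business_key, now):
        """Return the next lifecycle step that should have arrived by `now`."""
        seen = self.state.get(business_key, {})
        missing = []
        prev_time = None
        for event_type, max_lag in LIFECYCLE:
            if event_type in seen:
                prev_time = seen[event_type]
                continue
            if prev_time is not None and max_lag is not None \
                    and now - prev_time > max_lag:
                missing.append(event_type)
            break  # later steps cannot be judged before this one resolves
        return missing
```

Given an OrderPlaced at noon, `overdue(...)` is empty two minutes later but reports a missing PaymentAuthorized after six minutes, which is exactly the "expected versus observed milestone" posture described above.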
Batch lane design
The batch lane works differently. Here the goal is completeness, repeatability, and control.
A batch reconciliation engine typically:
- loads snapshots, files, or event-history slices for a period
- aligns records to canonical business keys
- applies deterministic reconciliation rules
- calculates control totals and variance reports
- persists exception records with lineage and evidence
- supports reruns for the same accounting or business period
This lane often lands in the warehouse, lakehouse, or a dedicated reconciliation platform. It may consume Kafka-compacted topics or event archives as inputs, but the processing posture is different: stop, compare, explain, certify.
Canonical model or not?
Architects often reach for a canonical enterprise model here. Use restraint.
A thin canonical reconciliation model is useful: business key, event type, amount, effective time, source system, status, correlation identifiers. But a giant enterprise-wide canonical schema usually becomes a bureaucratic museum piece. Prefer semantic normalization at the boundary, while preserving source-specific detail for investigation.
Migration Strategy
No serious enterprise moves from batch-only reconciliation to streaming reconciliation in one jump. That path ends in a war room.
The right move is a progressive strangler migration.
Start by acknowledging what the batch system already does well: it provides trusted controls, finance confidence, and operational muscle memory. Do not rip that out because streaming feels modern. Add a streaming lane alongside it, initially as an observational capability.
A sensible migration path looks like this:
Phase 1: Instrument the domain
Identify the highest-value reconciliation journeys—payments, orders, ledger postings, inventory movements. Capture events from source systems through native publishing, CDC, or integration wrappers. At this stage, do not overpromise real-time correction. Focus on event quality and identifiers.
Phase 2: Build shadow streaming reconciliation
Run stream processors in parallel with the existing batch controls. Produce discrepancy signals, but do not let them drive formal business decisions yet. Measure:
- false positives
- false negatives
- late-arrival behavior
- duplicate event rates
- match-rate confidence by source
This phase is humbling. Good. Architecture should be humbled before production is.
Phase 3: Move operational use cases first
Use streaming reconciliation for early-warning and operational triage:
- missing payment confirmations
- delayed shipment events
- duplicate invoicing signals
- customer entitlement mismatches
These are domains where early detection matters more than legal finality.
Phase 4: Shift narrow authoritative checks
Only after confidence is high should selected authoritative controls move from batch to streaming or micro-batch, and even then, only where the domain supports real-time finality.
Phase 5: Retain batch where it belongs
Do not force full retirement. In many enterprises, batch remains the right architecture for monthly close, regulated reporting, and historical recomputation. Mature architecture is selective, not ideological.
Enterprise Example
Consider a global retail bank modernizing card payment reconciliation.
The bank had:
- a core card authorization platform
- a settlement platform
- a ledger posting system
- fraud services
- customer channels
- a nightly reconciliation engine fed by flat files
The nightly process compared authorization counts, settlement totals, chargebacks, and ledger postings. It was trusted, but painfully slow. Fraud teams could not act quickly enough on certain anomalies. Operations discovered missing postings the next day. Customers saw pending transactions that did not align with account views.
A fashionable answer would have been “put everything on Kafka and reconcile in real time.” That would have been reckless.
The bank instead modeled three distinct reconciliation domains:
- Operational payment lifecycle reconciliation
Detect missing or duplicate authorization, capture, and reversal events within minutes.
- Customer-view consistency reconciliation
Ensure digital channel balances and transaction views reflect the latest valid status, even if settlement is not complete.
- Financial settlement reconciliation
Reconcile settlement files, fee calculations, chargebacks, and ledger entries for accounting finality.
The first two moved to a streaming lane. Events from card services, fraud microservices, and channel APIs flowed into Kafka. Stream processors correlated by payment reference, card token, and merchant sequence numbers. Cases were raised within minutes for duplicate captures, missing reversals, and stale customer-channel states.
The third remained batch-oriented. Settlement files arrived from networks on periodic cycles. The ledger was reconciled by business date, not event time. Batch jobs computed authoritative balances and generated audit evidence for finance.
This hybrid design produced real value:
- operational discrepancy detection dropped from hours to minutes
- customer-facing transaction accuracy improved substantially
- finance retained trusted settlement controls
- the architecture team avoided trying to pretend settlement finality existed before settlement actually occurred
That is the point. The architecture matched the domain.
Operational Considerations
This is where good designs often die.
State management
Streaming reconciliation maintains state, and state is where optimism goes to suffer. Long windows increase completeness but consume more resources and complicate recovery. Short windows reduce cost but increase false discrepancies due to late arrivals.
Choose windows based on actual domain lateness, not guesswork.
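One way to ground that choice, sketched under the assumption that the shadow phase recorded how late each event actually arrived: pick the window from a high quantile of observed lateness rather than a round number someone liked.

```python
def suggest_window_seconds(observed_lags, coverage=0.99):
    """Pick a time window from measured event lateness.

    `observed_lags` are seconds between when an event was expected and
    when it actually arrived, collected during the shadow phase. A high
    coverage target trades a little extra state for far fewer false
    "missing event" cases.
    """
    lags = sorted(observed_lags)
    index = min(len(lags) - 1, int(coverage * len(lags)))
    return lags[index]
```

With lags of 1 to 100 seconds, 99% coverage suggests a 100-second window while 50% coverage suggests roughly half that; the gap between the two is exactly the false-discrepancy load a short window would create.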
Idempotency and duplicates
Many reconciliation defects are self-inflicted by duplicate consumption, replay side effects, or producer retries. Every reconciliation service should be idempotent. Every event should carry stable business identifiers and source event identifiers. If your ecosystem cannot provide that, expect pain.
Ordering
Kafka gives order within a partition, not across the enterprise. If events that must be compared are partitioned inconsistently, your stream logic will become awkward or wrong. Partitioning strategy is architectural, not incidental.
Schema evolution
Reconciliation logic is fragile when event contracts drift. Use schema registries, compatibility policies, and explicit version handling. A renamed status code can quietly invalidate business rules and flood case queues.
Observability
You need more than CPU and lag dashboards. Reconciliation observability should include:
- match rates by domain and source
- discrepancy rates by reason code
- late-arrival distributions
- duplicate rates
- replay counts
- batch-versus-stream variance
If you cannot explain why discrepancy volume changed this week, you are not operating the system; you are merely hosting it.
Human workflow
Some discrepancies can be auto-corrected. Many should not be. Build case management, assignment rules, evidence capture, and feedback loops into the architecture. Reconciliation is a socio-technical system. The exception queue is part of the design.
Tradeoffs
Batch lane and streaming lane are both compromises. The question is which compromise fits the domain.
Batch advantages
- simpler operational model
- natural fit for periodic business controls
- easier reproducibility and audit checkpoints
- good for large historical recomputation
- tolerant of weaker source event quality
Batch disadvantages
- slow discrepancy detection
- poor support for customer-facing immediacy
- delayed remediation
- often dependent on brittle file schedules
- tends to hide temporal dynamics until too late
Streaming advantages
- fast detection and response
- better support for event-driven microservices
- aligns with operational workflows
- can provide continuous visibility
- supports reactive automation
Streaming disadvantages
- higher complexity in state, replay, and correctness
- more sensitive to event quality and identifiers
- difficult with long settlement cycles
- can create false confidence in provisional truth
- operationally demanding
The sharpest tradeoff is this: streaming improves timeliness; batch improves completeness. You can mitigate that tension, but you do not erase it.
Failure Modes
Architects should talk more about failure modes. Systems fail in characteristic ways, and reconciliation systems fail in particularly embarrassing ones.
1. Semantic mismatch dressed as technical defect
Teams compare statuses or amounts without understanding lifecycle meaning. The system raises thousands of “errors” that are actually valid temporal states.
2. Missing business keys
No stable correlation identifier exists across systems. Reconciliation devolves into fuzzy matching, heuristics, and arguments. This is often a domain modeling failure upstream.
3. Late data floods exception queues
A stream processor with aggressive time windows marks events as missing, only for them to arrive later. Operations stop trusting the alerts.
4. Replay creates phantom discrepancies
Historical event replay is treated as fresh business activity because consumers are not replay-aware or idempotent. Suddenly the enterprise appears to have doubled its exceptions.
5. Batch and stream disagree with no arbitration model
Both lanes exist, but there is no rule for which result is authoritative in which context. The architecture produces two truths and calls it resilience.
6. Reconciliation logic is scattered
Some rules live in SQL jobs, others in Kafka consumers, others in service code, and still more in analysts’ notebooks. Eventually nobody knows which discrepancy count is official.
7. Over-centralized canonicalization
A giant central model strips away source nuance needed for investigation. The system can tell that a mismatch happened, but not why.
Good architecture anticipates these failure modes and designs governance, lineage, and ownership around them.
When Not To Use
Not every problem deserves a streaming reconciliation architecture.
Do not use streaming reconciliation when:
- the domain settles only in periodic cycles and there is little value in early signals
- sources cannot emit reliable identifiers or events
- the organization lacks schema governance and SRE capability
- discrepancy handling is entirely manual and low-frequency
- the cost of false positives exceeds the value of low-latency detection
- a simple batch control satisfies business, audit, and operational needs
Likewise, do not cling to batch-only reconciliation when:
- customer harm occurs before nightly detection
- fraud, inventory, or exposure risks accumulate rapidly
- microservices produce rich events that are wasted
- operations need immediate triage for missing lifecycle steps
Architecture should serve the economics of the domain. Not fashion. Not trauma from the last platform rewrite.
Related Patterns
Several adjacent patterns show up repeatedly around reconciliation.
Event sourcing
Useful when the domain naturally benefits from immutable fact history and replay. But event sourcing does not eliminate reconciliation. It often sharpens it, because external systems still maintain their own state and timing.
CDC (change data capture)
A practical bridge for legacy systems that cannot emit domain events. CDC can feed both batch and stream lanes, though it captures state changes rather than business intent. Treat it as a migration aid, not a semantic substitute.
Outbox pattern
Helpful for making microservice event publication reliable. If reconciliation depends on service events, the outbox pattern reduces “database committed but event missing” failures.
Saga orchestration and compensation
Relevant when discrepancies trigger business corrections across services. Reconciliation may detect the issue; a saga or compensating workflow may resolve it.
Data quality rules engines
Useful, but narrower. Data quality checks validate structural or value constraints. Reconciliation validates consistency across systems and domain lifecycles.
Lakehouse medallion layers
Common in batch architectures. Bronze and silver layers can feed reconciliation, but the medallion model by itself does not define reconciliation semantics. It is storage discipline, not business truth.
Summary
Batch versus stream reconciliation is the wrong argument if taken literally. The real question is: what kind of truth does your business need, and when does it need it?
If you need early operational awareness, customer-facing consistency, and event-driven responsiveness, a streaming lane is powerful. Kafka, stateful processing, and microservices can make reconciliation part of the living fabric of the enterprise.
If you need completeness, certification, replayable controls, and audit-grade finality, the batch lane remains indispensable. It is not old-fashioned. It is often the mechanism by which the enterprise closes the books and keeps regulators calm.
The mature answer is usually both, designed intentionally.
Use domain-driven design to define the semantics of matching, lateness, ownership, and finality. Build a reconciliation bounded context instead of scattering logic across jobs and services. Migrate with a progressive strangler approach: instrument, shadow, compare, operationalize, then selectively shift authority. Keep the batch backstop where the domain requires it. Let the streaming lane earn trust before it carries consequence.
Most of all, remember this: reconciliation is not housekeeping. It is how an enterprise learns whether its distributed promises still add up.
And in data architecture, that is about as real as it gets.