Most data platforms look elegant in the first slide deck and feral in production.
On the whiteboard, events flow neatly from producers to streams to warehouses to dashboards. Boxes are clean. Arrows are straight. Everyone nods. Then reality arrives: a bug in enrichment logic, a schema change that was “backward compatible” until it wasn’t, a partner feed that silently dropped 4% of records for six days, or a regulator asking for a corrected position report based on the rules that were valid last Tuesday, not today. Suddenly the question is no longer how data moves. The real question is whether the platform can rethink the past.
That is what reprocessing architecture is about.
Reprocessing is often treated as a technical afterthought: replay some Kafka topics, rerun a batch job, backfill a table, and hope nobody notices. That mindset is expensive. Reprocessing is not merely rerunning code. It is a business capability. It sits at the fault line between domain semantics, operational resilience, and architectural honesty. If your pipeline cannot safely revisit earlier facts and derive corrected outcomes, then your data platform is not robust. It is just fast on good days.
The hard part is this: “reprocessing” means very different things depending on the domain. In retail, it may mean recalculating order margins after tax rules change. In banking, it may mean rebuilding ledger positions from immutable transaction events with the exact reference data that was valid at a point in time. In insurance, it may mean correcting claim eligibility decisions after a policy mapping defect. The business does not care whether the mechanism is replay, backfill, compaction, checkpoint reset, or temporal query. It cares whether the corrected result is accurate, auditable, and explainable.
That is where architecture earns its keep.
This article lays out a pragmatic architecture for reprocessing in data pipelines: not a theoretical ideal, but a shape that survives enterprise gravity. We will look at the forces that make reprocessing difficult, the domain-driven design choices that prevent semantic drift, the role of Kafka and microservices, a progressive strangler migration path, reconciliation as a first-class concern, and the tradeoffs and failure modes that tend to ambush teams who think replaying messages is the same as repairing a business process.
Context
Modern data pipelines rarely serve one purpose. They support operational decisions, analytics, machine learning features, regulatory reporting, customer communications, and increasingly near-real-time process automation. The same underlying event stream might feed fraud scoring, inventory reservation, customer notifications, and finance.
That multiplicity is exactly why reprocessing becomes painful.
A pipeline that only pushes data forward can ignore many inconvenient truths. It can bake logic into jobs, let schemas drift, depend on mutable reference tables, and rely on side effects hidden inside consumer services. But the moment you need to reprocess historical data, every shortcut becomes visible. Did you preserve the original event? Can you reconstruct the business context at the time? Are downstream actions idempotent? Can you distinguish “corrected output” from “duplicate output”? Which version of the transformation logic should apply: the old one, the new one, or both for comparison?
In enterprise systems, reprocessing usually appears in one of five scenarios:
- Bug correction: a transformation or enrichment defect produced wrong outputs.
- Late-arriving data: upstream systems delivered records after the business decision window.
- Schema or rule evolution: logic changed and historical outputs must be recalculated.
- Data recovery: an outage, consumer lag, or corrupted sink left gaps.
- Regulatory or audit reconstruction: the business needs to reproduce outputs from historical facts.
Each scenario sounds similar operationally. They are not similar semantically. A replay to recover a failed Elasticsearch index is not the same thing as restating customer commission calculations. Good architecture starts by admitting that “reprocessing” is not one thing.
Problem
Most pipeline estates are built for throughput, not for correction.
Teams optimize for producer decoupling, low-latency streaming, and rapid downstream integration. Kafka topics proliferate. Microservices subscribe. Data lands in lakes, warehouses, search indexes, and caches. Some consumers are stateless. Many are not. Reference data gets read from mutable stores. Decisions become side effects. Over time, the estate becomes a machine that can only move in one direction.
Then a defect appears.
At that point, the system usually lacks at least one of these essentials:
- A durable source of truth for original business events
- Versioned transformation logic or at least traceability of rule versions
- Temporal reference data to reconstruct historical context
- Idempotent consumers so replay does not duplicate effects
- Scoped replay mechanisms that can target affected subsets
- Reconciliation processes to prove corrected outputs are complete and accurate
Without these, reprocessing becomes a dangerous ritual. Teams reset offsets. Backfill scripts are written under pressure. Data engineers run one-off SQL updates. Operations manually compare counts. Business users discover mismatches days later.
The architecture problem is not simply how to rerun data. It is how to make historical correction safe, bounded, explainable, and cheap enough to use before an incident becomes a crisis.
Forces
Reprocessing sits under several competing forces. This is why simplistic advice tends to fail.
1. Immutability versus business correction
Architects love immutable events because they preserve facts. Business users love corrected outputs because they reflect reality as it should have been interpreted. Both matter. The tension is that facts should remain unchanged, while derived views often must change.
A useful rule: do not mutate facts to repair interpretations. Keep the original event immutable. Create revised derived state.
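The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the event and store names (`OrderPlaced`, `EventLog`, `build_projection`) are hypothetical. The point is the shape: facts are appended and never edited, and a correction produces a new revision of the derived view.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class OrderPlaced:          # immutable business fact (illustrative name)
    order_id: str
    amount: float

@dataclass
class EventLog:
    """Append-only store: facts are added, never changed or deleted."""
    _events: list = field(default_factory=list)

    def append(self, event):
        self._events.append(event)

    def all(self):
        return tuple(self._events)   # read-only view of the facts

def build_projection(events, revision):
    """Derived state carries a revision tag; a correction creates a
    new revision instead of editing facts in place."""
    totals = {}
    for e in events:
        totals[e.order_id] = totals.get(e.order_id, 0.0) + e.amount
    return {"revision": revision, "totals": totals}

log = EventLog()
log.append(OrderPlaced("o-1", 100.0))
log.append(OrderPlaced("o-1", 25.0))

v1 = build_projection(log.all(), revision=1)
# a corrected interpretation produces revision 2; the log is untouched
v2 = build_projection(log.all(), revision=2)
```

The frozen dataclass and the tuple view are small enforcement mechanisms for the same idea: nothing downstream can quietly rewrite a fact.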
2. Throughput versus replayability
Optimizing for real-time performance often encourages shortcuts: in-memory joins, mutable lookups, ephemeral state, and “good enough” schemas. These choices speed the happy path and poison the replay path.
Replayability has a cost. You pay in storage, metadata, traceability, and discipline.
3. Technical events versus domain events
Many Kafka topics are not domain events at all. They are integration exhaust: customer-updated-v7, db-change-log, enrichment-output, and other technically convenient but semantically weak messages. These are poor anchors for reprocessing because they often lack business meaning.
Domain-driven design matters here. Reprocessing architecture should be built around business facts and domain invariants, not around whatever happened to be emitted by a connector.
4. Full replay versus targeted repair
A full historical replay sounds elegant. It is often reckless.
Enterprise pipelines are large, shared, and expensive. Replaying everything may overload downstream systems, regenerate already-correct outputs, and create ambiguity around what changed. The practical architecture supports selective reprocessing by time window, entity set, correlation key, defect version, or business case.
5. Accuracy versus operational risk
The more downstream actions a pipeline triggers—notifications, payouts, risk flags, partner calls—the more dangerous replay becomes. Correction may require suppressing side effects, generating compensations, or routing replay outputs down a different lane.
This is why reprocessing is not merely a data concern. It is a process concern.
Solution
The core solution is simple to say and hard to implement well:
Separate immutable business facts from derived projections, and design a controlled reprocessing lane that can rebuild projections from facts using explicit versions of logic and context.
That sentence carries the whole architecture.
A solid reprocessing design has six elements:
- Canonical domain event store or event backbone
Preserve business facts in immutable form. Kafka can serve as the transport and short-to-medium retention log, but for enterprise replay horizons you usually also need durable archival storage in object store, lakehouse, or event repository.
- Versioned processing stages
Transformations should be deployable and traceable by version. A reprocessing run must know which logic version it used.
- Temporal reference data
Joins to customer hierarchies, pricing, policy mappings, FX rates, or product taxonomies should be reconstructable “as of” a point in time, or explicitly version-pinned.
- Scoped replay orchestration
Reprocessing is initiated through a control plane, not by ad hoc offset surgery. The control plane defines scope, reason, input set, code version, and output target.
- Idempotent or compensating consumers
Downstream systems must either accept duplicate-safe updates or support compensations when corrections invalidate prior effects.
- Reconciliation and audit trail
Every reprocessing run needs measurable proof: records considered, records replayed, records corrected, records rejected, business totals before and after, and unresolved exceptions.
When teams miss one of these, they usually compensate with heroics. Heroics do not scale.
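The "versioned processing stages" element above is easy to under-specify, so here is a hedged sketch of one way to make logic versions explicit: a registry keyed by version, so a run records exactly which transformation it executed. All names (`premium_v1`, the `transform` decorator) are illustrative, not a specific framework API.

```python
# Registry of transformation versions so a reprocessing run can record
# and pin the exact logic it executed -- never an implicit "latest".
TRANSFORMS = {}

def transform(version):
    def register(fn):
        TRANSFORMS[version] = fn
        return fn
    return register

@transform("premium-v1")
def premium_v1(event):
    return event["base"] * 1.10          # original (defective) loading

@transform("premium-v2")
def premium_v2(event):
    return event["base"] * 1.05          # corrected loading

def run(events, version):
    fn = TRANSFORMS[version]             # explicit pin
    return {"logic_version": version,
            "outputs": [fn(e) for e in events]}

events = [{"base": 100.0}, {"base": 200.0}]
forward = run(events, "premium-v1")
replayed = run(events, "premium-v2")     # same inputs, corrected logic
```

Keeping both versions callable also enables the comparison mode mentioned earlier: run old and new logic over the same scope and diff the outputs before promoting anything.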
Architecture
The architecture that works in practice is usually a dual-lane design: a forward processing lane for normal flow and a reprocessing lane for controlled correction. They share core transformation components but differ in orchestration, isolation, and side-effect handling.
Reference architecture
This is not an event-sourcing manifesto. It is a practical split of responsibilities.
- Kafka domain topics carry durable business events for operational integration.
- Raw event archive stores those facts for longer replay horizons and bulk backfills.
- Forward processing services power normal consumers and projections.
- Reprocessing services use the same core business logic where possible, but execute under explicit controls: scoped inputs, pinned reference data, isolated outputs, and reconciliation.
- Promotion / Merge moves validated corrected outputs into production views, often via replace, merge-upsert, version switch, or compensation.
Domain semantics first
This is where domain-driven design becomes essential rather than decorative.
If your reprocessing input is a CDC stream of row changes from ten upstream tables, you are starting from plumbing, not domain meaning. You can build a pipeline on that. You cannot build a trustworthy reprocessing architecture on it without pain.
Instead, define bounded contexts and identify the domain facts that matter:
- OrderPlaced
- PaymentCaptured
- ShipmentDispatched
- TradeBooked
- PremiumCalculated
- ClaimRegistered
These are semantically stable anchors. They let you ask business questions like: “Recalculate premium outputs for all policies where rating rule R17 was applied between 1 and 7 March.” That is the right level of abstraction. Reprocessing should be expressible in domain language, not only in partition and offset language.
A memorable line worth keeping: Offsets are not business scope.
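To make "domain language, not offset language" concrete, here is a small sketch of a scope selection over an archived fact set. The records and field names are invented for illustration; in practice the archive would be an object store or lakehouse table, not an in-memory list.

```python
from datetime import date

# Illustrative archived facts (hypothetical schema).
archive = [
    {"policy_id": "P1", "rule": "R17", "country": "DE", "on": date(2026, 3, 2)},
    {"policy_id": "P2", "rule": "R17", "country": "FR", "on": date(2026, 3, 3)},
    {"policy_id": "P3", "rule": "R09", "country": "DE", "on": date(2026, 3, 4)},
    {"policy_id": "P4", "rule": "R17", "country": "DE", "on": date(2026, 3, 9)},
]

def select_scope(events, rule, country, start, end):
    """Replay scope in domain terms: which rule, which market, which
    window -- not which partitions and offsets."""
    return [e for e in events
            if e["rule"] == rule
            and e["country"] == country
            and start <= e["on"] <= end]

scope = select_scope(archive, "R17", "DE",
                     date(2026, 3, 1), date(2026, 3, 7))
```

A query like this answers the business question directly: only one of the four archived policies falls inside the defective rule, market, and window.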
Reprocessing control plane
Do not let engineers trigger replay by directly manipulating consumers unless you enjoy incident reviews.
A proper control plane captures:
- business reason for reprocessing
- affected bounded context
- input scope
- replay mode: simulation, dry run, corrective run
- code and rule version
- reference data version or time-travel policy
- output destination
- side-effect policy
- reconciliation checks
- approval workflow for sensitive domains
This can be implemented as a workflow service, orchestration job, internal platform API, or even a governed batch framework. The technology matters less than the discipline.
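As one possible shape for that discipline, the control-plane fields listed above can be captured in a single run manifest. This is a sketch under assumptions: every field and value here (`ReprocessingRequest`, `ReplayMode`, the approval rule) is illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional

class ReplayMode(Enum):
    SIMULATION = "simulation"
    DRY_RUN = "dry_run"
    CORRECTIVE = "corrective"

@dataclass(frozen=True)
class ReprocessingRequest:
    """One record per run, stored by the control plane before any
    message moves. Field names are illustrative."""
    reason: str
    bounded_context: str
    input_scope: str
    mode: ReplayMode
    code_version: str
    reference_data_policy: str
    output_destination: str
    suppress_side_effects: bool
    approved_by: Optional[str] = None

    def ready(self) -> bool:
        # Corrective runs require an approver; simulations do not.
        if self.mode is ReplayMode.CORRECTIVE:
            return self.approved_by is not None
        return True

req = ReprocessingRequest(
    reason="defect in risk loading rule",
    bounded_context="policy-pricing",
    input_scope="rule=R17 country=DE 2026-03-01..2026-03-07",
    mode=ReplayMode.CORRECTIVE,
    code_version="premium-calc-4.2.1",
    reference_data_policy="as-of event time",
    output_destination="staging.premium_corrected",
    suppress_side_effects=True,
)
approved = replace(req, approved_by="head-of-pricing")
```

Even this much structure makes a replay auditable: the reason, scope, logic version, and side-effect policy exist as a record before the first event is reprocessed.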
Reconciliation as a first-class capability
Most teams tack reconciliation on at the end. That is backward. Reprocessing without reconciliation is just a more elaborate way to create uncertainty.
Reconciliation must answer at least four questions:
- Did we process the intended input set?
- Did we produce the expected number of outputs?
- Do business aggregates align with source facts and prior state?
- Which records remain unresolved and why?
This is where counts alone are dangerous. Enterprise reconciliation needs both technical controls and business controls.
- Technical: input count, output count, duplicates, rejects, lag, partition completeness
- Business: monetary totals, position sums, policy counts, invoice amounts, entity-level comparisons
A pipeline can pass technical reconciliation and still be wrong in business terms. That is a common failure mode.
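That failure mode can be demonstrated in a few lines. The following is a deliberately simplified sketch (field names and tolerance are assumptions) where the technical control passes and the business control catches a value error that counts alone would miss.

```python
def reconcile(expected, produced, key="policy_id", amount="premium"):
    """Technical control: counts and key coverage match.
    Business control: monetary totals match within tolerance.
    Both must pass before promotion."""
    technical_ok = (
        len(produced) == len(expected)
        and {p[key] for p in produced} == {e[key] for e in expected}
    )
    expected_total = sum(e[amount] for e in expected)
    produced_total = sum(p[amount] for p in produced)
    business_ok = abs(produced_total - expected_total) < 0.01
    return {"technical_ok": technical_ok,
            "business_ok": business_ok,
            "delta": round(produced_total - expected_total, 2)}

expected = [{"policy_id": "P1", "premium": 500.0},
            {"policy_id": "P2", "premium": 300.0}]
produced = [{"policy_id": "P1", "premium": 500.0},
            {"policy_id": "P2", "premium": 295.0}]   # wrong value

result = reconcile(expected, produced)
# counts and keys match, totals do not: technical passes, business fails
```

Real reconciliation would add per-entity comparisons and rejection reasons, but the two-control structure is the essential part.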
Side-effect isolation
One of the oldest sins in microservice estates is mixing state derivation with irreversible side effects. A consumer reads an event, updates a database, sends an email, calls a partner API, and publishes a new event. That seems efficient until replay arrives and everything happens again.
For reprocessing, separate:
- state projections that can be rebuilt
- business decisions that may need review
- external side effects that should be suppressed, deduplicated, or compensated
A replay lane often writes corrected state to staging projections first. Downstream notifications or partner interactions are then emitted only after validation.
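A consumer that respects this separation might look roughly like the sketch below. It is illustrative only (`ReplayAwareConsumer` and its fields are invented names): projection writes are idempotent keyed upserts, and in replay mode external effects are held back rather than emitted.

```python
class ReplayAwareConsumer:
    """Sketch: rebuildable state is written idempotently; irreversible
    external effects are suppressed during replay and queued for
    emission only after validation."""
    def __init__(self, replay_mode=False):
        self.replay_mode = replay_mode
        self.projection = {}        # rebuildable state, keyed upsert
        self.sent_emails = []       # irreversible external effect
        self.pending_effects = []   # held back until validation

    def handle(self, event):
        # Idempotent: processing the same event twice converges
        # to the same projection state.
        self.projection[event["id"]] = event["value"]
        effect = f"notify:{event['id']}"
        if self.replay_mode:
            self.pending_effects.append(effect)
        else:
            self.sent_emails.append(effect)

replay = ReplayAwareConsumer(replay_mode=True)
for e in [{"id": "a", "value": 1},
          {"id": "a", "value": 1},      # duplicate delivery
          {"id": "b", "value": 2}]:
    replay.handle(e)
```

Note that the duplicate event leaves the projection unchanged, while the pending effects list still needs deduplication at promotion time; that review step is exactly what the staging-first approach buys.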
Reprocess flow
This pattern gives reprocessing an explicit lifecycle. That matters because correction is often more sensitive than normal processing.
Migration Strategy
Very few enterprises get to design this cleanly from scratch. Most inherit a tangle of batch jobs, Kafka consumers, ETL workflows, warehouse procedures, and service-owned databases. So the migration strategy must be progressive. This is a strangler, not a coup.
Step 1: Identify the true system of fact
Begin by locating the rawest durable representation of the business event that can serve as replay input. Sometimes it is already in Kafka. Sometimes in an object store landing zone. Sometimes in a mainframe extract. Sometimes nowhere useful, which is your first architectural smell.
If the fact is missing, create a raw immutable capture layer before anything else.
Step 2: Separate facts from projections
Document the key derived datasets and classify them:
- rebuildable projection
- externally committed effect
- mixed concern that needs refactoring
Do not attempt universal purity. Triage. Start with the projections that are expensive to repair manually or critical to audit.
Step 3: Introduce replay-safe transformation components
Refactor transformation logic so the same business rules can run in both forward and replay modes. Keep orchestration separate, but avoid duplicating core calculation logic. Otherwise drift between “normal” and “reprocess” code paths will undermine trust.
Step 4: Add temporal semantics to reference data
This is usually the hardest migration step. Many enterprises discover that their biggest replay problem is not event retention but mutable lookup tables. If a customer segment or pricing rule table is overwritten daily, you cannot truthfully reconstruct prior outcomes.
Introduce slowly changing dimensions, bitemporal models, versioned reference snapshots, or evented reference changes where needed. Not everywhere—only where the domain requires reconstruction fidelity.
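The simplest version of "as-of" semantics is a versioned snapshot table keyed by validity date. The sketch below assumes a hypothetical `TemporalTable` with one valid-from date per version; real reference stores (SCD2 tables, bitemporal models) carry more machinery, but the lookup discipline is the same.

```python
from bisect import bisect_right
from datetime import date

class TemporalTable:
    """Versioned reference snapshots: each value carries a valid_from
    date, and lookups are answered 'as of' a point in time."""
    def __init__(self):
        self._versions = {}   # key -> sorted list of (valid_from, value)

    def put(self, key, valid_from, value):
        versions = self._versions.setdefault(key, [])
        versions.append((valid_from, value))
        versions.sort()

    def as_of(self, key, when):
        versions = self._versions.get(key, [])
        dates = [d for d, _ in versions]
        idx = bisect_right(dates, when)   # last version valid on 'when'
        if idx == 0:
            raise KeyError(f"no version of {key!r} valid on {when}")
        return versions[idx - 1][1]

rates = TemporalTable()
rates.put("risk_loading", date(2026, 2, 1), 1.05)
rates.put("risk_loading", date(2026, 3, 1), 1.10)  # later change

# Replaying a February event uses the February factor, not today's.
feb_factor = rates.as_of("risk_loading", date(2026, 2, 15))
mar_factor = rates.as_of("risk_loading", date(2026, 3, 2))
```

If the overwritten-daily lookup table described above had this shape instead, the replay lane could truthfully reconstruct what the forward lane saw at the time.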
Step 5: Build the control plane and reconciliation services
Start lightweight if needed. Even a governed job catalog with scoped parameters, approvals, and run metadata is better than shell scripts and tribal memory.
Step 6: Strangle one domain at a time
Pick a bounded context where reprocessing pain is real and business sponsorship exists. Build the new lane there. Prove faster correction, cleaner audit, and safer replay. Then expand.
That is how architectural reform survives contact with enterprise budgets.
Progressive strangler view
Notice the order. We are not starting with a grand platform. We are strangling the riskiest habits first.
Enterprise Example
Consider a global insurance carrier processing policy events across regional systems. New business, endorsements, renewals, cancellations, and claims flow from policy administration platforms into Kafka. Downstream consumers calculate premium, commissions, billing schedules, solvency reporting feeds, and broker statements.
A defect is introduced in the premium calculation service. For seven days, a new rule misclassifies a subset of commercial property policies in Germany and applies the wrong risk loading. The error affects downstream invoices, commission statements, and regulatory exposure reporting.
This is a classic enterprise reprocessing problem because the correction is not simply “rerun the job.”
What goes wrong in a weak architecture
In a weak architecture, teams might:
- reset Kafka consumer groups and replay all policy events
- rerun premium calculations using today’s reference data, not last week’s
- generate duplicate invoice events
- trigger broker notifications again
- find that some downstream systems consumed old events and some consumed new ones
- spend two weeks reconciling financial totals manually
The technology may be modern. The operating model is medieval.
What the stronger architecture does
In a stronger design:
- Domain facts exist as immutable events
PolicyBound, CoverageEndorsed, PolicyRenewed, PolicyCancelled are archived with business keys and event timestamps.
- Reference data is time-aware
The underwriting factor tables and broker hierarchy mappings are available as-of the original calculation date.
- The control plane scopes the run
Only commercial property policies in Germany affected by rule version RISK_LOAD_2026_03_01 between defined timestamps are selected.
- Replay writes to corrected projections
Premium outputs are recalculated in an isolated corrected projection table or topic.
- Side effects are suppressed
No customer letters, broker emails, or billing exports are sent during replay.
- Reconciliation compares business totals
The run verifies policy counts, premium deltas, commission impacts, and exposure totals.
- Promotion applies controlled corrections
Billing gets compensating adjustments; broker statements receive corrected deltas; regulatory feed is restated with audit references.
The important thing here is that reprocessing is aligned to the insurance domain. It knows what a corrected premium means, what downstream commitments exist, and which outputs can be replaced versus compensated.
That is the difference between a platform that can replay bytes and one that can repair business truth.
Operational Considerations
Reprocessing architecture lives or dies in operations.
Storage and retention
Kafka retention alone is rarely enough for enterprise correction windows. You usually need tiered storage or archival copies in object storage with partitioning by event date, domain, and perhaps legal entity. Compression is cheap. Missing history is expensive.
Run isolation
Large reprocessing runs can starve normal processing if they share the same clusters, topics, or sink capacity. Isolate with separate consumer groups, separate compute pools, throttling, and back-pressure policies. Correction should not create a second outage.
Observability
You need more than generic pipeline metrics.
Track:
- replay input scope and actual input count
- processing version and configuration
- temporal reference versions used
- duplicate detection outcomes
- rejection reasons
- projection differences before and after
- reconciliation status by business dimension
If an operator cannot explain what happened in a replay run within minutes, the architecture is too opaque.
Governance and approvals
In regulated industries, reprocessing can alter financial statements, customer communications, or audit artifacts. Not every replay should be a self-service button. Governance should scale with risk. A search index rebuild does not need the same approvals as a premium restatement.
Testing strategy
Replay logic must be tested with production-like historical slices, not only synthetic happy-path events. Include:
- old schema versions
- malformed payloads
- partial upstream outages
- temporal lookup changes
- duplicate and out-of-order events
- downstream sink failures mid-run
Reprocessing is where forgotten edge cases line up for revenge.
Tradeoffs
Every architecture choice here buys safety at a cost.
More storage, more metadata, more discipline
Immutable retention, versioned logic, and temporal reference data all increase platform complexity. That is real overhead. Teams who only need transient analytics may not need this machinery.
Slower development in exchange for recoverability
Designing around domain events and replay-safe components requires more careful modeling than shoving CDC into a warehouse and hoping transformations sort it out. It feels slower. In practice it is slower at first and much faster when defects happen.
Greater separation versus more moving parts
A dedicated control plane, reconciliation service, and isolated replay lane introduce more components. Purists may complain. Operations people usually prefer explicit mechanisms to implicit chaos.
Exact historical reconstruction versus pragmatic correction
Sometimes the business needs “recompute using the rules that were valid then.” Sometimes it only needs “recalculate based on the corrected rule and fix current state.” These are different goals. Full temporal fidelity is costly. Do not pay for it where a simpler correction is enough.
A mature architect asks: what kind of truth does the domain require?
Failure Modes
Reprocessing architectures fail in predictable ways. Most are self-inflicted.
1. Replaying technical noise instead of business facts
If your replay source is a swamp of low-level change events, you will reconstruct behavior poorly and inconsistently. Reprocessing should anchor on business facts wherever possible.
2. Using current reference data for historical correction
This is perhaps the most common error. The event is old; the lookup is new; the output is neither historically correct nor operationally consistent.
3. Duplicate side effects
A replay that resends payments, emails, partner messages, or tickets is not a replay. It is a new incident.
4. Hidden non-determinism
If transformations call mutable APIs, rely on wall-clock time, read random configuration, or depend on unordered data access, replay results will drift. Deterministic processing matters.
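One common remedy is to make the hidden inputs explicit: pass the clock and configuration in as parameters rather than reading them inside the transformation. The sketch below is illustrative (the `enrich` function and its fields are invented); with pinned values, a replay reproduces the original output exactly.

```python
from datetime import datetime, timezone

def enrich(event, *, now, config):
    """Deterministic variant: clock and configuration are explicit
    inputs instead of wall-clock reads and ambient config lookups."""
    return {
        "id": event["id"],
        "age_days": (now - event["created"]).days,
        "tier": config["tier"],
    }

created = datetime(2026, 3, 1, tzinfo=timezone.utc)
pinned_now = datetime(2026, 3, 8, tzinfo=timezone.utc)
pinned_config = {"tier": "standard"}

first = enrich({"id": "e1", "created": created},
               now=pinned_now, config=pinned_config)
again = enrich({"id": "e1", "created": created},
               now=pinned_now, config=pinned_config)
```

The same discipline applies to API calls and data access: capture the response or pin the version at forward-processing time, so replay consumes the recorded input rather than a moving target.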
5. Full replay as the default hammer
The ability to replay everything is seductive. It is also often unnecessary and risky. Broad replays hide root causes and magnify blast radius.
6. Reconciliation by row count only
Equal counts can mask wrong values. Especially with money, balances, and rates, business reconciliation must be explicit.
7. Separate codebases for forward and replay logic
Teams often build a one-off backfill job under pressure. It solves the incident and creates a permanent divergence. Six months later nobody trusts either path.
When Not To Use
Not every pipeline deserves a full reprocessing architecture.
Do not use this pattern when:
- the data products are low-value, disposable, or exploratory
- outputs are not used for operational or regulated decisions
- source systems can regenerate authoritative outputs more cheaply
- historical reconstruction has no business value
- the cost of temporal reference management exceeds the domain benefit
- a simple append-only warehouse backfill solves the real need
For example, if a marketing clickstream powers non-critical experimentation dashboards with a seven-day retention horizon, a heavy replay control plane and temporal dimension strategy may be overengineering. Accept that some defects are corrected forward only.
Architects should know when to stop. Not every shed needs a suspension bridge.
Related Patterns
Several adjacent patterns often get confused with reprocessing architecture. They are related, but not identical.
Event sourcing
Event sourcing provides a strong foundation when aggregates are modeled around domain events and state is rebuilt from those events. It helps, but enterprise data estates are broader than a single service’s aggregate store. Reprocessing across analytics, integrations, and shared reference data still needs orchestration and reconciliation.
CQRS
CQRS separates write models from read projections, which aligns nicely with rebuildable projections. But CQRS alone does not solve historical context, side effects, or enterprise replay governance.
Lambda and Kappa architectures
These older stream-processing patterns addressed batch and streaming convergence. They are relevant conceptually, especially for backfills and replay, but often miss the domain semantics and operational controls needed in modern enterprise estates.
Outbox pattern
The outbox pattern helps preserve reliable event publication from transactional services. It strengthens the quality of facts entering the pipeline. That is excellent groundwork for replayability.
Sagas and compensating transactions
When reprocessing invalidates downstream actions, sagas and compensations become essential. Especially in microservices, not every correction is a replace; many are a reverse-and-reissue.
Summary
Reprocessing is where data architecture stops being decorative and becomes accountable.
A pipeline that only works moving forward is not resilient. It is optimistic. Real enterprises need more. They need the ability to correct historical outcomes without corrupting facts, duplicating side effects, or launching manual reconciliation marathons.
The architecture that holds up is opinionated:
- keep business facts immutable
- model around domain events, not just technical events
- separate facts from derived projections
- preserve temporal context where the domain demands it
- run replay through a control plane, not ad hoc scripts
- make reconciliation a first-class capability
- isolate side effects
- migrate progressively with a strangler approach
Kafka and microservices fit well here, but only when used with discipline. Kafka gives you a powerful event backbone. It does not automatically give you domain meaning, historical truth, or safe correction. Those come from architectural choices.
If there is one line to remember, let it be this:
Reprocessing is not about replaying messages. It is about restoring business truth.
That is a different standard. And it is the one worth designing for.