Architecture for Reprocessing in Data Pipelines

⏱ 20 min read

Most data platforms look elegant in the first slide deck and feral in production.

On the whiteboard, events flow neatly from producers to streams to warehouses to dashboards. Boxes are clean. Arrows are straight. Everyone nods. Then reality arrives: a bug in enrichment logic, a schema change that was “backward compatible” until it wasn’t, a partner feed that silently dropped 4% of records for six days, or a regulator asking for a corrected position report based on the rules that were valid last Tuesday, not today. Suddenly the question is no longer how data moves. The real question is whether the platform can rethink the past.

That is what reprocessing architecture is about.

Reprocessing is often treated as a technical afterthought: replay some Kafka topics, rerun a batch job, backfill a table, and hope nobody notices. That mindset is expensive. Reprocessing is not merely rerunning code. It is a business capability. It sits at the fault line between domain semantics, operational resilience, and architectural honesty. If your pipeline cannot safely revisit earlier facts and derive corrected outcomes, then your data platform is not robust. It is just fast on good days.

The hard part is this: “reprocessing” means very different things depending on the domain. In retail, it may mean recalculating order margins after tax rules change. In banking, it may mean rebuilding ledger positions from immutable transaction events with the exact reference data that was valid at a point in time. In insurance, it may mean correcting claim eligibility decisions after a policy mapping defect. The business does not care whether the mechanism is replay, backfill, compaction, checkpoint reset, or temporal query. It cares whether the corrected result is accurate, auditable, and explainable.

That is where architecture earns its keep.

This article lays out a pragmatic architecture for reprocessing in data pipelines: not a theoretical ideal, but a shape that survives enterprise gravity. We will look at the forces that make reprocessing difficult, the domain-driven design choices that prevent semantic drift, the role of Kafka and microservices, a progressive strangler migration path, reconciliation as a first-class concern, and the tradeoffs and failure modes that tend to ambush teams who think replaying messages is the same as repairing a business process.

Context

Modern data pipelines rarely serve one purpose. They support operational decisions, analytics, machine learning features, regulatory reporting, customer communications, and increasingly near-real-time process automation. The same underlying event stream might feed fraud scoring, inventory reservation, customer notifications, and finance.

That multiplicity is exactly why reprocessing becomes painful.

A pipeline that only pushes data forward can ignore many inconvenient truths. It can bake logic into jobs, let schemas drift, depend on mutable reference tables, and rely on side effects hidden inside consumer services. But the moment you need to reprocess historical data, every shortcut becomes visible. Did you preserve the original event? Can you reconstruct the business context at the time? Are downstream actions idempotent? Can you distinguish “corrected output” from “duplicate output”? Which version of the transformation logic should apply: the old one, the new one, or both for comparison?

In enterprise systems, reprocessing usually appears in one of five scenarios:

  • Bug correction: a transformation or enrichment defect produced wrong outputs.
  • Late-arriving data: upstream systems delivered records after the business decision window.
  • Schema or rule evolution: logic changed and historical outputs must be recalculated.
  • Data recovery: an outage, consumer lag, or corrupted sink left gaps.
  • Regulatory or audit reconstruction: the business needs to reproduce outputs from historical facts.

Each scenario sounds similar operationally. They are not similar semantically. A replay to recover a failed Elasticsearch index is not the same thing as restating customer commission calculations. Good architecture starts by admitting that “reprocessing” is not one thing.

Problem

Most pipeline estates are built for throughput, not for correction.

Teams optimize for producer decoupling, low-latency streaming, and rapid downstream integration. Kafka topics proliferate. Microservices subscribe. Data lands in lakes, warehouses, search indexes, and caches. Some consumers are stateless. Many are not. Reference data gets read from mutable stores. Decisions become side effects. Over time, the estate becomes a machine that can only move in one direction.

Then a defect appears.

At that point, the system usually lacks at least one of these essentials:

  • A durable source of truth for original business events
  • Versioned transformation logic or at least traceability of rule versions
  • Temporal reference data to reconstruct historical context
  • Idempotent consumers so replay does not duplicate effects
  • Scoped replay mechanisms that can target affected subsets
  • Reconciliation processes to prove corrected outputs are complete and accurate

Without these, reprocessing becomes a dangerous ritual. Teams reset offsets. Backfill scripts are written under pressure. Data engineers run one-off SQL updates. Operations manually compare counts. Business users discover mismatches days later.

The architecture problem is not simply how to rerun data. It is how to make historical correction safe, bounded, explainable, and cheap enough to use before an incident becomes a crisis.

Forces

Reprocessing sits under several competing forces. This is why simplistic advice tends to fail.

1. Immutability versus business correction

Architects love immutable events because they preserve facts. Business users love corrected outputs because they reflect reality as it should have been interpreted. Both matter. The tension is that facts should remain unchanged, while derived views often must change.

A useful rule: do not mutate facts to repair interpretations. Keep the original event immutable. Create revised derived state.
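As a minimal sketch of that rule (all names here are illustrative, not a real API): the fact is an immutable record, and a correction produces a new revision of the derived view rather than an edit to the event.

```python
from dataclasses import dataclass

# Facts are append-only; corrections create a new revision of the derived
# view instead of mutating the original event.

@dataclass(frozen=True)          # frozen = the fact cannot be mutated
class OrderPlaced:
    order_id: str
    amount: float

def project_margin(events, tax_rate, revision):
    """Derive a margin view from immutable facts under a given rule revision."""
    return {
        e.order_id: {"margin": round(e.amount * (1 - tax_rate), 2),
                     "revision": revision}
        for e in events
    }

facts = [OrderPlaced("o-1", 100.0), OrderPlaced("o-2", 250.0)]

v1 = project_margin(facts, tax_rate=0.20, revision="v1")  # original, defective rule
v2 = project_margin(facts, tax_rate=0.19, revision="v2")  # corrected derived state

# The facts are untouched; only the derived view gains a new revision.
```

Both revisions remain queryable, which is exactly what an auditor will ask for.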

2. Throughput versus replayability

Optimizing for real-time performance often encourages shortcuts: in-memory joins, mutable lookups, ephemeral state, and “good enough” schemas. These choices speed the happy path and poison the replay path.

Replayability has a cost. You pay in storage, metadata, traceability, and discipline.

3. Technical events versus domain events

Many Kafka topics are not domain events at all. They are integration exhaust: customer-updated-v7, db-change-log, enrichment-output, and other technically convenient but semantically weak messages. These are poor anchors for reprocessing because they often lack business meaning.

Domain-driven design matters here. Reprocessing architecture should be built around business facts and domain invariants, not around whatever happened to be emitted by a connector.

4. Full replay versus targeted repair

A full historical replay sounds elegant. It is often reckless.

Enterprise pipelines are large, shared, and expensive. Replaying everything may overload downstream systems, regenerate already-correct outputs, and create ambiguity around what changed. The practical architecture supports selective reprocessing by time window, entity set, correlation key, defect version, or business case.

5. Accuracy versus operational risk

The more downstream actions a pipeline triggers—notifications, payouts, risk flags, partner calls—the more dangerous replay becomes. Correction may require suppressing side effects, generating compensations, or routing replay outputs down a different lane.

This is why reprocessing is not merely a data concern. It is a process concern.

Solution

The core solution is simple to say and hard to implement well:

Separate immutable business facts from derived projections, and design a controlled reprocessing lane that can rebuild projections from facts using explicit versions of logic and context.

That sentence carries the whole architecture.

A solid reprocessing design has six elements:

  1. Canonical domain event store or event backbone. Preserve business facts in immutable form. Kafka can serve as the transport and short-to-medium retention log, but for enterprise replay horizons you usually also need durable archival storage in object store, lakehouse, or event repository.

  2. Versioned processing stages. Transformations should be deployable and traceable by version. A reprocessing run must know which logic version it used.

  3. Temporal reference data. Joins to customer hierarchies, pricing, policy mappings, FX rates, or product taxonomies should be reconstructable “as of” a point in time, or explicitly version-pinned.

  4. Scoped replay orchestration. Reprocessing is initiated through a control plane, not by ad hoc offset surgery. The control plane defines scope, reason, input set, code version, and output target.

  5. Idempotent or compensating consumers. Downstream systems must either accept duplicate-safe updates or support compensations when corrections invalidate prior effects.

  6. Reconciliation and audit trail. Every reprocessing run needs measurable proof: records considered, records replayed, records corrected, records rejected, business totals before and after, and unresolved exceptions.

When teams miss one of these, they usually compensate with heroics. Heroics do not scale.
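To make the idempotency requirement concrete, here is a minimal sketch of a duplicate-safe consumer: the effect is keyed by event id, so replaying the same event cannot apply it twice. Names and structures are illustrative assumptions, not a real framework.

```python
# In production the dedup store would be a unique constraint or keyed table;
# an in-memory set is enough to show the shape.

processed = set()
balances = {}

def apply_payment(event):
    key = event["event_id"]
    if key in processed:           # replay-safe: skip already-applied events
        return False
    processed.add(key)
    acct = event["account"]
    balances[acct] = balances.get(acct, 0) + event["amount"]
    return True

evt = {"event_id": "e-42", "account": "A", "amount": 50}
apply_payment(evt)
apply_payment(evt)   # replayed duplicate: no second effect
```

The same shape works for upserts keyed by entity id plus event version.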

Architecture

The architecture that works in practice is usually a dual-lane design: a forward processing lane for normal flow and a reprocessing lane for controlled correction. They share core transformation components but differ in orchestration, isolation, and side-effect handling.

Reference architecture


This is not an event-sourcing manifesto. It is a practical split of responsibilities.

  • Kafka domain topics carry durable business events for operational integration.
  • Raw event archive stores those facts for longer replay horizons and bulk backfills.
  • Forward processing services power normal consumers and projections.
  • Reprocessing services use the same core business logic where possible, but execute under explicit controls: scoped inputs, pinned reference data, isolated outputs, and reconciliation.
  • Promotion / Merge moves validated corrected outputs into production views, often via replace, merge-upsert, version switch, or compensation.

Domain semantics first

This is where domain-driven design becomes essential rather than decorative.

If your reprocessing input is a CDC stream of row changes from ten upstream tables, you are starting from plumbing, not domain meaning. You can build a pipeline on that. You cannot build a trustworthy reprocessing architecture on it without pain.

Instead, define bounded contexts and identify the domain facts that matter:

  • OrderPlaced
  • PaymentCaptured
  • ShipmentDispatched
  • TradeBooked
  • PremiumCalculated
  • ClaimRegistered

These are semantically stable anchors. They let you ask business questions like: “Recalculate premium outputs for all policies where rating rule R17 was applied between 1 and 7 March.” That is the right level of abstraction. Reprocessing should be expressible in domain language, not only in partition and offset language.

A memorable line worth keeping: Offsets are not business scope.
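A scoped selection like the premium example can be sketched directly against an event archive. The field names are assumptions for illustration; the point is that the filter speaks rule versions and time windows, not partitions and offsets.

```python
from datetime import datetime

archive = [
    {"policy_id": "P-1", "rule": "R17", "at": datetime(2026, 3, 2)},
    {"policy_id": "P-2", "rule": "R16", "at": datetime(2026, 3, 3)},
    {"policy_id": "P-3", "rule": "R17", "at": datetime(2026, 3, 9)},
]

def replay_scope(events, rule_version, start, end):
    """Return the business-scoped input set for a reprocessing run."""
    return [e for e in events
            if e["rule"] == rule_version and start <= e["at"] < end]

scope = replay_scope(archive, "R17",
                     datetime(2026, 3, 1), datetime(2026, 3, 8))
# Only P-1 falls inside the defect window for rule R17.
```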

Reprocessing control plane

Do not let engineers trigger replay by directly manipulating consumers unless you enjoy incident reviews.

A proper control plane captures:

  • business reason for reprocessing
  • affected bounded context
  • input scope
  • replay mode: simulation, dry run, corrective run
  • code and rule version
  • reference data version or time-travel policy
  • output destination
  • side-effect policy
  • reconciliation checks
  • approval workflow for sensitive domains

This can be implemented as a workflow service, orchestration job, internal platform API, or even a governed batch framework. The technology matters less than the discipline.
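One possible shape for that run record, sketched as a data structure (field names are assumptions, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class ReprocessingRun:
    reason: str                 # business reason, not "fix stuff"
    bounded_context: str
    input_scope: dict           # e.g. entity filter plus time window
    mode: str                   # "simulation" | "dry-run" | "corrective"
    code_version: str
    reference_data_policy: str  # e.g. "as-of-event-time"
    output_target: str
    side_effect_policy: str     # e.g. "suppress-all"
    approvals: list = field(default_factory=list)

    def is_executable(self):
        # In this sketch, corrective runs require at least one approval.
        return self.mode != "corrective" or len(self.approvals) > 0

run = ReprocessingRun(
    reason="Defect: wrong risk loading applied",
    bounded_context="premium-calculation",
    input_scope={"country": "DE", "rule": "RISK_LOAD_2026_03_01"},
    mode="corrective",
    code_version="premium-svc 4.12.1",
    reference_data_policy="as-of-event-time",
    output_target="premium_corrected_v2",
    side_effect_policy="suppress-all",
)
```

Persisting these records is what turns replay from tribal memory into an audit trail.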

Reconciliation as a first-class capability

Most teams tack reconciliation on at the end. That is backward. Reprocessing without reconciliation is just a more elaborate way to create uncertainty.

Reconciliation must answer at least four questions:

  1. Did we process the intended input set?
  2. Did we produce the expected number of outputs?
  3. Do business aggregates align with source facts and prior state?
  4. Which records remain unresolved and why?

This is where counts alone are dangerous. Enterprise reconciliation needs both technical controls and business controls.

  • Technical: input count, output count, duplicates, rejects, lag, partition completeness
  • Business: monetary totals, position sums, policy counts, invoice amounts, entity-level comparisons

A pipeline can pass technical reconciliation and still be wrong in business terms. That is a common failure mode.
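A dual-control check can be sketched as follows: the technical control verifies the intended input set was fully covered, and the business control compares the corrected total against an independently recomputed expectation from source facts. All names are illustrative.

```python
def reconcile(input_ids, outputs, expected_total, tolerance=0.01):
    out_ids = {r["id"] for r in outputs}
    checks = {
        # technical: intended input set fully covered, no extras
        "coverage": set(input_ids) == out_ids,
        # business: corrected premium total matches the independently
        # recomputed expectation from source facts
        "total": abs(sum(r["premium"] for r in outputs)
                     - expected_total) <= tolerance,
    }
    checks["passed"] = all(checks.values())
    return checks

outputs = [{"id": "P-1", "premium": 110.0}, {"id": "P-2", "premium": 220.0}]
result = reconcile(["P-1", "P-2"], outputs, expected_total=330.0)
```

Note that with a wrong expected total the coverage check still passes while the business check fails: that is the count-only trap made visible.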

Side-effect isolation

One of the oldest sins in microservice estates is mixing state derivation with irreversible side effects. A consumer reads an event, updates a database, sends an email, calls a partner API, and publishes a new event. That seems efficient until replay arrives and everything happens again.

For reprocessing, separate:

  • state projections that can be rebuilt
  • business decisions that may need review
  • external side effects that should be suppressed, deduplicated, or compensated

A replay lane often writes corrected state to staging projections first. Downstream notifications or partner interactions are then emitted only after validation.
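That staging pattern can be sketched as a handler that always rebuilds state but only records intended side effects while replaying. Everything here is a hypothetical illustration; a real implementation would persist the pending effects for review.

```python
staging = {}
pending_effects = []

def send(effect):
    # Hypothetical downstream call; must never fire during replay.
    raise RuntimeError("side effect fired during replay")

def replay_handler(event, suppress_side_effects=True):
    # State derivation: always safe to rebuild
    staging[event["policy_id"]] = {"premium": event["premium"],
                                   "status": "corrected"}
    # Side effect: captured, not executed, while replaying
    effect = {"kind": "broker_statement", "policy_id": event["policy_id"]}
    if suppress_side_effects:
        pending_effects.append(effect)
    else:
        send(effect)

replay_handler({"policy_id": "P-1", "premium": 110.0})
# State is corrected in staging; the broker statement waits for validation.
```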

Reprocess flow


This pattern gives reprocessing an explicit lifecycle. That matters because correction is often more sensitive than normal processing.

Migration Strategy

Very few enterprises get to design this cleanly from scratch. Most inherit a tangle of batch jobs, Kafka consumers, ETL workflows, warehouse procedures, and service-owned databases. So the migration strategy must be progressive. This is a strangler, not a coup.

Step 1: Identify the true system of fact

Begin by locating the rawest durable representation of the business event that can serve as replay input. Sometimes it is already in Kafka. Sometimes in an object store landing zone. Sometimes in a mainframe extract. Sometimes nowhere useful, which is your first architectural smell.

If the fact is missing, create a raw immutable capture layer before anything else.

Step 2: Separate facts from projections

Document the key derived datasets and classify them:

  • rebuildable projection
  • externally committed effect
  • mixed concern that needs refactoring

Do not attempt universal purity. Triage. Start with the projections that are expensive to repair manually or critical to audit.

Step 3: Introduce replay-safe transformation components

Refactor transformation logic so the same business rules can run in both forward and replay modes. Keep orchestration separate, but avoid duplicating core calculation logic. Otherwise drift between “normal” and “reprocess” code paths will undermine trust.
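A sketch of that separation, assuming hypothetical names: one pure calculation core, two thin orchestration wrappers. The replay wrapper pins its inputs and tags its output with the run id; neither wrapper contains business logic.

```python
def calculate_premium(policy, risk_loading):
    """Pure business rule: the same code in both lanes."""
    return round(policy["base"] * (1 + risk_loading), 2)

def forward_process(policy, current_loading):
    return {"premium": calculate_premium(policy, current_loading),
            "lane": "forward"}

def replay_process(policy, pinned_loading, run_id):
    return {"premium": calculate_premium(policy, pinned_loading),
            "lane": "replay", "run_id": run_id}

policy = {"id": "P-1", "base": 1000.0}
live = forward_process(policy, current_loading=0.10)
corrected = replay_process(policy, pinned_loading=0.08, run_id="RUN-7")
```

Because both lanes call `calculate_premium`, a rule fix lands in forward and replay processing at once.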

Step 4: Add temporal semantics to reference data

This is usually the hardest migration step. Many enterprises discover that their biggest replay problem is not event retention but mutable lookup tables. If a customer segment or pricing rule table is overwritten daily, you cannot truthfully reconstruct prior outcomes.

Introduce slowly changing dimensions, bitemporal models, versioned reference snapshots, or evented reference changes where needed. Not everywhere—only where the domain requires reconstruction fidelity.
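The core mechanism is an as-of lookup over versioned snapshots, sketched here in SCD-2 style (each row valid from `valid_from` until superseded). The table and field names are assumptions for illustration.

```python
from datetime import date

fx_rates = [
    {"valid_from": date(2026, 2, 1), "eur_usd": 1.05},
    {"valid_from": date(2026, 3, 1), "eur_usd": 1.08},
]

def rate_as_of(rows, as_of):
    """Return the reference value that was valid at `as_of`."""
    valid = [r for r in rows if r["valid_from"] <= as_of]
    if not valid:
        raise LookupError("no reference data valid at that date")
    return max(valid, key=lambda r: r["valid_from"])["eur_usd"]

# Replay of a 15 February event must use the February rate, not today's.
historical = rate_as_of(fx_rates, date(2026, 2, 15))
```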

Step 5: Build the control plane and reconciliation services

Start lightweight if needed. Even a governed job catalog with scoped parameters, approvals, and run metadata is better than shell scripts and tribal memory.

Step 6: Strangle one domain at a time

Pick a bounded context where reprocessing pain is real and business sponsorship exists. Build the new lane there. Prove faster correction, cleaner audit, and safer replay. Then expand.

That is how architectural reform survives contact with enterprise budgets.

Progressive strangler view


Notice the order. We are not starting with a grand platform. We are strangling the riskiest habits first.

Enterprise Example

Consider a global insurance carrier processing policy events across regional systems. New business, endorsements, renewals, cancellations, and claims flow from policy administration platforms into Kafka. Downstream consumers calculate premium, commissions, billing schedules, solvency reporting feeds, and broker statements.

A defect is introduced in the premium calculation service. For seven days, a new rule misclassifies a subset of commercial property policies in Germany and applies the wrong risk loading. The error affects downstream invoices, commission statements, and regulatory exposure reporting.

This is a classic enterprise reprocessing problem because the correction is not simply “rerun the job.”

What goes wrong in a weak architecture

In a weak architecture, teams might:

  • reset Kafka consumer groups and replay all policy events
  • rerun premium calculations using today’s reference data, not last week’s
  • generate duplicate invoice events
  • trigger broker notifications again
  • find that some downstream systems consumed old events and some consumed new ones
  • spend two weeks reconciling financial totals manually

The technology may be modern. The operating model is medieval.

What the stronger architecture does

In a stronger design:

  1. Domain facts exist as immutable events. PolicyBound, CoverageEndorsed, PolicyRenewed, PolicyCancelled are archived with business keys and event timestamps.

  2. Reference data is time-aware. The underwriting factor tables and broker hierarchy mappings are available as-of the original calculation date.

  3. The control plane scopes the run. Only commercial property policies in Germany affected by rule version RISK_LOAD_2026_03_01 between defined timestamps are selected.

  4. Replay writes to corrected projections. Premium outputs are recalculated in an isolated corrected projection table or topic.

  5. Side effects are suppressed. No customer letters, broker emails, or billing exports are sent during replay.

  6. Reconciliation compares business totals. The run verifies policy counts, premium deltas, commission impacts, and exposure totals.

  7. Promotion applies controlled corrections. Billing gets compensating adjustments; broker statements receive corrected deltas; regulatory feed is restated with audit references.

The important thing here is that reprocessing is aligned to the insurance domain. It knows what a corrected premium means, what downstream commitments exist, and which outputs can be replaced versus compensated.

That is the difference between a platform that can replay bytes and one that can repair business truth.

Operational Considerations

Reprocessing architecture lives or dies in operations.

Storage and retention

Kafka retention alone is rarely enough for enterprise correction windows. You usually need tiered storage or archival copies in object storage with partitioning by event date, domain, and perhaps legal entity. Compression is cheap. Missing history is expensive.
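A partitioned key layout for the archive might look like the sketch below, so that a scoped replay can prune its scan to the affected domain, entity, and date range. The path convention is an assumption, not a standard.

```python
from datetime import datetime

def archive_key(domain, legal_entity, event_time, event_id):
    """Build an object-store key partitioned by domain, entity, and event date."""
    return (f"{domain}/entity={legal_entity}/"
            f"date={event_time:%Y-%m-%d}/{event_id}.json")

key = archive_key("policy-events", "DE01",
                  datetime(2026, 3, 2, 14, 30), "evt-123")
```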

Run isolation

Large reprocessing runs can starve normal processing if they share the same clusters, topics, or sink capacity. Isolate with separate consumer groups, separate compute pools, throttling, and back-pressure policies. Correction should not create a second outage.

Observability

You need more than generic pipeline metrics.

Track:

  • replay input scope and actual input count
  • processing version and configuration
  • temporal reference versions used
  • duplicate detection outcomes
  • rejection reasons
  • projection differences before and after
  • reconciliation status by business dimension

If an operator cannot explain what happened in a replay run within minutes, the architecture is too opaque.

Governance and approvals

In regulated industries, reprocessing can alter financial statements, customer communications, or audit artifacts. Not every replay should be a self-service button. Governance should scale with risk. A search index rebuild does not need the same approvals as a premium restatement.

Testing strategy

Replay logic must be tested with production-like historical slices, not only synthetic happy-path events. Include:

  • old schema versions
  • malformed payloads
  • partial upstream outages
  • temporal lookup changes
  • duplicate and out-of-order events
  • downstream sink failures mid-run

Reprocessing is where forgotten edge cases line up for revenge.

Tradeoffs

Every architecture choice here buys safety at a cost.

More storage, more metadata, more discipline

Immutable retention, versioned logic, and temporal reference data all increase platform complexity. That is real overhead. Teams who only need transient analytics may not need this machinery.

Slower development in exchange for recoverability

Designing around domain events and replay-safe components requires more careful modeling than shoving CDC into a warehouse and hoping transformations sort it out. It feels slower. In practice it is slower at first and much faster when defects happen.

Greater separation versus more moving parts

A dedicated control plane, reconciliation service, and isolated replay lane introduce more components. Purists may complain. Operations people usually prefer explicit mechanisms to implicit chaos.

Exact historical reconstruction versus pragmatic correction

Sometimes the business needs “recompute using the rules that were valid then.” Sometimes it only needs “recalculate based on the corrected rule and fix current state.” These are different goals. Full temporal fidelity is costly. Do not pay for it where a simpler correction is enough.

A mature architect asks: what kind of truth does the domain require?

Failure Modes

Reprocessing architectures fail in predictable ways. Most are self-inflicted.

1. Replaying technical noise instead of business facts

If your replay source is a swamp of low-level change events, you will reconstruct behavior poorly and inconsistently. Reprocessing should anchor on business facts wherever possible.

2. Using current reference data for historical correction

This is perhaps the most common error. The event is old; the lookup is new; the output is neither historically correct nor operationally consistent.

3. Duplicate side effects

A replay that resends payments, emails, partner messages, or tickets is not a replay. It is a new incident.

4. Hidden non-determinism

If transformations call mutable APIs, rely on wall-clock time, read random configuration, or depend on unordered data access, replay results will drift. Deterministic processing matters.
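The fix is to make everything the output depends on an explicit input. A sketch, with hypothetical names: the clock and configuration version are parameters, so identical inputs produce identical outputs on every run.

```python
from datetime import datetime

def enrich(event, rules_version, now):
    """Deterministic: everything the output depends on is a parameter."""
    return {"id": event["id"],
            "rules_version": rules_version,
            "processed_at": now.isoformat()}

evt = {"id": "e-1"}
pinned = datetime(2026, 3, 2, 12, 0, 0)

first = enrich(evt, "v7", pinned)
second = enrich(evt, "v7", pinned)   # replay with the same pinned inputs
# Identical inputs produce identical outputs, run after run.
```

The hidden-clock variant, which would call `datetime.now()` inside the function, drifts on every replay.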

5. Full replay as the default hammer

The ability to replay everything is seductive. It is also often unnecessary and risky. Broad replays hide root causes and magnify blast radius.

6. Reconciliation by row count only

Equal counts can mask wrong values. Especially with money, balances, and rates, business reconciliation must be explicit.

7. Separate codebases for forward and replay logic

Teams often build a one-off backfill job under pressure. It solves the incident and creates a permanent divergence. Six months later nobody trusts either path.

When Not To Use

Not every pipeline deserves a full reprocessing architecture.

Do not use this pattern when:

  • the data products are low-value, disposable, or exploratory
  • outputs are not used for operational or regulated decisions
  • source systems can regenerate authoritative outputs more cheaply
  • historical reconstruction has no business value
  • the cost of temporal reference management exceeds the domain benefit
  • a simple append-only warehouse backfill solves the real need

For example, if a marketing clickstream powers non-critical experimentation dashboards with a seven-day retention horizon, a heavy replay control plane and temporal dimension strategy may be overengineering. Accept that some defects are corrected forward only.

Architects should know when to stop. Not every shed needs a suspension bridge.

Several adjacent patterns often get confused with reprocessing architecture. They are related, but not identical.

Event sourcing

Event sourcing provides a strong foundation when aggregates are modeled around domain events and state is rebuilt from those events. It helps, but enterprise data estates are broader than a single service’s aggregate store. Reprocessing across analytics, integrations, and shared reference data still needs orchestration and reconciliation.

CQRS

CQRS separates write models from read projections, which aligns nicely with rebuildable projections. But CQRS alone does not solve historical context, side effects, or enterprise replay governance.

Lambda and Kappa architectures

These older stream-processing patterns addressed batch and streaming convergence. They are relevant conceptually, especially for backfills and replay, but often miss the domain semantics and operational controls needed in modern enterprise estates.

Outbox pattern

The outbox pattern helps preserve reliable event publication from transactional services. It strengthens the quality of facts entering the pipeline. That is excellent groundwork for replayability.

Sagas and compensating transactions

When reprocessing invalidates downstream actions, sagas and compensations become essential. Especially in microservices, not every correction is a replace; many are a reverse-and-reissue.

Summary

Reprocessing is where data architecture stops being decorative and becomes accountable.

A pipeline that only works moving forward is not resilient. It is optimistic. Real enterprises need more. They need the ability to correct historical outcomes without corrupting facts, duplicating side effects, or launching manual reconciliation marathons.

The architecture that holds up is opinionated:

  • keep business facts immutable
  • model around domain events, not just technical events
  • separate facts from derived projections
  • preserve temporal context where the domain demands it
  • run replay through a control plane, not ad hoc scripts
  • make reconciliation a first-class capability
  • isolate side effects
  • migrate progressively with a strangler approach

Kafka and microservices fit well here, but only when used with discipline. Kafka gives you a powerful event backbone. It does not automatically give you domain meaning, historical truth, or safe correction. Those come from architectural choices.

If there is one line to remember, let it be this:

Reprocessing is not about replaying messages. It is about restoring business truth.

That is a different standard. And it is the one worth designing for.
