Most distributed systems do not fail in dramatic ways. They drift. A field gets derived differently in one service than another. A projection lags. A downstream team changes a rule and now last year’s facts need to be interpreted through today’s policy. Then someone says the sentence that always sounds simpler than it is: “Can’t we just replay the events?”
That sentence is the event-sourced equivalent of “we’ll just migrate the database over the weekend.” It hides the hard part in a pleasant verb.
Replay is not one thing. It is several very different architectural moves wearing the same badge. Sometimes it means rebuilding a read model from the beginning of time. Sometimes it means backfilling a new bounded context from an old stream. Sometimes it means correcting a broken projection, re-evaluating domain logic after a policy change, or reconciling Kafka topics after message loss. Each of those has different risks, different semantics, and different operational blast radii.
This is where teams go wrong. They treat replay as a technical capability, when in practice it is a domain decision wrapped in infrastructure. The log may be immutable, but the meaning of events is not. A replay of OrderPlaced from 2019 into a pricing model from 2026 is not a neutral act. It is interpretation. And interpretation belongs in the domain.
So this article takes a hard line: event replay is architecture, not plumbing. If you run event-sourced systems at enterprise scale, you need explicit replay strategies, not a hopeful script and a maintenance window.
Context
Event sourcing gives you a powerful idea: store the sequence of facts that happened to an aggregate, and derive current state from those facts. It replaces “what is true now?” with “what happened, in what order, and what does that imply?” This is a natural fit for domains where history matters: payments, trading, insurance claims, orders, loyalty programs, inventory reservations, healthcare workflows, and any place where auditability is not a feature but a survival requirement.
In practice, event-sourced systems rarely live alone. They sit inside a landscape of microservices, Kafka topics, operational data stores, search indexes, caches, BI pipelines, data lakes, and old systems that should have retired years ago but still fund the company. The event stream becomes both the backbone of behavior and the fuel for a broad integration topology.
That is when replay stops being a developer convenience and becomes an enterprise concern.
A read projection fails and needs rebuilding. A new fraud service must be bootstrapped from historical card authorization events. A compliance rule changes and claim eligibility must be recalculated. A downstream Kafka consumer group lost offsets after a deployment and now has partial materialization. A team wants to strangle a legacy order management platform by feeding a new bounded context from old business events. Every one of these requires replay, but not the same replay.
The architectural mistake is to collapse them into one pattern.
Problem
The core problem is straightforward to describe and difficult to solve well:
How do we reprocess historical events to rebuild, repair, migrate, or reinterpret system state without violating domain semantics, operational stability, or trust in the resulting data?
That problem has several dimensions.
First, event streams are temporal. The order of events matters, causation often matters, and gaps or duplication matter. A replay that ignores timeline semantics is not a replay. It is fiction.
Second, events encode business meaning at the time they were produced. Domain events are not raw data dumps. PolicyIssued, InvoiceCancelled, ReservationExpired, PaymentCaptured — these names carry business intent. If event schemas evolve, the meaning can stretch, narrow, or split. Replaying old events through new handlers is therefore a semantic act, not just a computational one.
Third, enterprise estates are heterogeneous. Some services are event-sourced. Others are CRUD systems emitting integration events. Some consume Kafka with at-least-once delivery. Some maintain denormalized PostgreSQL read models. Some write to Elasticsearch. Some expose APIs used by humans who do not care that your projection is currently rebuilding.
Finally, the scale hurts. Replaying ten thousand events in a local environment is a test. Replaying twelve billion events across regions while preserving service health is architecture.
Forces
A good replay strategy lives under pressure from competing forces.
1. Domain correctness versus operational speed
The business wants a repaired projection now. But “now” often encourages shortcuts: skipping old event versions, disabling invariants, parallelizing streams that should be ordered, or replaying integration events where domain events are required. You can get a quick answer and still get the wrong answer.
2. Historical truth versus current policy
Some replays should reconstruct what the system would have known then. Others should apply today’s rules to yesterday’s facts. Those are not equivalent.
For example, an insurer recalculating claim reserves after a regulatory rule change is not rebuilding the original state. It is running historical events through new policy logic. That is a legitimate business process, but it must be named as such. Call it recomputation or re-evaluation, not simple replay.
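The distinction can be made concrete in code. Here is a minimal sketch, with hypothetical event and policy names, showing the difference between reconstructing what the system would have concluded at the time and re-evaluating the same fact under today's rules:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ClaimFiled:
    claim_id: str
    amount: float
    occurred_on: date

# Hypothetical reserve policies: the rule in force historically vs. today's rule.
POLICIES = {
    "reserve-2019": lambda e: e.amount * 0.10,
    "reserve-2026": lambda e: e.amount * 0.15,
}

def replay_as_of(event: ClaimFiled, policy_version: str) -> float:
    """Reconstruct what the system would have known then."""
    return POLICIES[policy_version](event)

def recompute_under_current(event: ClaimFiled) -> float:
    """Re-evaluate the same historical fact under today's policy."""
    return POLICIES["reserve-2026"](event)
```

The two functions consume the same event and disagree on purpose. Naming that disagreement (replay versus recomputation) is exactly the domain decision the text argues for.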
3. Ordering guarantees versus throughput
Kafka partitions help with scale, but they also define your ordering boundaries. Within a partition you may have order; across partitions you generally do not. Aggregate-level replay can often parallelize safely. Cross-aggregate process managers often cannot. Throughput is tempting. Broken causality is expensive.
4. Isolation versus cost
A replay can run in-place against production infrastructure, against shadow stores, or in a separate environment. More isolation means less production risk but more infrastructure, more data movement, and more reconciliation work.
5. Evolution versus compatibility
Long-lived event stores accumulate schema versions, semantic changes, and regrettable names. Some old events need upcasting. Some should be left untouched and handled through version-aware consumers. Some should never be replayed into certain new contexts because they reflect obsolete process structures.
6. Auditability versus convenience
In regulated domains, you need to explain not only the final rebuilt state but how it was obtained. “We reran the stream with a script” does not satisfy auditors, operations, or sensible architects.
Solution
The most reliable approach is to treat replay as a set of explicit strategies, each aligned to a clear domain purpose.
I usually group replay into five categories:
- Projection rebuild replay
Reconstruct read models from a canonical event store.
Scope: views, search indexes, reporting tables.
Goal: deterministic regeneration of derived state.
- Selective repair replay
Reprocess a bounded slice of events for one projection, tenant, aggregate range, or time window.
Goal: targeted correction with limited blast radius.
- Bootstrap replay
Feed a new service, bounded context, or data product from historical streams.
Goal: initialize a new capability without dual-writing from day one.
- Semantic recomputation replay
Re-evaluate historical facts under new rules.
Goal: produce a new interpretation, not recreate the original one.
- Migration replay
Use historical events to progressively strangle legacy systems and move responsibilities into new services.
Goal: controlled transition, often with reconciliation and coexistence.
That categorization matters because it drives architecture. Projection rebuilds want determinism and repeatability. Semantic recomputation wants versioned policy engines. Migration replay wants anti-corruption layers, idempotent consumers, and long reconciliation windows. Selective repair wants surgical filters and strong observability.
The principle underneath all of them is simple:
Replay should be intentional, version-aware, and externally observable.
Not hidden in a consumer restart. Not mixed with live traffic without controls. Not performed without a reconciliation story.
Architecture
At architecture level, a replay-capable event-sourced platform usually has four distinct concerns:
- Canonical event history
- Replay orchestration
- Version and semantic adaptation
- Target materialization and reconciliation
That separation is worth defending. If your Kafka topic is both your live integration bus and your only historical replay source, you may get convenience, but you also inherit retention limits, compaction concerns, partitioning constraints, and operational coupling that make serious replay awkward. In many enterprises, the durable event store and the streaming fabric are related but not identical.
A common pattern is:
- Aggregate events are written to an event store or append-only event log.
- Domain events are published to Kafka for downstream integration.
- Read models consume from either the event store replay API or Kafka, depending on consistency and retention needs.
- A replay orchestrator coordinates batches, checkpoints, throttling, backpressure, and target cutover.
- Reconciliation services compare rebuilt state with live state and surface drift.
Here is the conceptual shape.
Canonical event history
Your replay source must preserve business facts at the right granularity. That means domain events, not arbitrary technical messages. If what you retained was only integration events tailored for external consumers, replay will be compromised. Integration events are often flattened, redacted, enriched, or denormalized for transport. They are useful, but they are not always sufficient to rebuild domain state.
This is where domain-driven design earns its keep. Inside a bounded context, the event model must reflect the ubiquitous language of that domain. Replaying ShipmentDispatched means something. Replaying status=4 does not.
Replay orchestration
A replay orchestrator is not glamorous, but it is where adults live. It manages:
- replay scope
- source offsets or event positions
- target checkpoints
- idempotency keys
- concurrency limits
- backpressure
- pause/resume
- failure recovery
- cutover controls
- audit trails
Without orchestration, replay becomes a shell script with optimism in it.
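The responsibilities above can be sketched in a few dozen lines. This is an illustrative skeleton, not a production orchestrator: the source is a plain list standing in for an event store read API, and the checkpoint would be durable in real life.

```python
from dataclasses import dataclass, field

@dataclass
class ReplayCheckpoint:
    last_position: int = -1      # last successfully applied event position
    target_version: str = "v2"   # which projection version this replay builds
    paused: bool = False

@dataclass
class ReplayOrchestrator:
    """Minimal sketch: batch through a source, checkpoint after each batch,
    and resume from the checkpoint after a pause or failure."""
    source: list                 # stand-in for an event store replay API
    apply: callable              # target projection handler
    checkpoint: ReplayCheckpoint = field(default_factory=ReplayCheckpoint)
    batch_size: int = 100

    def run(self) -> None:
        while not self.checkpoint.paused:
            start = self.checkpoint.last_position + 1
            batch = self.source[start:start + self.batch_size]
            if not batch:
                return           # caught up to the head of history
            for position, event in enumerate(batch, start=start):
                self.apply(event)
                self.checkpoint.last_position = position
```

Because progress lives in an explicit checkpoint rather than inside a consumer group, the same structure supports pause/resume, audit, and restart from a known position.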
Version and semantic adaptation
Events evolve. Fields are added, meanings split, invariants change. You need a strategy for adapting historical events during replay.
There are two broad techniques:
- Upcasting: transform old event versions into the shape expected by current consumers.
- Version-aware handlers: consumers explicitly handle multiple versions.
Upcasting centralizes compatibility but can hide semantic drift if abused. Version-aware handlers are explicit but create branching logic. The right choice depends on how deep the meaning has changed. Structural tweaks are good upcasting candidates. Real semantic changes usually deserve explicit handling.
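Both techniques can be shown side by side on a hypothetical event whose v1 payload stored a single `name` field that v2 splits in two:

```python
def upcast_customer_registered(event: dict) -> dict:
    """Upcasting: reshape a v1 event into the v2 schema before handling."""
    if event.get("version", 1) == 1:
        first, _, last = event["name"].partition(" ")
        return {"version": 2, "first_name": first, "last_name": last}
    return event

def handle_customer_registered(event: dict) -> str:
    """Version-aware handler: branches on version explicitly instead."""
    if event.get("version", 1) == 1:
        return event["name"]
    return f"{event['first_name']} {event['last_name']}"
```

The upcaster keeps every consumer on one schema but quietly asserts that splitting a name is purely structural. The version-aware handler makes the old meaning visible at every use site. That visibility is the point when the change is semantic rather than structural.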
Target materialization and reconciliation
Rebuilding a read model is easy to describe: consume all events, write the view. The hard part is trusting the result. That is why replay architectures need reconciliation by design.
Reconciliation compares expected and actual outcomes. It may operate at several levels:
- aggregate counts
- event position parity
- checksums or hashes per partition or tenant
- domain invariants
- sampled record-level comparisons
- financial totals
- business KPI drift
This is especially important in strangler migrations, where old and new systems coexist for months.
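A layered comparison can be sketched simply: check cheap signals first (counts), then stronger ones (content checksums per partition or tenant). The checksum here is order-insensitive, under the assumption that row ordering may legitimately differ between the live and rebuilt stores.

```python
import hashlib

def partition_checksum(rows) -> str:
    """Order-insensitive checksum over record content for one partition."""
    digest = hashlib.sha256()
    for row in sorted(map(repr, rows)):
        digest.update(row.encode())
    return digest.hexdigest()

def reconcile(live, rebuilt) -> dict:
    """Compare at several levels: counts first, then content checksums."""
    report = {
        "count_match": len(live) == len(rebuilt),
        "checksum_match": partition_checksum(live) == partition_checksum(rebuilt),
    }
    report["clean"] = all(report.values())
    return report
```

Note that matching counts with mismatched checksums is the interesting case: it is exactly the drift that row-count comparisons alone would declare a success.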
Replay timeline patterns
The timeline is not decoration. It is the whole game.
A replay usually moves through four temporal zones:
- historical catch-up
- near-live catch-up
- reconciliation window
- cutover to live processing
That is the safest shape for enterprise use because it avoids the fantasy that you can instantly replace one materialization with another.
Notice the uncomfortable but necessary middle: the reconciliation window. Teams want to skip it because it delays the launch. Then they spend six weeks explaining unexplained differences to finance or operations. Reconciliation is cheaper.
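The four zones can be modeled as a gated phase progression, where each transition has an explicit condition and cutover is unreachable without passing reconciliation. The thresholds below are illustrative assumptions, not recommendations:

```python
PHASES = ["historical_catch_up", "near_live_catch_up", "reconciliation", "cutover"]

def next_phase(current: str, *, lag_seconds: float, drift_rate: float,
               max_lag: float = 5.0, max_drift: float = 0.001) -> str:
    """Advance only when the gate for the current phase is satisfied."""
    if current == "historical_catch_up" and lag_seconds <= max_lag * 60:
        return "near_live_catch_up"
    if current == "near_live_catch_up" and lag_seconds <= max_lag:
        return "reconciliation"
    if current == "reconciliation" and drift_rate <= max_drift:
        return "cutover"   # only reachable once drift is within tolerance
    return current
```

Encoding the gates makes the uncomfortable middle non-optional: a replay stuck in `reconciliation` with high drift simply stays there until the drift is explained or fixed.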
Migration Strategy
Replay becomes especially important in migration, because migration is really an argument about time. The old system knows the past. The new system wants the future. Replay is how you negotiate custody.
The best migration pattern here is usually progressive strangler migration, not big-bang replacement.
Step 1: Find the bounded context seam
Do not start by asking, “How do we move all events?” Start by asking, “Which domain capability are we extracting?” A new returns service, claims adjudication engine, payment authorization boundary, or customer preference context is a bounded context decision before it is an event pipeline decision.
If the seam is wrong, replay just moves confusion faster.
Step 2: Build an anti-corruption layer
Legacy systems rarely emit domain-quality events. They emit codes, table changes, and side effects disguised as facts. An anti-corruption layer translates legacy changes into meaningful events for the new bounded context.
This is not ceremony. It protects the language of the new model.
Step 3: Bootstrap with historical replay
Once the new context has a coherent event model, bootstrap it from the legacy history or derived event stream. This creates initial state and lets the new model begin learning from real business history.
Step 4: Run dual processing with reconciliation
For a period, the legacy and new services both process relevant transactions. Differences are measured, classified, and resolved. Some differences are bugs. Some reveal hidden legacy rules. Some reveal that two departments thought the same word meant different things. This is why replay belongs near DDD: semantics surface under pressure.
Step 5: Cut over gradually
Route selected commands or tenants to the new service first. Continue replay or event tailing for anything still mastered by the old system. Increase responsibility in slices. Kill dual writes wherever possible; prefer one source of truth plus event propagation.
Here is the migration shape.
Why progressive strangler works
Because migrations fail at the semantic edges, not the interface edges. A big-bang migration assumes complete understanding up front. Enterprises almost never have that luxury. Replay plus strangler migration creates room for discovery.
Reconciliation in migration
You need explicit reconciliation dimensions. For example:
- order totals by day and currency
- claim status distributions
- shipment state counts
- unpaid invoice balances
- customer-level entitlement outcomes
- exception queues and manual work rates
Reconciliation should not just compare records. It should compare business meaning.
Enterprise Example
Consider a global insurer modernizing its claims platform.
The legacy system is a mainframe-backed claims administration suite. It stores claim state in large relational tables and emits nightly extracts plus some near-real-time Kafka integration messages. The firm wants to carve out a new Claims Assessment bounded context to support digital triage, fraud scoring, and dynamic reserve calculation.
At first glance, this looks like a straightforward event-streaming problem. It is not.
The legacy messages include things like “claim status changed,” “payment posted,” and “reserve adjusted,” but these are not sufficient domain events for the new context. The new team needs richer semantics: ClaimSubmitted, DocumentationRequested, MedicalReviewCompleted, LiabilityAccepted, ReservePolicyApplied, AssessmentEscalated.
So the architects introduce an anti-corruption layer that reads legacy transactions, extracts process history, and emits canonical claim-domain events into a durable event stream. Historical extracts are transformed into event sequences per claim. New live changes are translated as they occur and published through Kafka for downstream consumers.
Then they use replay in three ways:
- Bootstrap replay builds the new claims assessment state from seven years of historical claim events.
- Projection replay constructs read models for adjuster dashboards and fraud work queues.
- Semantic recomputation recalculates reserve recommendations using updated risk models and policy logic.
The first cut fails in a very normal enterprise way.
The replay technically succeeds, but reconciliation shows 4.2% of open claims have reserve recommendations different from the legacy platform. At first, leadership assumes bugs in the new engine. After investigation, three categories emerge:
- genuine defects in event translation
- missing legacy business rules embedded in a COBOL batch process
- intentional differences where the new reserve policy is supposed to produce better outcomes
This is exactly why replay should never be treated as “just rerun history.” It is where hidden rules crawl out of the walls.
The migration proceeds by region. APAC claims are cut over first because product complexity is lower. EMEA follows after new regulatory interpretations are encoded. North America goes last because workers’ compensation has ugly historical edge cases and heavy manual workflows. During each phase, Kafka topics feed both old and new consumers, but the new bounded context is sourced from canonical claim events, not directly from the legacy topic. That separation proves critical when retention on one Kafka cluster is found to be too short for a full rebootstrap after an environment incident.
By the end, the new claims assessment service owns its aggregate history, exposes event-sourced audit trails to compliance, and supports on-demand replay for dashboard rebuilds and targeted recalculation campaigns. The mainframe is still around, but one more expensive domain has been strangled with dignity rather than drama.
Operational Considerations
This is where architecture becomes real.
Idempotency
A replayed event may be delivered more than once. A cutover may overlap with live processing. A failed batch may be retried. If targets are not idempotent, replay is a corruption engine.
Use deterministic event identifiers, target-side deduplication, and version-aware updates. For projections, many teams store the last applied event position per aggregate or partition.
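That last technique is small enough to show in full. This sketch keeps the last applied event position per aggregate and rejects anything at or below it, so redelivery and overlapping replays cannot corrupt the view:

```python
class IdempotentProjection:
    """Sketch: track the last applied event position per aggregate and
    skip anything at or below it, making redelivery harmless."""

    def __init__(self):
        self.state = {}          # aggregate_id -> projected value
        self.last_applied = {}   # aggregate_id -> last applied event position

    def apply(self, aggregate_id: str, position: int, delta: int) -> bool:
        if position <= self.last_applied.get(aggregate_id, -1):
            return False         # duplicate or already-replayed event: ignore
        self.state[aggregate_id] = self.state.get(aggregate_id, 0) + delta
        self.last_applied[aggregate_id] = position
        return True
```

In a real projection the position and state would be updated in one transaction against the target store; that atomicity is what turns the check into a guarantee.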
Checkpointing
Checkpoints should be explicit and queryable. You need to know:
- where replay started
- where it is now
- what target version it is building
- whether it can resume safely
- whether it has crossed any cutover thresholds
Do not hide this in consumer offsets alone. Offsets tell part of the story, not the whole one.
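A queryable checkpoint can be as plain as a record carrying exactly the fields listed above. The field names and the lag-based cutover rule here are illustrative assumptions:

```python
from dataclasses import dataclass, asdict

@dataclass
class ReplayStatus:
    """An explicit, queryable checkpoint record, not just a consumer offset."""
    started_at_position: int
    current_position: int
    target_version: str
    resumable: bool
    cutover_threshold_crossed: bool

    @classmethod
    def from_progress(cls, start: int, current: int, head: int,
                      target_version: str, cutover_lag: int = 100):
        # Cutover threshold: close enough to the live head to switch over.
        return cls(
            started_at_position=start,
            current_position=current,
            target_version=target_version,
            resumable=True,
            cutover_threshold_crossed=(head - current) <= cutover_lag,
        )

    def to_report(self) -> dict:
        return asdict(self)
```

Operators can query this record directly, which is exactly the difference between "the consumer group is at offset N" and "replay v2 started at 0, is at 950 of 1000, and is ready to cut over."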
Backpressure and throttling
A replay can starve live workloads. Database write amplification, cache churn, Elasticsearch segment pressure, and Kafka broker saturation are all common. Replays need throttle controls and resource isolation.
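A common throttle shape is a token bucket in front of the replay writer. This is a sketch, not tuned guidance; the injectable clock exists only to make the behavior testable:

```python
import time

class ReplayThrottle:
    """Token-bucket sketch: cap replay write rate so live traffic keeps headroom."""

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec   # sustained events per second
        self.burst = burst         # maximum short-term burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False               # caller should back off before retrying
```

Making the limit explicit also gives operators a single knob to turn down when a replay starts starving a shared database or saturating brokers.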
Snapshot interaction
Snapshots can accelerate aggregate rehydration, but they can also embed old assumptions. If snapshot formats evolve or derive from flawed logic, rebuilding from snapshots may preserve errors. For high-trust replays, many teams prefer periodic snapshot use for speed but maintain the option for full from-origin rebuilds.
Security and data governance
Historical events often outlive original privacy expectations. Replaying them into new systems may violate data minimization rules unless masking, field-level access control, or tokenization is in place. This matters especially when bootstrapping analytics or AI-adjacent services.
Observability
You need replay-specific telemetry:
- events processed per second
- lag to live head
- per-partition/tenant progress
- dead-letter counts
- version adaptation counts
- reconciliation drift metrics
- target write failures
- handler latency distributions
If your monitoring cannot distinguish live consumption from replay consumption, your operators are flying blind.
Tradeoffs
Replay is powerful because it gives you a second chance. That is also its danger. Teams overuse it.
Benefits
- deterministic read model rebuilds
- easier recovery from projection loss
- safer bounded context bootstrap
- stronger auditability
- support for policy re-evaluation
- cleaner migrations through historical continuity
Costs
- substantial infrastructure and storage
- semantic versioning complexity
- long-running operational jobs
- difficult reconciliation
- pressure on downstream systems
- temptation to use historical events beyond their intended meaning
Strategic tradeoff
The big strategic tradeoff is between preserving raw history and preserving usable meaning. Storing every event forever sounds wise. But if events are poorly named, semantically thin, or excessively tied to old process assumptions, long retention alone does not buy safe replay.
Good event models age better than large logs.
Failure Modes
Replay systems fail in very predictable ways. Most are self-inflicted.
1. Replaying integration events as if they were domain events
This is probably the most common architectural sin. Integration events are often lossy. Rebuilding state from them can produce gaps, duplicate derivations, or wrong invariants.
2. Ignoring semantic drift
An old event version may technically deserialize and still mean something different now. Structural compatibility is not semantic compatibility.
3. Parallelizing beyond ordering boundaries
If process correctness depends on causal order and the replay engine fans out indiscriminately, you get subtle corruption that looks like randomness.
4. Mixing replay traffic with live side effects
A replay should not accidentally send emails, re-trigger payments, invoke external APIs, or reopen cases. Side effects must be isolated or disabled. Historical fact processing is not command execution.
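One simple isolation mechanism is to gate every external side effect behind a replay flag. This is a minimal sketch with a hypothetical gateway name; real systems often use a dedicated replay context or separate no-op adapters instead:

```python
class NotificationGateway:
    """Sketch: gate external side effects behind a replay flag, so replayed
    history updates state without re-sending emails or re-invoking APIs."""

    def __init__(self, replay_mode: bool):
        self.replay_mode = replay_mode
        self.sent = []   # stand-in for an outbound email channel

    def send_email(self, to: str, body: str) -> bool:
        if self.replay_mode:
            return False  # fact processing only: suppress the side effect
        self.sent.append((to, body))
        return True
```

The important property is that the suppression lives at the boundary, not inside each handler, so no individual handler can forget it during a replay.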
5. Weak reconciliation
Teams often compare row counts and declare success. Then finance finds that totals differ, or customer support finds entitlement mismatches. Row counts are comfort, not proof.
6. Kafka retention assumptions
Many teams assume Kafka is their replay history. Then they discover retention is seven days, topics are compacted, or schemas changed without a true archival strategy. Kafka is excellent for streaming. It is not automatically your forever event store.
7. Snapshot trust without validation
If snapshots were built with buggy handlers, replaying from snapshots simply replays your mistake faster.
When Not To Use
Not every system needs replay-friendly event sourcing. This is where architects need restraint.
Do not reach for heavy replay machinery when:
- the domain has little historical or audit value
- state can be fully regenerated from a small authoritative relational model
- business semantics are simple CRUD with minimal temporal logic
- data volumes are modest and point-in-time backups solve the recovery problem
- teams lack the discipline to maintain event contracts over time
- downstream consumers mainly need current-state replication, not historical causality
A simple outbox pattern plus current-state tables is often enough for many operational systems. Event sourcing with rich replay capability is justified when time, history, and explanation are first-class domain concerns.
Likewise, do not use semantic recomputation casually in domains where historical decisions must remain legally tied to the rules in force at the time. In those cases, recomputation may be useful for analysis, but not for authoritative replacement.
Related Patterns
Replay sits in a family of adjacent patterns.
Snapshotting
Useful for performance, especially aggregate rehydration. But snapshots are an optimization, not a substitute for trustworthy event history.
CQRS
Replay often targets query-side models in a CQRS architecture. This is the safest replay use case because projections are derived and replaceable.
Outbox pattern
If your system is not fully event-sourced, the outbox pattern helps publish reliable change events. It supports migration and integration, but it does not magically create replayable domain history.
Anti-corruption layer
Essential in progressive strangler migrations. It preserves domain language when pulling facts out of legacy systems.
Process manager / saga
Be careful replaying across sagas. Historical event reprocessing can accidentally re-trigger orchestration side effects unless compensating controls are in place.
Temporal tables and CDC
In some enterprises, temporal tables and change data capture are used to synthesize event-like histories. This can be practical for migration, but the resulting stream is often weaker in semantics than true domain events. Useful, yes. Equivalent, no.
Summary
Replay in event-sourced systems is one of those ideas that sounds mechanical until you do it for real. Then you discover it is really about meaning, time, and trust.
The event log is not a magic tape recorder. It is a record of domain facts interpreted through bounded contexts, schemas, policies, and operational constraints. Replaying it can rebuild a projection, bootstrap a service, support a strangler migration, or recalculate outcomes under new rules. But those are different acts with different architectural consequences.
The strongest replay strategies share a few traits:
- they start from domain semantics, not infrastructure convenience
- they distinguish projection rebuild from semantic recomputation
- they use progressive strangler migration instead of big-bang rewrites
- they reconcile explicitly before cutover
- they respect ordering boundaries
- they isolate side effects
- they treat Kafka as part of the platform, not the whole memory of the business
If there is one memorable rule here, it is this:
History is easy to store. Meaning is hard to replay.
Architect for the second, not just the first.