Streaming Architecture Without Backfill Is Incomplete

There is a seductively simple story that shows up in too many architecture decks: we will move to streaming, publish domain events, and everything downstream will react in real time. The boxes are clean, the arrows are directional, and the future appears to arrive on a Kafka topic.

Then reality walks in wearing muddy boots.

A new consumer comes online and asks for the last three years of customer lifecycle events. A risk engine changes its scoring model and needs to recompute state for every open account. A data product misses messages during an outage. A regulator asks you to explain how a number was derived on a given day. A microservice team deploys a bug, corrupts projections, and now wants to rebuild them from source-of-truth history. Suddenly the shining streaming platform looks less like a nervous system and more like a radio broadcast: useful if you were listening live, useless if you joined late.

That is the central truth: a streaming architecture without backfill is incomplete. It may be fast. It may even be elegant. But it is not operationally whole.

Backfill, replay, rehydration, catch-up, historical reprocessing—different organizations use different words. The important point is the same. If your architecture can only move forward from “now,” then it cannot support the messy, very normal behavior of enterprises: changing logic, delayed onboarding, broken consumers, audit reconstruction, migration, and reconciliation. Real businesses do not run in a perfect present tense.

The mature design question is not “should we stream?” It is “what does time mean in this domain, and how do we support both live flow and historical correction?” That is a domain-driven design question as much as an infrastructure one. And that is where many teams go wrong. They obsess over brokers, partitions, and retention settings, while skipping the semantics of what is being replayed, what can be recomputed, and what must never be emitted twice.

A replay diagram belongs in the architecture because replay is not an afterthought. It is a first-class capability.

Context

Streaming architectures have become the default aspiration for modern enterprise systems. The reasons are good ones: decoupling, near-real-time integration, scalable event distribution, and a clean model for reacting to business changes across bounded contexts. Kafka, Pulsar, Kinesis, and similar platforms have made it practical to route domain events from one service to many consumers without creating a nest of point-to-point integrations.

This pattern works especially well in organizations moving away from batch-heavy integration estates. Instead of nightly jobs moving CSV files between applications, business changes are exposed as streams: OrderPlaced, PaymentCaptured, ClaimOpened, PolicyCancelled, CustomerAddressChanged. Downstream services subscribe and build their own projections, automate decisions, or feed analytics pipelines.

But streaming is often sold with an unspoken assumption: all interested consumers exist at the moment the event is published, remain healthy forever, and interpret the event correctly the first time.

That assumption belongs in fantasy architecture, not enterprise architecture.

Enterprises live with mergers, product launches, regulatory changes, data repair, model recalculation, legacy system decomposition, and platform migrations. They add new consumers after years of history already exist. They discover bugs in event mappings. They move from CRUD integration to event-driven collaboration incrementally, not in one clean rewrite. History matters because the business has memory.

This is why event-driven architecture and domain-driven design need to be connected. Events are not packets on a wire. They are statements about business facts within a bounded context. Their meaning, timing, and reconstructability are part of the design. Once you frame it this way, replay is no longer just a broker feature. It becomes an architectural capability rooted in domain semantics.

Problem

The problem starts when teams equate streaming with state propagation. They publish the latest event and assume downstream systems can infer everything they need. For a while, that works.

Then the first hard question appears: “How does a new consumer obtain historical state?”

If the answer is “from today onward,” the consumer is crippled. If the answer is “from the source database,” then the architecture quietly admits the stream is not sufficient. If the answer is “we will do a one-time extract,” then replay is being handled manually, which means badly.

A deeper problem follows. Not all historical reconstruction is the same:

  • Backfill may mean loading past business facts into a new stream.
  • Replay may mean re-reading retained events to rebuild derived state.
  • Reconciliation may mean comparing projected state against a source-of-truth and repairing drift.
  • Reprocessing may mean applying new business logic to old facts.
  • Rehydration may mean rebuilding an aggregate or read model from an event history.

These are related but different. Teams that collapse them into one vague capability usually build something dangerous: either a blunt-force “replay everything” button or a brittle collection of ad hoc scripts.

The result is predictable:

  • duplicate side effects
  • broken ordering assumptions
  • polluted analytics
  • unbounded reprocessing jobs
  • production topics reused unsafely
  • support teams unable to explain what happened

In short, they build a system that can emit events but cannot live with time.

Forces

Several forces pull the architecture in different directions.

1. Real-time responsiveness versus historical completeness

The business wants fresh data now. Operations wants recoverability later. These goals are aligned in principle but often opposed in implementation. Fast pipelines prefer minimal overhead; replayable systems need durable history, stable contracts, and metadata that allows safe recomputation.

2. Domain truth versus integration convenience

A domain event should represent a meaningful business fact, not a random table mutation. But many organizations begin with change data capture or CRUD-style events because they are easy to produce. Those events may be sufficient for simple propagation, yet poor for semantic replay. Reconstructing intent from row-level changes is possible, but ugly.

3. Stateless consumers versus stateful projections

A notification consumer can ignore history. A ledger, inventory view, customer 360, pricing engine, or fraud model cannot. The more domain value a consumer carries, the more replay matters.

4. Retention cost versus audit and recovery needs

Keeping years of events costs money and operational discipline. Not keeping them pushes the burden elsewhere: warehouses, backups, and extraction pipelines. One way or another, history will be paid for.

5. Exactly-once aspirations versus enterprise reality

People love saying “exactly once.” Enterprises live in “at least once, with compensation, idempotency, and scars.” Replay intensifies this truth. If reprocessing old events causes side effects, your design is wrong, not your broker.

6. Incremental migration versus clean-slate purity

Very few firms can replace a monolith or batch estate overnight. They need progressive strangler migration, parallel runs, and reconciliation windows. Backfill is one of the bridges that lets old and new worlds coexist long enough to change safely.

Solution

The solution is straightforward to say and harder to implement well:

Design the streaming architecture as a dual-time system: one path for live event flow, another for controlled historical replay and reconciliation.

That sounds simple. It isn’t. It requires you to decide what is replayable, where history is stored, how consumers distinguish live from replay traffic, and which actions are permitted during reprocessing.

A good architecture typically includes these ideas:

  1. Authoritative event sources or reconstructable change history. You need a durable basis for rebuilding downstream state. This can be event sourcing in some bounded contexts, CDC plus semantic transformation in others, or immutable integration events persisted beyond transient broker retention.

  2. Explicit replay orchestration. Replay should be an operational workflow, not a hidden side effect. It needs scopes, checkpoints, filters, rate limits, observability, and approvals in sensitive domains.

  3. Idempotent consumers and side-effect isolation. A replayed event must not resend customer emails, rebill accounts, or resubmit trades. Derived-state handlers should be separable from command-producing or externally side-effecting handlers.

  4. Reconciliation as a first-class control loop. Replay alone does not guarantee correctness. Projection drift happens. Data repair happens. Legacy coexistence definitely happens. You need comparison and correction mechanisms between source-of-truth state and downstream projections.

  5. Domain-aware event design. The event model must reflect business meaning. Replaying OrderShipped is useful. Replaying orders.status='S' changed is less so. Domain semantics are not decoration; they are what make historical computation trustworthy.

Here is the core shape.

Diagram 1
Streaming Architecture Without Backfill Is Incomplete

Notice the uncomfortable but necessary detail: the broker is not the whole story. The operational database, outbox, replay orchestrator, and reconciliation engine all matter. In real enterprises, correctness comes from the ensemble, not from Kafka alone.

Architecture

The architecture needs a few sharp boundaries.

Event production

For operational systems, the outbox pattern is usually the sane choice. A service commits domain state and an event record atomically, then publishes asynchronously. This avoids dual-write inconsistencies and gives you a durable event publication log. Where the source is a legacy application that cannot emit domain events cleanly, CDC can help—but only if followed by semantic enrichment, not dumped raw into enterprise topics.
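The outbox idea can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in for the service's operational database; the table names, the `ClaimOpened` event type, and the commented-out broker call are all assumptions for the example, not a prescribed schema.

```python
import json
import sqlite3
import uuid

# Minimal outbox sketch: the domain row and its event record are written in
# one transaction, so a crash can never leave state without its event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, type TEXT, "
    "payload TEXT, published INTEGER DEFAULT 0)"
)

def open_claim(claim_id: str) -> str:
    event_id = str(uuid.uuid4())
    with conn:  # single transaction: both inserts commit, or neither does
        conn.execute("INSERT INTO claims VALUES (?, ?)", (claim_id, "OPEN"))
        conn.execute(
            "INSERT INTO outbox (event_id, type, payload) VALUES (?, ?, ?)",
            (event_id, "ClaimOpened", json.dumps({"claim_id": claim_id})),
        )
    return event_id

def drain_outbox() -> list:
    """Asynchronous relay: publish unpublished rows, then mark them."""
    rows = conn.execute(
        "SELECT event_id, type, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for event_id, event_type, payload in rows:
        # publish_to_broker(event_type, payload)  # e.g. a Kafka producer
        conn.execute(
            "UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,)
        )
    conn.commit()
    return [r[1] for r in rows]
```

The unpublished outbox rows double as a durable publication log, which is exactly the kind of history replay later depends on.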

This is where domain-driven design earns its keep. Inside a bounded context, events should represent business facts in the language of the domain. If the claims context emits ClaimAdjudicated, that means something durable. If it emits a dozen low-level field changes, downstream consumers must reverse-engineer intent. That becomes miserable during replay.

Event history and retention

If replay is a real requirement, broker retention cannot be treated casually. Some domains can replay from Kafka if retention is long enough and compaction is appropriate. Others need archived immutable event storage outside the broker: object storage, lakehouse, or event repository. The design depends on scale, retention, regulatory requirements, and whether consumers need original event order or merely reconstructable facts.

A useful rule: retain enough history in the broker for operational recovery; archive enough history outside the broker for enterprise reconstruction.

Consumer classes

Not all consumers should behave the same under replay. Separate them into categories:

  • Pure projection consumers: build read models, search indexes, aggregates.
  • Analytical consumers: enrich datasets, feed ML features, produce historical metrics.
  • Workflow consumers: trigger commands, tasks, notifications, or external calls.
  • Compliance/audit consumers: preserve immutable records.

Pure projections are ideal replay candidates. Workflow consumers are the dangerous ones. If they consume replayed events blindly, they produce duplicate side effects. So either they opt out of replay, or they operate in a special replay mode that suppresses external actions.
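One way to express that split in a handler is to let every event update derived state, while gating external actions on a replay tag plus an idempotency check. A minimal sketch, assuming a `replay` flag in the event and in-memory stand-ins for the processed-id store and the notification channel:

```python
# Sketch of a replay-aware consumer. The field names ("event_id", "replay")
# and send_email are illustrative, not from any specific framework.
processed_ids: set = set()        # durable store in a real system
projection: dict = {}             # e.g. a claim-status read model
sent_notifications: list = []

def send_email(claim_id: str) -> None:
    sent_notifications.append(claim_id)  # stand-in for an external call

def handle(event: dict) -> None:
    if event["event_id"] in processed_ids:
        return  # idempotency guard: at-least-once delivery is assumed
    processed_ids.add(event["event_id"])

    # Pure projection update: always safe, replay or not.
    projection[event["claim_id"]] = event["status"]

    # External side effect: suppressed for replay-tagged events.
    if not event.get("replay", False):
        send_email(event["claim_id"])
```

Rebuilding the projection from tagged history then updates state without re-notifying anyone.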

Replay control plane

This is the part many teams omit. They assume consumers can simply rewind offsets. Sometimes they can. Often they should not.

A replay control plane should answer:

  • which dataset or topic is being replayed?
  • from what time range or checkpoint?
  • for which consumers or projections?
  • at what rate?
  • with which event version mappings?
  • with what side-effect suppression rules?
  • how is progress measured?
  • how do we stop safely?

This can be a platform service, not a bespoke script collection. If streaming is strategic, replay deserves product thinking.
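Those questions map almost one-to-one onto a replay-request record. A hypothetical sketch; every field name here is an assumption about what such a control plane might track, not an existing API:

```python
from dataclasses import dataclass

# Hypothetical replay request: one field per question the control plane
# should answer before a replay is allowed to start.
@dataclass
class ReplayRequest:
    topic: str                       # which dataset or topic
    from_ts: str                     # time range start (ISO-8601)
    to_ts: str                       # time range end
    consumer_groups: list            # which consumers or projections
    max_events_per_sec: int = 500    # rate limit against live traffic
    schema_mapping: str = "latest"   # event version mapping to apply
    suppress_side_effects: bool = True
    approved_by: str = ""            # approval trail for sensitive domains

    def validate(self) -> None:
        if not self.consumer_groups:
            raise ValueError("replay must name explicit consumer groups")
        if not self.approved_by:
            raise ValueError("replay requires an approver")
```

Making the request an explicit, validated object is what turns replay from a scary offset rewind into an auditable workflow.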

Diagram 2
Replay control plane

Replay metadata

Replayable systems need metadata. At minimum:

  • event id
  • aggregate or entity id
  • event time
  • publication time
  • source system/version
  • replay flag or replay context id
  • causation/correlation ids
  • schema version

This metadata lets consumers distinguish original flow from replay flow. It also gives operations a fighting chance when diagnosing anomalies.
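The list above translates naturally into an event envelope. A sketch with illustrative field names (there is no standard envelope implied here); the `replay_context` field carries the replay flag from the list:

```python
from dataclasses import dataclass
from typing import Optional

# Minimal event envelope carrying the metadata listed above.
@dataclass(frozen=True)
class EventEnvelope:
    event_id: str
    entity_id: str                       # aggregate or entity id
    event_time: str                      # when the business fact happened
    publication_time: str                # when it was published
    source: str                          # source system/version
    schema_version: int
    correlation_id: str
    causation_id: str
    replay_context: Optional[str] = None # set only on replayed copies

    @property
    def is_replay(self) -> bool:
        return self.replay_context is not None
```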

Reconciliation loop

Replay is how you rebuild. Reconciliation is how you prove you are right.

In enterprise migration and day-2 operations, projected state drifts for many reasons: dropped messages, consumer bugs, schema mishandling, delayed upstream corrections, or manual data fixes in legacy systems. Reconciliation compares source-of-truth snapshots or recalculated state with current downstream projections, then raises discrepancies or triggers repair jobs.

This is particularly important during progressive strangler migration. New microservices often coexist with old systems for months or years. During that time, event-driven synchronization is never perfect enough to skip reconciliation.
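At its core, a reconciliation pass is a keyed comparison between a source-of-truth snapshot and the projection, classifying each discrepancy. A minimal sketch; the repair policy (re-replay, patch, escalate) is deliberately left to the caller:

```python
# Reconciliation sketch: compare a source-of-truth snapshot against a
# downstream projection and report drift by category.
def reconcile(source: dict, projection: dict) -> dict:
    missing = [k for k in source if k not in projection]
    stale = [k for k in source if k in projection and projection[k] != source[k]]
    orphaned = [k for k in projection if k not in source]
    return {"missing": missing, "stale": stale, "orphaned": orphaned}
```

Run on a schedule, with mismatch counts exported as metrics, this small loop is what lets the "retire legacy" decision later be evidence-based rather than hopeful.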

Diagram 3
Reconciliation loop

Migration Strategy

The clean-slate version of streaming architecture is a luxury most firms do not have. What they have is a monolith, several packaged applications, nightly integration jobs, and a handful of ambitious service teams. That is why progressive strangler migration is the right frame.

Do not ask the enterprise to believe in a magical cutover. Give it a controlled path where streaming and backfill work together.

Step 1: Identify bounded contexts and the real system of record

This is not a technical inventory exercise. It is a domain exercise. In customer management, who owns customer identity, contact preferences, credit status, and account lifecycle? Different bounded contexts may own different facts. Backfill will fail if you do not know which source is authoritative for which concept.

Step 2: Introduce semantic events at the edge of the current system

Often the first move is an outbox or CDC pipeline on the monolith. But resist the urge to publish raw table changes as the strategic event model. Use a translation layer to shape stable domain events. This is where migration gets honest: not every legacy field maps cleanly, and not every old state change deserves to become an enterprise event.

Step 3: Build replayable downstream projections first

Before using events to trigger business-critical workflows, use them to build read models, search indexes, customer timelines, reporting stores, and operational dashboards. These are the safest places to validate event quality, replay mechanics, and reconciliation discipline.

Step 4: Add backfill pipelines for historical load

A new projection is rarely useful with only current-day events. It needs history. Historical extracts should be transformed into the same semantic event model where possible, then loaded through the same consumer logic. This is one of the most underrated migration principles: use one path for logic, even if data enters from different time horizons.
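The "one path for logic" principle can be sketched directly: a legacy extract row is translated into the same semantic event shape as live traffic, then both feed a single handler. Column names (`CLM_NO`, `STS`) and status codes here are invented for illustration:

```python
# One path for logic: backfilled history and live events converge on the
# same projection handler. Legacy field names are hypothetical.
timeline: dict = {}

def apply_event(event: dict) -> None:
    """Single projection handler, shared by live and backfilled events."""
    timeline.setdefault(event["claim_id"], []).append(event["type"])

def legacy_row_to_event(row: dict) -> dict:
    """Translate a legacy extract row into the semantic event model."""
    status_to_type = {"O": "ClaimOpened", "C": "ClaimClosed"}
    return {"claim_id": row["CLM_NO"], "type": status_to_type[row["STS"]]}

# Historical extract enters through the translation layer...
for row in [{"CLM_NO": "C-1", "STS": "O"}]:
    apply_event(legacy_row_to_event(row))

# ...while live events enter directly, through the same handler.
apply_event({"claim_id": "C-1", "type": "ClaimClosed"})
```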

Step 5: Run live + backfill + reconciliation in parallel

This is the hard middle. Historical data loads while live events continue to arrive. You need cut-off markers, deduplication rules, and reconciliation windows. It is messy. It is normal. The architecture should support it, not pretend otherwise.
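The cut-off and dedupe mechanics of that hard middle can be sketched as a small admission filter. Assumptions: events carry an `event_id` and an ISO-8601 `event_time`, and the backfill is bounded by a fixed cut-off so the live stream owns everything after it:

```python
# Merging backfill with live flow: a cut-off timestamp bounds the
# historical load, and a dedupe key absorbs the overlap window.
CUTOFF = "2024-06-01T00:00:00Z"  # illustrative; ISO-8601 sorts lexically
seen: set = set()
applied: list = []

def accept(event: dict, source: str) -> bool:
    # Backfill may only carry events before the cut-off; the live stream
    # carries everything else, plus an overlap that dedupe absorbs.
    if source == "backfill" and event["event_time"] >= CUTOFF:
        return False
    if event["event_id"] in seen:
        return False
    seen.add(event["event_id"])
    applied.append(event["event_id"])
    return True
```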

Step 6: Gradually switch business capabilities

Once projections are stable, move selected workflows to consume event-driven state. Keep side effects isolated and reversible. During this phase, replay policies become critical, because a reset in a projection should not re-execute business commands.

Step 7: Retire legacy integration carefully

Only when reconciliation results are good over time should old batch feeds and direct queries be decommissioned. Strangling is not just rerouting traffic. It is earning confidence.

Enterprise Example

Consider a large insurer modernizing claims processing.

The legacy landscape has a claims core platform, a customer master, a payments engine, a document management system, and a reporting warehouse refreshed nightly. The firm wants to introduce microservices for claims intake, fraud scoring, adjuster work allocation, and customer notifications. Kafka is chosen as the event backbone.

The first design draft looks familiar: publish claim events from the claims platform, let downstream services react, and build a real-time claim timeline. Everyone is pleased until the awkward questions arrive.

  • How will the fraud service train and score against five years of open and closed claims?
  • How will the new adjuster workload service initialize its queues?
  • How will the notification service avoid resending messages during projection rebuilds?
  • How will the firm prove that the new claim timeline matches the legal record during coexistence?
  • What happens when the event mapping for reserve changes is corrected after three months?

This is exactly where immature streaming programs stumble.

A stronger design emerges.

The claims platform emits an outbox of semantic events such as ClaimOpened, ReserveAdjusted, DocumentReceived, CoverageVerified, ClaimClosed. Historical claim records are extracted from the legacy database and transformed into those same semantic event types where feasible. Both live and historical events feed a replayable projection pipeline.

The fraud scoring service consumes live claim events but maintains a separate historical feature build process. It can replay claim history to rebuild features without triggering operational fraud alerts. The notification service listens only to a curated operational topic and rejects replay-tagged events. The claim timeline UI is built from replayable projections and reconciled nightly against authoritative claim snapshots. During migration, adjuster assignment decisions are compared between old and new logic for several weeks before cutover.

That architecture is not as pretty on a slide as “Kafka connects everything.” But it survives contact with an insurer.

The deeper lesson is domain semantics. In claims, the difference between “field changed” and “reserve adjusted” matters. Replaying the latter preserves business meaning; replaying the former can produce nonsense if field history is incomplete or order-sensitive. Domain events are what make historical reasoning possible.

Operational Considerations

Replay changes operations. It introduces controlled risk in exchange for recoverability. That trade is worth making, but only if done deliberately.

Observability

You need separate visibility for live lag and replay lag. You need to know which consumers are in replay mode, what checkpoint they are at, and whether they are suppressing side effects correctly. A good dashboard shows:

  • event throughput by mode
  • replay progress by time range/entity range
  • projection freshness
  • reconciliation mismatch counts
  • duplicate detection rates
  • dead-letter volumes
  • schema version distribution

Capacity management

Backfills and replays compete with live traffic. This is a common failure mode. Enterprises launch a large replay job, saturate consumers or broker IO, and then wonder why live SLAs collapse. Rate limiting, isolated replay topics, dedicated consumer groups, and off-peak scheduling are not optional.

Schema evolution

Historical replay exposes every lazy schema decision. A consumer built only for the current event shape may fail on older versions. You need explicit compatibility strategy: upcasters, versioned handlers, or historical transformation during replay.
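An upcaster chain is one common shape for that compatibility strategy: each function lifts an event one schema version, so a consumer written against the current shape can read arbitrarily old history. The version changes shown (splitting a name field, adding a default currency) are invented for illustration:

```python
# Upcaster chain sketch: each step lifts an event one schema version.
def v1_to_v2(event: dict) -> dict:
    # v2 split a single "name" field into first/last
    first, _, last = event.pop("name").partition(" ")
    return {**event, "first_name": first, "last_name": last, "version": 2}

def v2_to_v3(event: dict) -> dict:
    # v3 added an explicit currency; historical events default to USD
    return {**event, "currency": "USD", "version": 3}

UPCASTERS = {1: v1_to_v2, 2: v2_to_v3}

def upcast(event: dict, target: int = 3) -> dict:
    """Apply upcasters until the event reaches the target version."""
    while event["version"] < target:
        event = UPCASTERS[event["version"]](event)
    return event
```

The alternative, versioned handlers in every consumer, trades central transformation for per-consumer flexibility; either way the strategy must be explicit before the first replay, not after.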

Data retention and privacy

Replay creates tension with deletion requirements. Some domains can keep immutable event history for years; others must redact or tombstone personal data. This is where legal, compliance, and architecture must work together. “We need replay” does not override privacy obligations.

Access control

Historical replay can expose sensitive business history at scale. Operations tooling needs authorization, approval workflows, and audit trails. In regulated firms, replay requests may need change controls.

Testing

Most teams test the happy live path and neglect historical rebuild scenarios. That is a mistake. You need game days for:

  • projection wipe and rebuild
  • partial replay with live traffic ongoing
  • consumer rollback after bad deployment
  • schema upgrade with old event versions present
  • reconciliation mismatch repair

Tradeoffs

There is no free lunch here. Replayability buys resilience and migration safety, but it costs.

More storage.

You keep more history, in more places, for longer.

More complexity.

A replay control plane, consumer modes, reconciliation loops, and schema evolution strategy all add moving parts.

More discipline.

Teams must distinguish pure computation from side effects, and many teams are not used to that separation.

More upfront design pressure.

Event semantics have to be thought through. Sloppy event design gets very expensive later.

But the alternative cost is usually hidden, not absent. Without replay and backfill, organizations pay through manual scripts, one-off data loads, brittle migrations, support firefighting, and endless “can we trust this number?” meetings. That is not simpler. It is merely undocumented complexity.

My bias is clear: if streaming is strategic, build replay in from the start. If streaming is tactical and shallow, do not pretend otherwise.

Failure Modes

Some failures show up repeatedly.

Replaying events that trigger external side effects

This is the classic sin. A consumer sends emails, posts payments, or opens tickets on every event. Then someone replays six months of history. Chaos follows. The fix is architectural: side effects must be isolated or guarded, not hoped away.

Treating CDC as a domain event model

Raw table changes often lack stable meaning, especially during historical reconstruction. A replay built on low-level mutations can produce inconsistent results if ordering or omitted fields are misunderstood.

Assuming broker retention equals enterprise history

Kafka retention is not a records strategy. It is an operational setting. If the business needs multi-year rebuilds, design for archival and retrieval explicitly.

No reconciliation, only faith

Teams replay data into projections and assume the results are correct. They are often not. Reconciliation is how confidence is earned.

Bulk backfill without cutover boundaries

Historical load and live flow overlap. If you do not define cut-off timestamps, entity ownership windows, or dedupe keys, you get duplicate or missing state.

Event version drift

Old events stop being readable. New consumers only understand current versions. Replay then becomes impossible precisely when needed most.

When Not To Use

Not every streaming architecture needs heavy replay machinery.

Do not overbuild this pattern when:

  • the domain is low-value, low-history, and tolerant of simple re-extraction
  • consumers are purely ephemeral notifications with no stateful rebuild requirement
  • source systems can be queried cheaply and safely for current state, and historical reconstruction is irrelevant
  • the organization lacks the operational maturity to run streaming safely at all

There is a real “when not to use” here. If your integration needs are modest and a daily batch is sufficient, forcing event replay architecture everywhere is cargo cult. Likewise, if the source domain has weak semantics and no appetite to improve them, publishing flimsy pseudo-events into Kafka will not magically create a robust event-driven enterprise.

A blunt opinion: if you cannot define the business meaning of an event and the safe behavior of a replay, you are not ready for strategic streaming in that domain.

Related Patterns

Several patterns sit close to this one.

  • Outbox Pattern: reliable event publication from transactional systems.
  • Change Data Capture: useful for extraction, but usually needs semantic shaping.
  • Event Sourcing: strongest replay story inside a bounded context, but not required everywhere.
  • CQRS: projections and read models are natural replay consumers.
  • Strangler Fig Pattern: essential for progressive migration from legacy systems.
  • Materialized Views: common targets for replay and rebuild.
  • Idempotent Consumer: table stakes for any replay-capable architecture.
  • Dead Letter Queues / Parking Lots: helpful, but not substitutes for replay design.
  • Data Reconciliation Pipelines: the practical companion to migration and replay.

These patterns are not interchangeable. Event sourcing, for example, gives excellent rehydration but may be overkill for many enterprise domains. CDC helps migrate legacy systems but often yields poor domain semantics if left unrefined. The architect’s job is to mix them with intent, not decorate a diagram with fashionable terms.

Summary

Streaming architecture is powerful because it gives enterprises a way to represent business change as flow. But flow alone is not enough. Enterprises also need memory, correction, reconstruction, and controlled migration. They need to bring new consumers into an old world. They need to recover from bugs. They need to explain numbers. They need to rebuild projections without replaying business harm.

That is why a streaming architecture without backfill is incomplete.

The right design treats replay as a first-class capability, not a maintenance trick. It grounds event design in domain-driven semantics. It separates pure projections from side-effecting workflows. It introduces reconciliation as a control loop. It supports progressive strangler migration rather than pretending legacy disappears on command. And it accepts a simple operational truth: systems fail, logic changes, and history has to be processed again.

In enterprise architecture, the test of a system is not whether it works on the first pass. The test is whether it can survive the second, third, and tenth pass when the organization changes its mind, its models, or its structure.

Live streaming is the headline.

Replay is the architecture.

Frequently Asked Questions

What is event-driven architecture?

Event-driven architecture (EDA) decouples services by having producers publish events to a broker like Kafka, while consumers subscribe independently. This reduces direct coupling, improves resilience, and allows new consumers to be added without modifying producers.

When should you use Kafka vs a message queue?

Use Kafka when you need event replay, high throughput, long retention, or multiple independent consumers reading the same stream. Use a traditional message queue (RabbitMQ, SQS) when you need simple point-to-point delivery, low latency, or complex routing logic per message.

How do you model event-driven architecture in ArchiMate?

In ArchiMate, the Kafka broker is a Technology Service or Application Component. Topics are Data Objects or Application Services. Producer/consumer services are Application Components connected via Flow relationships. This makes the event topology explicit and queryable.