State Propagation Strategies in Distributed Systems

Distributed systems don’t fail because machines are unreliable. They fail because meaning leaks between boundaries.

That’s the real problem.

We like to pretend the hard part is transport: queues, brokers, retries, HTTP timeouts, Kafka partitions, consumer lag. Those things matter, of course. But most enterprise pain around state propagation starts one layer above the plumbing. A system changes. Another system needs to know. A third system needs to react. A fourth system must reconcile because it heard something late, twice, or not at all. Suddenly what looked like “move data from A to B” becomes a long argument about truth, timing, and ownership.

This is why push versus pull is not a minor implementation choice. It is an architectural decision about how knowledge moves through a landscape of bounded contexts.

A push model says: “When I change, I will tell you.”

A pull model says: “When I need to know, I will ask.”

Most real enterprises end up with both, whether they planned for it or not.

And that is the first opinion worth stating plainly: there is no universally correct state propagation strategy. There are only tradeoffs, and they are deeply tied to domain semantics. If you ignore the domain, you will build a fast, elegant, disastrously wrong integration architecture.

This article looks at state propagation in distributed systems through a practical enterprise lens: domain-driven design, event-driven architecture, Kafka-centric integration, microservices, reconciliation, migration, and the ugly operational truth. We’ll look at push and pull patterns, where each fits, how they fail, and how to evolve from one to the other without detonating the business.

Context

Every enterprise has state scattered across systems.

An order is placed in a commerce platform. Inventory is reserved in a warehouse system. Credit is checked in finance. Shipment status changes in logistics. Customer preferences sit in CRM. Product attributes live in PIM. None of these systems are merely databases. They are decision engines with different models, different language, different rhythms.

That distinction matters.

In domain-driven design terms, each of these systems often represents a bounded context. “Order,” “customer,” “availability,” and “invoice” may all sound like shared nouns, but they do not mean the same thing everywhere. A customer in marketing is a profile. A customer in billing is a legal entity. Inventory in e-commerce is a promise. Inventory in warehouse operations is a physical count. State propagation is therefore not simply data replication. It is semantic translation under temporal uncertainty.

Enterprises often discover this the hard way during modernization.

A monolith gets decomposed into services. A central database gets replaced with APIs and event streams. Kafka appears. Teams celebrate decoupling for six months, then spend the next two years chasing consistency bugs. Why? Because the monolith hid many state transitions inside one transaction boundary. Once split apart, those same transitions become propagation problems.

Push and pull are the two primitive moves in this game.

  • In push, the producer emits state change notifications or events to downstream consumers.
  • In pull, consumers retrieve state from a source system when needed, on a schedule, or after receiving a lightweight trigger.

That sounds simple. It isn’t. The details determine whether your architecture is resilient or brittle.
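
To make the two primitives concrete, here is a bare-bones, in-memory sketch. All names (PushBus, PullSource) are illustrative, not a real library: the point is only who drives the timing in each model.

```python
class PushBus:
    """Push: the producer notifies every subscriber when state changes."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        for handler in self._subscribers:
            handler(event)  # the producer drives the timing


class PullSource:
    """Pull: consumers ask for current state when they need it."""
    def __init__(self):
        self._state = {}

    def set(self, key, value):
        self._state[key] = value

    def fetch(self, key):
        return self._state.get(key)  # the consumer drives the timing


# Push: the consumer learns about the change immediately.
bus = PushBus()
seen = []
bus.subscribe(seen.append)
bus.publish({"type": "OrderPlaced", "order_id": "o-1"})

# Pull: the consumer sees whatever is current at fetch time.
source = PullSource()
source.set("o-1", {"status": "PLACED"})
current = source.fetch("o-1")
```

Everything that follows in this article is, in one way or another, about the consequences of choosing one of these two loops, or combining them.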

Problem

How should state changes propagate across distributed systems so that downstream capabilities remain correct enough, fresh enough, and cheap enough to operate?

That single sentence carries four conflicting concerns:

  1. Correctness: Did downstream systems get the right state, in the right order, with the right meaning?
  2. Freshness: How quickly does a change become visible elsewhere?
  3. Cost: What are the infrastructure, coupling, and development costs of making that happen?
  4. Operability: Can humans diagnose and recover the system when propagation inevitably breaks?

The trap is that teams often optimize for one and accidentally damage the others.

A push-heavy design can deliver low latency but create tight semantic coupling and replay nightmares. A pull-heavy design can reduce coupling but overload source systems and produce stale decisions. A naive hybrid can combine the worst of both.

The core question is not “push or pull?” It is:

For this domain capability, who should own truth, who should know about change, how stale can data be, and how do we recover when the propagation path lies?

Forces

State propagation sits in the middle of several architectural forces. Ignore any of them and the design will look elegant in diagrams and ugly in production.

1. Domain ownership

The first force is ownership. Every important business fact should have a clear system of record. That does not mean only one system stores a copy. It means one bounded context owns the business rules that decide what that fact means.

If pricing owns “current sell price,” then other services may cache or project it, but they should not redefine it casually. If order management owns “order acceptance,” inventory cannot silently invent orders through side effects.

Push and pull both behave differently depending on whether the source is authoritative or derivative.

2. Latency tolerance

Some business processes can tolerate minutes or hours of delay. Others cannot.

Fraud scoring, payment authorization, and warehouse allocation often need near-real-time updates. Executive reporting, customer segmentation, and search indexing can usually tolerate eventual consistency.

Architects get into trouble when they use the same propagation strategy for both categories because “standardization” sounded prudent.

3. Read/write asymmetry

Many domains are read-heavy. Product catalog, pricing reference data, and customer preference lookup are often read far more frequently than they change. Other domains are change-heavy: clickstreams, telemetry, trading signals, and logistics updates.

Push models tend to shine when there are many interested consumers and state changes carry business significance. Pull models often work better when consumers need selective access to relatively stable information.

4. Coupling pressure

Push can reduce runtime dependency on the source system, but it increases coupling to the event contract and state interpretation. Pull can reduce semantic coupling if consumers ask for exactly what they need, but it increases runtime dependency on the source API or database view.

This is a subtle but important tradeoff. Teams often say push is “more decoupled.” That is only half true. It decouples invocation timing, not meaning.

5. Consistency and replay

If consumers can miss updates, join late, or require rebuilding, the architecture needs a replay story. Kafka helps here because event logs are durable and replayable. But replay only works if events are well-designed, retained long enough, and interpretable without hidden state.

Pull has a different replay advantage: if the source system still holds truth, consumers can re-fetch current state. But that helps only for current state, not historical transitions.

6. Scale and blast radius

Push fans out changes to many consumers efficiently, especially through brokers. Pull spreads load across time but can create thundering herds, polling storms, or API hotspots.

In enterprises, source systems are often old, fragile, and expensive to scale. Architects who prescribe aggressive pull against a decades-old ERP usually learn humility fast.

Solution

The practical answer is not “choose push” or “choose pull.” It is to select among three broad propagation styles:

  1. Pure push
  2. Pure pull
  3. Hybrid push-trigger / pull-fetch

Most successful enterprise architectures use the third more than they admit.

Push model

In a push strategy, the source emits events or messages when state changes. Consumers subscribe and update their own projections, caches, or processes.

Typical implementations:

  • Kafka topics with domain events
  • Change data capture (CDC) into event streams
  • Webhooks
  • Message brokers such as RabbitMQ, SNS/SQS, JMS

Push is best when:

  • many consumers need to know about changes
  • low latency matters
  • the source system should not be called repeatedly
  • downstream systems need event history, not just current state
  • domain events represent meaningful business transitions

Push is at its best when events speak domain language: OrderPlaced, InventoryReserved, PaymentAuthorized, not ROW_UPDATED_CUSTOMER_TABLE.

A table-change event is motion without meaning.
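
The contrast is easy to see side by side. The shapes below are illustrative assumptions, not a real CDC or event format: one describes a row mutation, the other states a business fact.

```python
from dataclasses import dataclass

# A table-change event: motion without meaning. The consumer must
# reverse-engineer intent from column diffs (hypothetical CDC shape).
row_update = {
    "table": "CUSTOMER",
    "op": "UPDATE",
    "before": {"STATUS": "1"},
    "after": {"STATUS": "2"},
}

# A domain event: the business transition, stated in domain language.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str
    total_cents: int

event = OrderPlaced(order_id="o-42", customer_id="c-7", total_cents=12999)
```

A consumer of OrderPlaced never needs to know which tables changed, or what STATUS code 2 means this quarter.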

Pull model

In a pull strategy, consumers request current state from the source on demand or on a schedule.

Typical implementations:

  • synchronous API calls
  • periodic polling
  • batch extracts
  • query federation
  • materialized snapshots fetched by consumers

Pull is best when:

  • consumers need only occasional access
  • current state matters more than transition history
  • changes are infrequent
  • selective retrieval is important
  • source ownership must stay explicit and centralized

Pull is often the right answer for master data lookups, reference data, and workflows where stale data can be tolerated.

Hybrid push-trigger / pull-fetch

This is the enterprise workhorse.

The source pushes a lightweight notification, often containing identity and version metadata. Consumers then pull full state when needed. This pattern balances low-latency awareness with controlled data retrieval and simpler event contracts.

Example:

  • Product service emits ProductChanged(productId, version)
  • Search indexing service receives it and fetches the latest searchable product view
  • Recommendation engine may ignore some changes or fetch only for certain categories

This avoids putting every field into every event while still preventing blind polling.

It is not free. Consumers now depend on both the event stream and the source retrieval interface. But the trade is often worth it.
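
A minimal sketch of the pattern, assuming the product example above. All names (ProductStore, searchable_view, SearchIndexer) are hypothetical; the version check is what keeps duplicate or stale notifications from causing redundant fetches.

```python
class ProductStore:
    """Source of record, exposing fit-for-purpose read views."""
    def __init__(self):
        self._products = {}

    def save(self, product_id, version, data):
        self._products[product_id] = {"version": version, **data}

    def searchable_view(self, product_id):
        p = self._products[product_id]
        # Return only the slice the search consumer needs.
        return {"version": p["version"], "title": p["title"]}


class SearchIndexer:
    """Consumer: reacts to the push trigger, pulls only what it needs."""
    def __init__(self, store):
        self.store = store
        self.index = {}  # product_id -> indexed view

    def on_product_changed(self, product_id, version):
        known = self.index.get(product_id)
        if known and known["version"] >= version:
            return  # stale or duplicate notification: skip the fetch
        self.index[product_id] = self.store.searchable_view(product_id)


store = ProductStore()
indexer = SearchIndexer(store)
store.save("p-1", 3, {"title": "Blue Kettle", "media": ["img1", "img2"]})
indexer.on_product_changed("p-1", 3)  # first notification: fetch the view
indexer.on_product_changed("p-1", 3)  # duplicate delivery: ignored
```

Note that the event carried only identity and version; the wide "media" payload never traveled through the stream at all.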

Architecture

A useful way to think about propagation is to separate three concerns:

  • state authority
  • change notification
  • state retrieval / projection

When architects collapse these into one mechanism, complexity tends to leak everywhere.

In this architecture, Kafka carries change intent or state change signals. Consumers build local projections. Some consumers may still pull the source for enrichment or reconciliation.

That last part matters because event-driven architecture does not eliminate the need for queries. It changes where and when they happen.

Domain semantics first

Under DDD, don’t start with transport. Start with the domain model.

Ask:

  • What event actually occurred in business terms?
  • Is downstream interested in a transition, a snapshot, or a derived decision?
  • Does the consumer need the whole aggregate or only a subset?
  • Is ordering required within an aggregate, across aggregates, or not at all?
  • Is the consumer acting on authoritative truth or building a read model?

For example, CustomerAddressChanged may be a meaningful event for shipping, tax, and communications. But the “address” each consumer needs may differ. Shipping wants deliverability fields. Tax cares about jurisdiction. Marketing may care only about region. This is why giant canonical events so often become junk drawers.

If one event tries to satisfy every consumer forever, it soon satisfies none of them well.

Event-carried state transfer vs notification

Push designs come in two variants:

  1. Event-carried state transfer: The event contains enough state for consumers to update themselves.
  2. Notification event: The event says something changed and identifies what changed; consumers fetch details if needed.

Event-carried state is great for autonomy and replayable projections. But it creates larger payloads, broader schema commitments, and pressure toward canonical models. Notification events are smaller and more stable, but they shift complexity to pull paths.

Again, the right answer depends on the domain and consumption pattern.

API composition and pull

Pull does not have to mean naive direct calls all over the estate. Good architectures shape pull behind fit-for-purpose interfaces:

  • read APIs for bounded contexts
  • anti-corruption layers
  • query services
  • cached materialized views
  • replicated read stores

If ten services are polling a transactional ERP every 30 seconds, the architecture is not “simple.” It is deferred failure.

Reconciliation as a first-class concern

Any nontrivial state propagation mechanism needs reconciliation. Not optional. Not “phase two.” First-class.

Reconciliation answers the awkward question: what if the consumer’s state diverges from the source?

Reasons for divergence include:

  • missed events
  • consumer downtime
  • poison messages
  • schema evolution bugs
  • duplicate processing
  • out-of-order delivery
  • source correction after original publication

Reconciliation can be:

  • periodic snapshot comparison
  • version checks
  • compensating events
  • replay from Kafka
  • pull-based rehydration
  • audit reports for manual resolution

The most robust architectures treat real-time propagation as the fast path and reconciliation as the safety net.

Diagram 2: Reconciliation as a first-class concern

This is not glamorous architecture. It is survivable architecture.
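
The simplest reconciliation job is a version comparison between the source and a consumer projection. A sketch, with illustrative names; in practice the version maps would come from snapshot queries on both sides.

```python
def reconcile(source_versions, projection_versions):
    """Return ids whose projection is stale, missing, or ahead of source."""
    drift = {"stale": [], "missing": [], "ahead": []}
    for key, src_v in source_versions.items():
        proj_v = projection_versions.get(key)
        if proj_v is None:
            drift["missing"].append(key)   # candidate for rehydration
        elif proj_v < src_v:
            drift["stale"].append(key)     # candidate for replay or re-fetch
        elif proj_v > src_v:
            drift["ahead"].append(key)     # should be impossible: investigate
    return drift


source = {"order-1": 5, "order-2": 2, "order-3": 1}
projection = {"order-1": 5, "order-2": 1}  # order-2 stale, order-3 missing

report = reconcile(source, projection)
```

The output of a job like this feeds exactly the repair paths listed above: replay, rehydration, or a ticket for operations.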

Migration Strategy

Most enterprises are not starting from a greenfield landscape. They have monoliths, packaged applications, brittle point-to-point integrations, nightly batches, and reporting stores pretending to be operational interfaces.

So migration matters.

The right migration strategy is usually a progressive strangler, not a heroic rewrite.

Start with observation, not replacement

Before choosing push or pull, map current state movement:

  • where does truth live today?
  • what downstream decisions depend on it?
  • what latency is actually required?
  • what hidden batch jobs already reconcile state?
  • where are semantic mismatches causing defects?

This mapping often reveals that the current system already uses a messy hybrid model. Naming it clearly is the first step toward improving it.

Introduce event streams at seams

A practical path is to add event publication around existing systems without immediately forcing all consumers to become event-native.

Common seam options:

  • CDC from legacy databases into Kafka
  • domain event publication from the monolith
  • webhook facade in front of legacy workflows
  • integration service translating internal changes into external events

This creates a backbone for new consumers while old consumers continue using pull or batch.

Use pull as a migration bridge

During strangler migration, pull is often the bridge that keeps risk under control.

Example:

  • Legacy order system remains source of record
  • New fulfillment service subscribes to OrderChanged
  • For fields not yet included or trusted in events, fulfillment pulls current order state from an anti-corruption API
  • Over time, event fidelity improves and pull reduces

This allows gradual tightening of event contracts instead of betting the business on perfect event design from day one.
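
A sketch of that bridge in the fulfillment consumer, with all names assumed (TRUSTED_EVENT_FIELDS, legacy_order_api). The trusted-field set is the dial: as event fidelity improves, fields move into it and the pull path shrinks.

```python
TRUSTED_EVENT_FIELDS = {"order_id", "status"}  # event fidelity earned so far

def legacy_order_api(order_id):
    """Stand-in for the anti-corruption API over the legacy order system."""
    return {"order_id": order_id, "status": "CONFIRMED",
            "ship_to": "NL", "carrier": "ACME"}

def build_fulfillment_view(event):
    # Keep only the event fields we currently trust.
    view = {k: v for k, v in event.items() if k in TRUSTED_EVENT_FIELDS}
    needed = {"ship_to", "carrier"}
    if not needed <= view.keys():          # fields not yet trusted in events
        full = legacy_order_api(event["order_id"])
        view.update({k: full[k] for k in needed})
    return view

view = build_fulfillment_view({"order_id": "o-9", "status": "CONFIRMED",
                               "carrier": "IGNORED-UNTRUSTED"})
```

The untrusted carrier value in the event is deliberately discarded and re-fetched from the authoritative side, which is the whole point of the bridge.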

Move from generic integration events to domain events

CDC is useful for bootstrapping but dangerous as an endpoint. Database changes expose implementation details, not business intent.

A healthy migration path looks like this:

  1. CDC emits low-level changes
  2. integration layer normalizes and enriches
  3. domain services start publishing explicit domain events
  4. consumers shift from table semantics to domain semantics

That is migration with direction. Without direction, CDC becomes your architecture.
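
Step 2 of that path, the integration layer, is essentially a translation function. A sketch with assumed table and field names: it publishes a domain event only when the row change represents a business transition, and stays silent otherwise.

```python
def normalize_cdc(change):
    """Map a table-level CDC change to a domain event, or None if noise."""
    if change["table"] == "ORDERS" and change["op"] == "UPDATE":
        before, after = change["before"], change["after"]
        # Only the PACKED -> SHIPPED transition is a business fact here.
        if before["STATUS"] != "SHIPPED" and after["STATUS"] == "SHIPPED":
            return {"type": "OrderShipped",
                    "order_id": after["ORDER_ID"],
                    "shipped_at": after["SHIPPED_AT"]}
    return None  # implementation noise: do not publish

event = normalize_cdc({
    "table": "ORDERS", "op": "UPDATE",
    "before": {"ORDER_ID": "o-1", "STATUS": "PACKED", "SHIPPED_AT": None},
    "after":  {"ORDER_ID": "o-1", "STATUS": "SHIPPED",
               "SHIPPED_AT": "2024-05-01"},
})
```

The filtering is the important part: most row updates never deserve to become events, and publishing them anyway is how CDC quietly becomes your architecture.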

Strangler by capability, not by entity

Do not migrate “customer” or “order” as giant objects if you can avoid it. Migrate capabilities:

  • customer notifications
  • order pricing
  • shipment visibility
  • product search indexing

Capabilities align better with bounded contexts and propagation needs. They also make tradeoffs explicit. Search indexing may tolerate eventually consistent push. Payment authorization likely needs synchronous pull for authoritative confirmation.

This is classic strangler architecture: let the old and new coexist, give the new system a way to observe change, and progressively move meaning outward from implementation detail toward domain language.

Enterprise Example

Consider a global retailer modernizing order fulfillment across e-commerce, stores, and warehouses.

The landscape looks familiar:

  • SAP handles parts of inventory and finance
  • a legacy order management platform owns order capture
  • a warehouse management system tracks physical stock
  • a modern e-commerce platform needs near-real-time availability
  • customer service tools need current order status
  • analytics wants event history

The naive move would be to make every system call the order platform synchronously whenever it needs state. That would centralize truth, yes. It would also collapse under load, create cascading latency, and turn the order platform into a bottleneck for everything from search to customer support.

So the retailer adopts a mixed propagation strategy.

Availability propagation: push

Inventory changes are high-volume and many consumers care. The warehouse and inventory services publish events such as StockAdjusted, ReservationCreated, and ReservationReleased to Kafka.

Downstream consumers:

  • e-commerce availability service updates regional ATP projections
  • store systems update local pickup visibility
  • analytics tracks reservation churn

Why push? Because freshness matters and many consumers need the same changes.

Product data: push-trigger / pull-fetch

Product attributes change less often but the payload is wide and heterogeneous. Search, recommendations, and digital assets care about different slices of data.

The product service emits ProductChanged(productId, version).

Consumers then pull fit-for-purpose views:

  • search fetches searchable fields
  • recommendation engine fetches category and relationship data
  • digital storefront fetches media metadata

Why hybrid? Because stuffing every product attribute into every event would create bloated contracts and constant versioning pain.

Credit and payment authorization: pull

When an order is submitted, the order service calls payment and credit services synchronously for authoritative decisions.

Why pull? Because this is not general state distribution. It is transactional decisioning where the latest answer matters more than event history.

Reconciliation

Nightly is not enough, so the retailer runs continuous reconciliation jobs:

  • compare inventory reservation versions between order and inventory contexts
  • repair stale order status projections
  • trigger selective replay from Kafka for failed consumers
  • route unresolved mismatches to operations

This ends up being one of the smartest investments in the program. Not because the design is weak, but because the business is real. Warehouses go offline. Consumers deploy bad code. Messages poison queues. Humans make corrections.

The enterprise lesson is simple: real architecture plans for drift.

Operational Considerations

A propagation strategy lives or dies in operations.

Observability

You need end-to-end visibility of state movement:

  • event publication success/failure
  • consumer lag
  • replay activity
  • API latency and error rates for pull paths
  • version skew between source and consumers
  • dead-letter queues
  • reconciliation outcomes

A dashboard that shows only Kafka broker health is not enough. You need business observability: “how many orders are missing fulfillment projections?” is more useful than “consumer group lag increased.”

Idempotency

Push consumers must be idempotent. Duplicates happen. Retried delivery happens. Reprocessing happens. If processing the same OrderShipped event twice causes duplicate notifications or double updates, the architecture is unfinished.
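
The standard defense is deduplication by event identity before the side effect runs. A sketch with illustrative names; in production the processed-id set would live in a durable store, updated in the same transaction as the side effect where possible.

```python
class ShipmentNotifier:
    def __init__(self):
        self.processed_ids = set()   # in production: a durable store
        self.notifications_sent = 0

    def on_order_shipped(self, event):
        if event["event_id"] in self.processed_ids:
            return                   # duplicate delivery: safe no-op
        self.notifications_sent += 1  # the side effect (send notification)
        self.processed_ids.add(event["event_id"])

notifier = ShipmentNotifier()
event = {"event_id": "evt-123", "order_id": "o-1"}
notifier.on_order_shipped(event)
notifier.on_order_shipped(event)  # redelivered: must not notify twice
```

This requires producers to attach stable event ids, which is itself a contract decision worth making early.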

Ordering

Ordering is one of the most abused assumptions in distributed systems.

Kafka can preserve order within a partition, not across the universe. If ordering matters, define where it matters:

  • within aggregate instance?
  • within customer?
  • within warehouse?
  • globally?

Most domains need local ordering, not global ordering. Model for that.
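
Local ordering can be enforced with a per-aggregate sequence number. A sketch, field names assumed: events that arrive out of sequence for one order are parked for later replay, while other orders proceed independently.

```python
class OrderProjection:
    def __init__(self):
        self.seq = {}     # order_id -> last applied sequence number
        self.state = {}   # order_id -> current status
        self.parked = []  # out-of-order events held for retry or replay

    def apply(self, event):
        oid, seq = event["order_id"], event["seq"]
        expected = self.seq.get(oid, 0) + 1
        if seq != expected:
            self.parked.append(event)  # gap or duplicate: do not apply
            return
        self.seq[oid] = seq
        self.state[oid] = event["status"]

p = OrderProjection()
p.apply({"order_id": "o-1", "seq": 1, "status": "CONFIRMED"})
p.apply({"order_id": "o-1", "seq": 3, "status": "CANCELLED"})  # gap: parked
p.apply({"order_id": "o-2", "seq": 1, "status": "CONFIRMED"})  # independent
```

Note that o-2 is unaffected by o-1's gap, which is exactly what "local, not global, ordering" buys you.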

Versioning and schema evolution

Event schemas and pull APIs evolve. If you do not govern compatibility, state propagation will become deployment roulette.

Good practices:

  • additive event changes where possible
  • explicit schema registry
  • consumer tolerance for unknown fields
  • source versions in payloads
  • deprecation windows
  • contract tests for critical consumers
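
"Consumer tolerance for unknown fields" is often called the tolerant reader. A sketch under assumed field names: the consumer validates only what it requires and silently ignores additions, so a producer's additive change does not break it.

```python
REQUIRED = {"order_id", "status"}

def read_order_event(payload):
    missing = REQUIRED - payload.keys()
    if missing:
        raise ValueError(f"incompatible event, missing: {sorted(missing)}")
    # Keep only the fields this consumer understands; ignore the rest.
    return {k: payload[k] for k in REQUIRED}

# A v2 producer added "priority"; the v1 consumer keeps working unchanged.
view = read_order_event({"order_id": "o-1", "status": "PLACED",
                         "priority": "HIGH"})
```

The same discipline is what makes "additive changes where possible" a safe evolution policy rather than a hope.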

Backpressure and rate control

Push can overwhelm consumers. Pull can overwhelm providers. Both need flow control:

  • consumer lag alarms
  • retry with jitter
  • circuit breakers on pull APIs
  • bulkheads for source systems
  • queue retention policies
  • snapshot catch-up paths for lagging consumers
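
"Retry with jitter" deserves one concrete shape, because retries without jitter synchronize failing consumers into polling storms. A common scheme is exponential backoff with full jitter: each delay is drawn uniformly from zero up to a doubling, capped ceiling.

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=5, rng=random.random):
    """Yield one randomized delay per attempt, in seconds.

    Attempt n waits uniform in [0, min(cap, base * 2**n)], so a crowd of
    failing consumers spreads out instead of re-polling in lockstep.
    """
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        yield rng() * ceiling

# With rng pinned to 1.0 we see the deterministic upper bounds.
delays = list(backoff_delays(rng=lambda: 1.0))
```

In real consumers each delay would precede the next fetch attempt, and a circuit breaker would stop the loop entirely once the source is clearly down.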

Security and data minimization

Push often distributes data widely. Pull centralizes access but may expose source systems more broadly.

Architects should ask:

  • does every consumer really need PII in the event?
  • can notification events reduce data spread?
  • should consumer-specific read models hide sensitive fields?
  • do replay and retention policies create compliance risks?

State propagation is also data governance.

Tradeoffs

Let’s be blunt. Every choice here costs something.

Push strengths

  • low-latency dissemination
  • efficient fan-out
  • reduced direct runtime dependence on source
  • strong support for event history and replay
  • good fit for reactive workflows

Push weaknesses

  • semantic coupling to event contracts
  • difficult consumer onboarding if history is incomplete
  • replay complexity for non-idempotent consumers
  • larger storage/retention footprint
  • risk of publishing implementation detail instead of domain intent

Pull strengths

  • simpler mental model
  • strong source authority
  • selective retrieval
  • easier access to current truth
  • often easier for transactional decision flows

Pull weaknesses

  • runtime coupling to source availability
  • risk of polling storms and API hotspots
  • stale data if polling interval is too coarse
  • weak historical replay unless source stores history
  • source system becomes scale bottleneck

Hybrid strengths

  • low-latency awareness with controlled data transfer
  • smaller event contracts
  • flexible consumer retrieval
  • useful for progressive migration

Hybrid weaknesses

  • more moving parts
  • dual dependency on stream and query interface
  • harder to test end to end
  • easy to hide bad event design behind endless fetches

My bias: use push for meaningful business changes, pull for authoritative decisions and selective reads, and hybrid where payload breadth or migration reality demands it.

But bias is not doctrine. Some domains need the opposite.

Failure Modes

Distributed systems have a grim talent for failing in ways diagrams politely omit.

1. Semantic drift

The producer changes what an event means without realizing consumers encoded the old meaning. Everything still deserializes. Business results quietly rot.

This is one reason domain-driven design matters. Shared words are not shared meaning unless you maintain them.

2. Lost updates

A consumer misses events due to outage, retention expiry, offset corruption, or deployment mistakes. Its local view becomes stale and remains stale because no one notices.

Without reconciliation, this can live for months.

3. Ordering bugs

OrderCancelled arrives before OrderConfirmed, or two concurrent inventory changes cross in transit. Consumers apply transitions blindly and end up in impossible states.

Versions, sequence numbers, and aggregate-level ordering rules are your friends here.

4. Pull amplification

A popular downstream service receives a traffic spike and starts hammering the source API. The source slows down, retries increase, and now half the estate is waiting on one struggling system.

This is how “simple synchronous integration” becomes an incident bridge.

5. Event payload overreach

Teams put every possible field into events “for future consumers.” Payloads become giant, sensitive, unstable, and expensive to evolve.

Canonical event models often die this way: they become political compromises instead of useful contracts.

6. Poison message paralysis

One malformed message blocks a consumer path. Lag rises. Downstream projections stall. Operations sees infrastructure green but business state stale.

Dead-letter handling and skip/replay controls are not optional in push systems.

7. False confidence in eventual consistency

People say “it’s eventually consistent” as if that were an architecture. It is not. It is an admission that time matters and correctness arrives later. The real questions are: how much later, with what business impact, and how do you repair divergence?

When Not To Use

Not every problem deserves sophisticated state propagation.

Don’t use push-heavy event propagation when:

  • there are only one or two consumers with modest read volume
  • the domain requires immediate authoritative answers, not asynchronous awareness
  • the source cannot produce meaningful domain events
  • the organization lacks operational maturity for replay, schema governance, and reconciliation

Don’t use pull-heavy designs when:

  • many consumers need near-real-time updates
  • the source system is fragile or expensive to scale
  • consumers need historical transitions, not just current state
  • polling frequency required for freshness would be absurd

Don’t use hybrid if:

  • your team is already struggling to operate either streams or APIs well
  • the event is so underspecified that every consumer must always fetch everything
  • you are using “hybrid” to avoid making domain ownership decisions

And one more uncomfortable point: if your enterprise has not established bounded contexts and clear ownership, no propagation strategy will save you. You will just distribute confusion faster.

Related Patterns

Several patterns sit close to this topic.

Event sourcing

Stores state as a sequence of domain events. Powerful for replay and audit, but not necessary for most propagation problems. Don’t reach for it just because Kafka exists.

CQRS

Separates write and read models. Very relevant when push builds read projections optimized for downstream queries.

Transactional outbox

Critical for reliable event publication from services that update a database and emit events. Prevents the classic “DB committed but event lost” problem.
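
A minimal sketch of the outbox mechanics, using sqlite3 as a stand-in for the service database (table and function names are illustrative). The state change and the pending event commit in one transaction; a separate relay drains and publishes later.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox ("
           " event_id INTEGER PRIMARY KEY AUTOINCREMENT,"
           " type TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id):
    with db:  # one transaction: both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "PLACED"))
        db.execute("INSERT INTO outbox (type, payload) VALUES (?, ?)",
                   ("OrderPlaced", order_id))

def relay(publish):
    """In production a separate process: drain unpublished outbox rows."""
    rows = db.execute("SELECT event_id, type, payload FROM outbox"
                      " WHERE published = 0 ORDER BY event_id").fetchall()
    for event_id, etype, payload in rows:
        publish({"type": etype, "payload": payload})
        db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?",
                   (event_id,))
    db.commit()

place_order("o-1")
sent = []
relay(sent.append)
relay(sent.append)  # nothing left unpublished: second run is a no-op
```

If the broker is down, the row simply stays unpublished and the relay retries; the "DB committed but event lost" window disappears.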

Change data capture

Useful migration tool and integration mechanism, but weak on domain semantics unless enriched.

Saga / process manager

Coordinates long-running business processes across services. Often depends on state propagation events but should not be confused with generic data sync.

Anti-corruption layer

Essential when new services must pull or interpret state from legacy systems without importing their ugly model into the new domain.

Materialized view

A common result of push propagation: consumers maintain local read-optimized state rather than querying the source repeatedly.

Summary

State propagation is not a plumbing debate. It is a design decision about truth, timing, and meaning across bounded contexts.

Push says changes matter and should travel.

Pull says authority matters and should be consulted.

Hybrid says awareness and retrieval can be separated.

In enterprise architecture, the best designs rarely pick one mechanism and apply it everywhere. They choose deliberately by domain:

  • push for meaningful, high-value change fan-out
  • pull for authoritative decisions and selective access
  • hybrid for broad payloads, migration seams, and consumer-specific projections

Use domain-driven design to decide what state means before deciding how it moves. Use strangler migration to evolve without rewriting the world. Use Kafka where event streams add value, not as a religious requirement. And always, always make reconciliation a first-class capability.

Because in distributed systems, propagation is easy.

Recovery is architecture.

The key is not replacing everything at once, but progressively earning trust while moving meaning, ownership, and behavior into the new platform.

Frequently Asked Questions

What is a service mesh?

A service mesh is an infrastructure layer managing service-to-service communication. It provides mutual TLS, load balancing, circuit breaking, retries, and observability without each service implementing these capabilities. Istio and Linkerd are common implementations.

How do you document microservices architecture for governance?

Use ArchiMate Application Cooperation diagrams for the service landscape, UML Component diagrams for internal structure, UML Sequence diagrams for key flows, and UML Deployment diagrams for Kubernetes topology. All views can coexist in Sparx EA with full traceability.

What is the difference between choreography and orchestration in microservices?

Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.