Distributed systems rarely fail because we forgot a queue. They fail because we lied to ourselves about time.
That’s the uncomfortable truth behind most “eventual consistency” conversations. Teams draw neat service boundaries, introduce Kafka, split a monolith into microservices, and declare victory. Then the business asks a simple question: “When a customer changes their address, when is it really changed?” Not philosophically. Operationally. Can we ship? Can we invoice? Can we pass an audit? Can customer support trust the screen in front of them?
This is where architecture stops being a technical diagram and becomes a semantic contract with the business.
The old fantasy was immediate consistency everywhere. One database, one transaction, one truth. The newer fantasy is that eventual consistency is free as long as we publish some events. Both are wrong. In real enterprises, truth moves in stages. Some facts are authoritative immediately in one domain, informative later in another, and only fully trusted after reconciliation. That is not a flaw. It is often the only sane design.
So the architecture we need is not merely “eventual consistency.” It is gradual consistency: a deliberate model in which data and decisions move across bounded contexts with explicit stages of confidence, authority, and business usability over time.
That distinction matters. Eventual consistency talks about convergence. Gradual consistency talks about the business journey from “accepted” to “usable” to “settled.” The first is a property of infrastructure. The second is a property of the enterprise.
This article lays out how to design for that journey: domain-driven design thinking, migration reasoning, reconciliation, Kafka-based propagation, progressive strangler migration, operational controls, and the nasty failure modes nobody puts on the first architecture slide.
Context
Large organizations do not have one system of record. They have a patchwork of systems that believe they are the system of record.
ERP owns inventory valuation. CRM owns customer interactions. Billing owns invoices. Order management owns fulfillment workflow. Risk owns approvals. Identity owns customer profile primitives. Every one of these domains has valid reasons to reject another domain’s model.
Domain-driven design helps here because it gives us language sharper than “master data” and “sync.” A customer in the identity domain is not the same thing as a bill-to account in finance or a party in contracts. An order accepted by commerce is not equivalent to an order booked by ERP or a shipment released by warehouse operations. Same words, different semantics. That’s where distributed architectures break: not at the transport layer, but at the semantic seams.
Gradual consistency is really about respecting those seams.
In a well-designed enterprise architecture, a fact enters the system at a point of authority, then radiates outward to other bounded contexts. Each receiving context transforms that fact into its own model, applies local rules, maybe delays action, maybe enriches the data, and eventually reaches a state where it can act with sufficient confidence. There is no universal “updated.” There are staged milestones.
Think of it less like flipping a switch and more like money clearing through a banking network. The payment is initiated, authorized, posted, settled, then reconciled. Different parties can see it at different times with different levels of legal and operational confidence. Enterprise data behaves the same way.
Problem
Most distributed system designs force a false binary:
- either we preserve strong consistency through synchronous calls and distributed coordination,
- or we accept eventual consistency and tell the business to live with staleness.
That binary is lazy architecture.
In reality, businesses care about which decisions can be made at which consistency stage. A customer name correction may be safe to show in service channels before it reaches invoicing. Inventory reservations may allow soft allocation before warehouse confirmation. Fraud scoring may block shipment until a later state even though order acceptance has completed. “Consistent enough” is not one thing; it’s tied to domain semantics.
The classic microservices failure pattern goes like this:
- Split the monolith into services.
- Put Kafka between them.
- Publish domain events.
- Assume downstream services will “eventually catch up.”
- Discover that downstream processes trigger legal, financial, or customer-facing actions before the data is mature enough.
Now you have duplicated workflows, race conditions, compensations everywhere, support teams with spreadsheets, and executives wondering why the modern platform needs more manual intervention than the mainframe it replaced.
The issue isn’t asynchronous messaging itself. Kafka is excellent for decoupling, throughput, replay, and audit trails. The issue is pretending transport guarantees are the same as business guarantees.
A topic can be durable and your enterprise still be semantically inconsistent.
Forces
Several forces push us toward gradual consistency whether we acknowledge it or not.
1. Domain autonomy
Bounded contexts must own their models and rules. If every service must synchronously validate against every other service before acting, autonomy collapses into a distributed monolith.
2. Latency and availability
Cross-service synchronous orchestration gives the illusion of immediate certainty, but under load or partial failure it creates brittle dependency chains. Enterprises with global traffic, peak loads, and regional failover cannot build every business flow as a single request thread.
3. Different clocks for different business processes
Shipping, billing, compliance, support, analytics, and notifications operate on different tolerances. A warehouse can wait 30 seconds. Fraud may need 500 milliseconds. Monthly revenue recognition can wait hours if reconciliation is correct.
4. Heterogeneous systems
You do not get to redesign SAP, a claims platform, a warehouse management system, and three acquired SaaS products into one elegant consistency model. Architecture exists because reality is messy.
5. Audit and traceability
If facts evolve over time, we need provenance: what was known, when, by whom, and which downstream actions were taken on preliminary versus settled data.
6. Migration constraints
The hardest part is usually not the target architecture. It’s the years of coexistence with the legacy estate. During migration, both old and new paths may produce updates. That creates the dangerous period where consistency semantics are not only gradual but contested.
This is why migration strategy is architecture, not project management.
Solution
The core idea is straightforward:
Model consistency as a staged domain concept, not an accidental side effect of asynchronous integration.
That means four things.
1. Introduce explicit business states of data maturity
Stop exposing a generic “updated” event. Instead model states such as:
- Accepted
- Validated
- Enriched
- Authorized
- Applied
- Reconciled
- Settled
These are not technical states. They are domain states. A downstream service can decide which state is sufficient for its decision.
For example, in customer profile management:
- Accepted: request received and syntax-valid
- Verified: identity checks passed
- Applied: source-of-authority context updated
- Propagated: dependent views updated
- Reconciled: downstream systems confirmed semantic alignment
That language changes architecture discussions immediately. Instead of arguing about “real-time sync,” teams can ask, “Can customer support use Applied data, or must they wait for Reconciled?”
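To make that concrete, here is a minimal sketch in Python. The `Maturity` enum, the `REQUIRED_MATURITY` table, and the decision names are illustrative assumptions, not a prescribed API:

```python
from enum import IntEnum

class Maturity(IntEnum):
    """Business states of data maturity, ordered from weakest to strongest."""
    ACCEPTED = 1
    VERIFIED = 2
    APPLIED = 3
    PROPAGATED = 4
    RECONCILED = 5

# Each business decision declares the minimum maturity it can tolerate.
REQUIRED_MATURITY = {
    "show_in_support_ui": Maturity.APPLIED,
    "generate_invoice": Maturity.RECONCILED,
}

def may_perform(decision: str, current: Maturity) -> bool:
    """A downstream service checks whether the data is mature enough to act."""
    return current >= REQUIRED_MATURITY[decision]
```

The value of the ordering is that each decision names the weakest state it accepts, so the policy lives in one visible table instead of being implied by scattered call sites.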
2. Separate authoritative writes from distributed read usefulness
In gradual consistency, one bounded context becomes authoritative for a change first. Others consume the change and evolve toward local readiness. This often means:
- authoritative command handling in one service,
- event-driven propagation through Kafka,
- local projections or materialized views in downstream services,
- reconciliation for late, missing, conflicting, or transformed updates.
3. Design for reconciliation as a first-class capability
Reconciliation is not a cleanup job for bad implementations. It is an architectural pillar.
Events may arrive out of order. Consumers may be down. Upstream systems may reissue corrected facts. Legacy systems may produce contradictory updates. Business identifiers may merge or split. If your architecture assumes the stream alone guarantees semantic correctness, it will fail in production.
You need scheduled or continuous reconciliation that compares intended state, observed state, and domain invariants.
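A reconciliation pass over intended versus observed state can be sketched like this. Plain dictionaries stand in for the authoritative store and a downstream projection, and the divergence categories are illustrative:

```python
def reconcile(authoritative: dict, downstream: dict) -> dict:
    """Compare intended state against observed state, keyed by entity id.

    Returns a map of divergences: (kind, expected, observed).
    """
    divergences = {}
    for key, expected in authoritative.items():
        observed = downstream.get(key)
        if observed is None:
            divergences[key] = ("missing", expected, None)   # never arrived
        elif observed != expected:
            divergences[key] = ("mismatch", expected, observed)
    # Entities the source never asserted are divergences too.
    for key in downstream.keys() - authoritative.keys():
        divergences[key] = ("orphan", None, downstream[key])
    return divergences
```

In a real system each divergence would feed a repair action: replay, re-map, compensate, or raise a manual exception.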
4. Make time visible
Every important fact should carry:
- event time,
- effective business time,
- processing time,
- source of authority,
- version or sequence,
- correlation and causation identifiers.
Without temporal metadata, “latest” becomes guesswork. In distributed systems, guesswork is expensive.
Architecture
A typical gradual consistency architecture has a few distinct layers.
Command and authority layer
A source bounded context handles the command and persists the authoritative change. This is where domain invariants that must be atomic are enforced. Use local ACID transactions here. Save the heroics for where they matter.
Usually this includes an outbox pattern so domain events are atomically recorded with the state change, then published to Kafka.
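A minimal outbox sketch, using SQLite in place of a real database: the business row and the event record commit in one local transaction, and a separate relay process would later read unpublished outbox rows and publish them to Kafka. Table and event names here are hypothetical:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, address TEXT)")
conn.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0)""")

def change_address(customer_id: str, address: str) -> None:
    # The state change and the event record succeed or fail together.
    with conn:
        conn.execute(
            "INSERT INTO customer (id, address) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET address = excluded.address",
            (customer_id, address))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("CustomerCorrespondenceAddressChanged",
             json.dumps({"customer_id": customer_id, "address": address})))

change_address("c-42", "1 Main St")
```

The crucial property is atomicity within one database: there is no window where the state changed but the event was lost, or vice versa.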
Event backbone
Kafka provides durable propagation, replay, partitioned ordering per key, and decoupling. It is not the source of truth by itself, but it is often the source of distribution truth. Topics should be aligned to domain events, not generic CRUD noise.
Consumer bounded contexts
Each downstream service consumes the event, maps it into its own ubiquitous language, and updates local state. Some consumers produce immediate projections. Others perform validation, enrichment, risk checks, or derived calculations before exposing a usable state.
Reconciliation and exception handling
A separate capability compares authoritative state and downstream state, identifies divergence, and either replays, repairs, compensates, or raises manual exceptions.
Operational observability
The enterprise needs visibility into propagation lag, state maturity, reconciliation backlog, poison events, duplicate handling, and semantic drift.
The basic flow: a command lands in the authoritative context, the outbox publishes typed domain events to Kafka, each consuming context projects them into its own model and advances its local maturity state, and reconciliation verifies alignment across the estate.
This looks conventional. The difference is in what the events and downstream states mean.
A downstream service should not simply mark “customer updated.” It should expose whether the customer data is merely propagated, locally validated, or fully reconciled. That maturity can be reflected in APIs, UI indicators, workflow guards, and operational dashboards.
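One way to surface that maturity at the API boundary, sketched with hypothetical field names:

```python
def customer_view(record: dict) -> dict:
    """Shape an API response that exposes consistency state alongside data,
    so callers can decide whether it is usable for their purpose."""
    return {
        "customer_id": record["id"],
        "address": record["address"],
        "consistency": {
            "state": record["maturity"],          # e.g. "propagated"
            "as_of": record["last_event_time"],
            "reconciled": record["maturity"] == "reconciled",
        },
    }
```

A UI can render the `consistency` block as a badge; a workflow engine can use it as a guard condition.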
Gradual timeline
Picture the timeline concretely: at T0 the authoritative context applies the change; within seconds the fact propagates to consumers; over the following minutes each consumer validates and exposes it locally; and only after reconciliation, possibly hours later, is it trusted across the whole estate.
The key idea is that the source domain can truthfully say “the change is applied” before the rest of the estate is reconciled. That is not dishonesty. It is precision.
Domain semantics matter more than message flow
This architecture only works if each event is modeled as a meaningful domain fact.
Bad event:
CustomerUpdated
Better events:
- CustomerLegalNameChanged
- CustomerCorrespondenceAddressChanged
- CustomerMerged
- CustomerCommunicationPreferenceRevoked
Those are different facts with different authority, compliance implications, and downstream uses. Billing may care about legal name. Marketing may care about communication preference. Fulfillment cares about correspondence or delivery address. Using one catch-all update event guarantees coupling and ambiguity.
DDD earns its keep here. Bounded contexts should publish events in terms of their own ubiquitous language, while anti-corruption layers translate into local meaning for consumers. That translation is often where gradual consistency stages are introduced.
Migration Strategy
Most enterprises will not build this greenfield. They will crawl toward it from a monolith, ESB-centric estate, or tightly coupled service mesh.
The right migration pattern is almost always a progressive strangler. Not a big bang. Big bang migrations are the architectural equivalent of replacing an aircraft engine in mid-flight by throwing away the wings.
A pragmatic sequence looks like this:
Step 1: Identify domain seams and authority boundaries
Start with one business capability where eventual propagation is acceptable but semantics matter. Customer profile, product catalog, order status, and pricing reference data are common candidates. Not payment ledger first. Not core inventory valuation. Earn credibility before you touch the bones.
Define:
- source of authority,
- consumer domains,
- business states of maturity,
- invariants that remain local versus distributed.
Step 2: Add an outbox to the existing system
Before replacing behavior, publish authoritative domain events from the current system. This lets downstream teams build projections and consumers without moving the write path yet.
Step 3: Build downstream read models and maturity-aware consumers
Consumers should not just mirror the source schema. They should create local models fit for their bounded context. Introduce readiness states and local validation rules now.
Step 4: Introduce reconciliation early
Do not wait until go-live to discover drift. Run the stream path and reconciliation side by side with legacy integrations. Measure divergence. This is your architecture’s honesty test.
Step 5: Shift selected decisions to the new path
Move low-risk decisions first. Perhaps customer support reads from the new profile view while billing still uses legacy synchronization. Then move notifications. Then operational workflows. Then harder transactional dependencies.
Step 6: Retire point-to-point sync and duplicate writes
Only after reconciliation confidence is high should you remove direct integrations or shared database dependencies.
The migration timeline typically runs in phases: outbox events flow alongside legacy synchronization, reconciliation runs in shadow mode to measure drift, selected decisions shift to the new path, and duplicate writes are retired last.
The uncomfortable phase is coexistence. Legacy batch jobs still run. New consumers subscribe to events. Some downstream systems trust the new state, others don’t. This is where teams get tempted to simplify. Resist that temptation. Coexistence is not a temporary inconvenience; it is the actual architecture for a significant period.
And in that phase, versioning and source precedence rules are non-negotiable. If both legacy and modern paths can update the same business concept, you must define conflict resolution explicitly. “Last write wins” is often just a prettier way to say “we accepted data corruption.”
Enterprise Example
Consider a global insurer modernizing customer and policy servicing.
They had a central policy administration platform, a CRM, a claims system, a billing platform, and regional document generation engines. A customer address change looked trivial on paper. In practice, it had at least five meanings:
- legal address for regulated notices,
- correspondence address for general communication,
- risk location for underwriting,
- billing address for invoices,
- claims contact address for active incidents.
The old estate used nightly batch synchronization and some point-to-point SOAP integrations. The modernization program introduced microservices and Kafka. The first instinct was a single CustomerUpdated event. That design was seductive and wrong.
Why? Because changing the risk location on an active policy triggers underwriting review. Changing correspondence address should not. Billing might legally continue to use the previous cycle’s address until invoice cut-off. Claims may freeze contact changes during a fraud investigation. Same customer, different domain semantics.
The eventual design made the customer profile service authoritative for identity and communication preferences, but not for all policy-relevant address meanings. Events were split into explicit domain facts. The policy domain consumed those facts and decided whether they were informative, actionable, or blocked pending review.
The architecture worked roughly like this:
- Customer Profile handled address commands and emitted typed events.
- Policy Service translated them into policy-party concepts.
- Billing Service consumed only billing-relevant events and applied cut-off rules.
- Claims Service maintained a local customer contact projection but could reject propagation during active investigation states.
- A reconciliation service compared customer profile, policy party records, and billing account state daily and on-demand.
The practical win was not that every system updated instantly. It was that every system exposed the status of the update clearly. Customer support could say: “Your mailing address is updated for service communications now. Policy risk address is under underwriting review. Billing will reflect the new address from the next cycle.” That is gradual consistency expressed in business language.
Executives liked it because support calls dropped. Auditors liked it because provenance was explicit. Engineers liked it because they no longer had to fake cross-domain atomicity.
Operational Considerations
This style of architecture lives or dies by operations.
Lag is a business metric
Track not just consumer lag in Kafka, but business readiness lag:
- time from authoritative apply to local usability,
- time from event publication to reconciliation,
- percentage of entities stuck in intermediate states,
- age of unresolved divergences.
A queue-depth graph is useful to engineers. A “12,000 orders pending reconciliation for more than 30 minutes” graph is useful to the enterprise.
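A business readiness report over entity states might be computed like this; state names and thresholds are illustrative:

```python
from datetime import datetime, timedelta, timezone

def readiness_lag_report(entities: list[dict], now: datetime,
                         threshold: timedelta) -> dict:
    """Business-level lag: how many entities sit in intermediate states,
    and how many have been stuck there longer than the threshold."""
    pending = [e for e in entities if e["state"] != "RECONCILED"]
    stuck = [e for e in pending if now - e["applied_at"] > threshold]
    return {"pending": len(pending), "stuck": len(stuck)}
```

Unlike consumer lag, this metric is expressed in entities and business states, which is the language the enterprise actually cares about.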
Idempotency is table stakes
Consumers must safely handle duplicates. Reconciliation replays will happen. Retries will happen. Producers may publish corrected versions. If applying the same event twice breaks state, you do not have a distributed architecture. You have a time bomb.
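A minimal idempotent (inbox-style) consumer sketch, deduplicating on an assumed `event_id` field; a production version would persist the processed-id set transactionally with the state change:

```python
class IdempotentConsumer:
    """Duplicate-safe processing: each event is applied at most once."""

    def __init__(self) -> None:
        self.processed: set[str] = set()
        self.state: dict[str, str] = {}

    def handle(self, event: dict) -> bool:
        """Apply the event; return False if it was a duplicate."""
        if event["event_id"] in self.processed:
            return False  # already applied: replays and retries are harmless
        self.state[event["key"]] = event["value"]
        self.processed.add(event["event_id"])
        return True
```

With this in place, reconciliation replays and producer retries become safe operational tools instead of corruption risks.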
Ordering is local, not global
Kafka can preserve order per partition key, not across the universe. Choose keys according to domain consistency needs. For customer-centric facts, customer ID may be enough. For policy facts, policy ID may be the stronger key. If an operation spans identities, be explicit about orchestration and compensation.
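The per-key guarantee follows from how keyed events map to partitions. A sketch of the principle only: Kafka’s Java client actually uses murmur2 hashing, and CRC32 here merely stands in to illustrate it:

```python
import zlib

def partition_for(key: str, num_partitions: int = 12) -> int:
    """Events sharing a key always land on the same partition, so ordering
    is preserved per key only, never across keys."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

This is why key choice is a domain decision: whatever you key by is the unit within which you get ordering, and nothing more.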
Versioning and schema evolution
Domain events live longer than teams expect. Additive change is easiest. Semantic change is dangerous. If the meaning of a field changes, treat it as a new event version or even a new event type. Backward compatibility in transport is not enough if domain semantics shift.
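One way to keep old events replayable is to dispatch on an explicit `(type, version)` pair, so a semantic change ships as a new handler rather than silently reinterpreting old payloads. A sketch with hypothetical event names:

```python
# Each (event type, version) pair keeps its own interpretation forever,
# so replaying years-old events still produces correct local state.
HANDLERS = {
    ("CustomerLegalNameChanged", 1): lambda p: {"name": p["name"]},
    # v2 renamed the field and tightened its meaning; old events are untouched.
    ("CustomerLegalNameChanged", 2): lambda p: {"name": p["legal_name"]},
}

def dispatch(event: dict) -> dict:
    handler = HANDLERS[(event["type"], event["version"])]
    return handler(event["payload"])
```

Transport-level schema compatibility checks cannot catch a field whose meaning changed; only explicit versioning like this makes the shift visible.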
Reconciliation needs ownership
Who investigates drift? Which team repairs bad mappings? Which state wins after conflict? If reconciliation is everyone’s job, it is no one’s job.
Surface consistency state in user experience
This is often overlooked. If a UI shows stale or provisional data without indication, operations people lose trust. Good enterprise systems expose badges or statuses like:
- Updated
- Pending downstream application
- Awaiting review
- Reconciled
- Exception requires intervention
That sounds mundane. It is architecture. Trust is part of the system design.
Tradeoffs
Gradual consistency is powerful, but it is not free.
What you gain
- Better service autonomy
- Higher resilience than synchronous dependency chains
- Clearer domain ownership
- Better auditability and replay support
- Realistic migration path from monoliths and legacy estates
- Ability to scale read use cases independently
What you pay
- More state modeling
- More operational complexity
- Reconciliation infrastructure
- Temporal debugging complexity
- Hard conversations with the business about consistency stages
- Longer tail of intermediate states and exceptions
This architecture replaces hidden coupling with visible complexity. That is usually a good trade. Hidden complexity is worse because it only appears at 2 a.m.
Still, architects should say the quiet part out loud: this approach demands more discipline than shared database integration or naive request/response orchestration. If the organization cannot sustain event design, observability, and operational ownership, the architecture will decay fast.
Failure Modes
The most dangerous failures are semantic, not technical.
1. Generic events with no domain meaning
EntityUpdated is architecture malpractice. It forces downstream consumers to reverse-engineer intent and over-couple to source schemas.
2. No explicit maturity states
If all downstream data looks equally authoritative, users and services make decisions too early. This creates legal, financial, and customer service errors.
3. Reconciliation treated as optional
Without reconciliation, drift accumulates silently. The stream tells you what should have happened. Reconciliation tells you what actually did.
4. Conflicting authorities during migration
A monolith updates address. A new service updates address. Both publish events. Nobody defined precedence. Now support screens disagree and audit trails become fiction.
5. Distributed transactions by stealth
Teams frightened by eventual propagation sneak synchronous validations back into every flow. Soon each command depends on five services being healthy. Congratulations: you rebuilt the monolith with more network hops.
6. Local models that are mere copies
If downstream services simply replicate the source model, they inherit upstream semantics accidentally and lose bounded context integrity. That is data distribution, not domain design.
7. No business-level observability
Infrastructure dashboards stay green while customers sit in “pending” states for hours. The technology is healthy; the business process is broken.
When Not To Use
Not every problem wants gradual consistency.
Do not use it for:
- hard financial ledger posting requiring atomic invariants,
- tiny systems with one team and one database,
- flows where human or legal consequences make intermediate states unacceptable,
- organizations without operational maturity for event-driven systems,
- domains where strong consistency is central to business value and scale does not justify distribution.
If you have a small internal application with modest load and a cohesive domain model, a modular monolith with one database is often the better architecture. There is no prize for using Kafka where a transaction would do.
Likewise, if a business invariant absolutely must hold synchronously across data elements, keep that invariant inside one bounded context. Domain-driven design does not require microservices. In fact, it often warns against them.
The best architecture is not the most distributed one. It is the one that matches the business truth at the lowest operational cost.
Related Patterns
Gradual consistency often appears alongside several well-known patterns:
- Outbox Pattern for atomic state change plus event publication
- Inbox / Idempotent Consumer for duplicate-safe processing
- Saga for long-running cross-domain workflows with compensations
- CQRS for separating authoritative writes from read models
- Materialized Views for local usability in consuming services
- Anti-Corruption Layer for translating events across bounded contexts
- Strangler Fig Pattern for incremental migration from legacy systems
- Event Sourcing, sometimes, though it is not required and often overused
- Master Data Management, but only where true cross-domain governance is needed; DDD usually prefers local meanings over central semantic imperialism
A useful rule of thumb: use these patterns to support domain semantics, not to avoid thinking about them.
Summary
Distributed consistency is not a yes-or-no property. In enterprise systems, it is a timeline.
A fact is accepted here, applied there, validated elsewhere, and trusted globally only after reconciliation. Designing as if that journey does not exist leads either to brittle synchronous webs or to sloppy event-driven optimism. Neither survives contact with real operations.
Gradual consistency offers a better path. It says:
- define authority by bounded context,
- model business states of data maturity explicitly,
- propagate facts through an event backbone such as Kafka,
- let downstream services decide local readiness,
- build reconciliation as a first-class architectural capability,
- migrate progressively with a strangler strategy,
- expose consistency state to both systems and humans.
That is the real lesson. Consistency is not just about whether data converges. It is about when the business is allowed to trust it.
Good architects design for that trust.
And the best ones make time visible before time makes liars of everyone.
Frequently Asked Questions
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.