Distributed systems fail the way old buildings leak: not all at once, not dramatically at first, but through seams, joints, and assumptions. A timeout here. A partial write there. A queue that looks healthy until you notice the lag is measured in hours. In microservices, the interesting question is rarely _whether_ failure will happen. It is where you choose to absorb it, how much business meaning survives it, and whether the system fails like a trained pilot or a panicked crowd.
That is why resilience in microservices should not be treated as a bag of technical tricks. Retry policies, circuit breakers, bulkheads, dead-letter topics, timeouts, idempotency keys, compensations, reconciliation jobs—these are not independent gadgets. They are layers. And if you stack them without understanding the domain, they will fight each other. The result is a system that is “resilient” in the same way a traffic jam is orderly.
The core mistake many teams make is building resilience as infrastructure theater. They install a service mesh, switch on retries, add Kafka, sprinkle a circuit breaker library around synchronous calls, and declare victory. Then an order is charged twice, an inventory reservation is never released, and the CFO learns what “at least once delivery” really means. Resilience is not achieved by multiplying mechanisms. It is achieved by placing the right mechanisms at the right layer, in service of business semantics.
This article argues for a layered view of resilience in microservices architecture: transport-level defenses, interaction-level controls, workflow-level recovery, and domain-level reconciliation. We will look at the forces that shape the design, the migration path from a brittle estate, where Kafka fits, and where this pattern should not be used. The punchline is simple: retries protect calls, but reconciliation protects truth.
Context
Microservices architecture changed the failure profile of enterprise systems. In the monolith, a method call was a method call. In microservices, a method call is a negotiation with the network, the scheduler, the DNS resolver, TLS, the downstream service, its database, and often its Kafka consumer lag. We replaced in-process complexity with distributed complexity. Sometimes that is the right trade. Often it is. But it is still a trade.
The modern enterprise stack amplifies this. Services talk synchronously over HTTP or gRPC for request-response needs. They publish and consume events via Kafka for decoupled workflows. They cache data locally to improve latency. They separate bounded contexts using domain-driven design, which is healthy, but also means no single service has the full truth. Each team owns its data. Each service is autonomous. Each failure is local until it becomes systemic.
This is not an argument against microservices. It is an argument against innocence.
Resilience layers become necessary because different failures demand different responses. A transient network timeout may justify a retry. A downstream pricing service that is melting under load needs a circuit breaker and a fallback. A lost “shipment-confirmed” event must be corrected by reconciliation because no amount of retry after the fact will reconstruct the business state. A payment authorization that succeeded in the PSP but failed before local persistence needs domain-specific recovery, not another HTTP attempt.
Domain-driven design matters here because resilience is not neutral. The same technical failure has different business meanings depending on the bounded context. In Customer Notifications, dropping one low-priority email may be acceptable. In Payments, ambiguity is poison. In Inventory, stale reads may be tolerable for browsing but unacceptable for final reservation. The resilience design must follow those semantics.
Problem
A single resilience mechanism, used everywhere, creates more trouble than it solves.
Take retries. They are the first instinct in distributed systems, and often the wrong first move. Retries can smooth transient faults. They can also multiply traffic against a struggling dependency, produce duplicate commands, and stretch latency beyond what users or upstream systems can bear. A retry without idempotency is just a duplicate generator with nice branding.
Circuit breakers have similar mythology. They are useful, but they do not “solve availability.” They stop repeated calls to an unhealthy dependency and let the caller degrade fast. That is valuable. But a circuit breaker around a critical write operation with no fallback may only convert long failures into short failures. If the business process cannot proceed without that dependency, the breaker is a controlled refusal, not recovery.
Kafka is often drafted in as the universal resilience answer. It is not. Kafka gives durable, ordered streams within a partition, decoupling, and replay capability. Excellent. But if services publish events from application code before the local transaction commits, or consume without idempotency, or assume event delivery is exactly once in the business sense, Kafka becomes a highly efficient machine for preserving mistakes.
The architectural problem is deeper: enterprises frequently mix transport concerns, workflow concerns, and domain concerns into one undifferentiated “error handling” layer. That leads to pathological behavior:
- HTTP client retries on timeout
- API gateway retries the same request again
- service mesh retries beneath both
- Kafka producer retries after partial acknowledgment
- consumer retries poison messages indefinitely
- batch reconciliation later “fixes” records in bulk
Every layer is trying to help. Collectively, they create retry storms, duplicate side effects, hidden latency, and semantic confusion.
The remedy is to design resilience as a stack, with clear responsibilities and explicit handoffs between layers.
Forces
Several forces push architecture in different directions.
Availability versus correctness
You can return something quickly, or you can return something correct, and in distributed systems the tension is permanent. A product catalog can tolerate stale data to preserve availability. Payment settlement cannot.
User experience versus operational safety
Users hate slow systems. Operators hate systems that amplify overload. Aggressive retries may improve success rates in light turbulence but cause collapse under real stress. The architecture must know when to stop trying.
Local autonomy versus end-to-end consistency
Microservices and bounded contexts encourage local ownership. Good. But business processes cross service boundaries. Orders, payments, inventory, fulfillment, billing—these form a workflow whether teams like it or not. Local resilience mechanisms are not enough. Cross-context reconciliation becomes essential.
Synchronous certainty versus asynchronous decoupling
HTTP gives immediate answers and simpler consumer logic. Kafka gives decoupling and temporal resilience. Most enterprises need both. The challenge is deciding where immediate feedback is required by domain semantics and where eventual consistency is acceptable.
Generic platform controls versus domain-specific recovery
Platform teams love reusable mechanisms. They should. Timeouts, retries, circuit breakers, rate limits, and dead-letter handling belong in the platform toolbox. But business recovery rarely does. “Retry payment” and “rebuild projection” are not the same class of action. The domain model must shape the final layer.
Migration speed versus architectural purity
No enterprise begins with a clean slate. There is always a monolith, a vendor package, a mainframe, or a heroic database schema holding the business together with years of sediment. Resilience layering must be introduced incrementally, often through strangler patterns, anti-corruption layers, and selective event publication rather than grand rewrites.
Solution
The practical answer is a resilience layer stack, where each layer handles a distinct class of problems and stops before it trespasses into the next.
- Transport Layer Resilience
Timeouts, bounded retries with backoff and jitter, connection pooling, TLS/DNS robustness. This layer deals with transient communication faults.
- Interaction Layer Resilience
Circuit breakers, bulkheads, rate limiting, load shedding, fallbacks where semantically valid. This layer prevents local failures from cascading and protects dependencies under stress.
- Messaging Layer Resilience
Transactional outbox, idempotent consumers, dead-letter queues/topics, replay, poison message handling, partition strategy. This layer makes asynchronous communication durable and recoverable.
- Workflow Layer Resilience
Sagas, compensations, explicit process state, timeout handling, human intervention queues. This layer manages multi-step business processes across bounded contexts.
- Domain Truth Layer
Reconciliation, ledger-style audit trails, invariant checks, re-drives from source of record, exception management. This layer restores business correctness when the lower layers could not preserve it.
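As a minimal sketch of the transport layer, here is what bounded retries with exponential backoff and full jitter could look like in Python. The `call_with_retry` helper, the `flaky` stub, and all parameter names are illustrative, not part of any particular library; the key properties are a hard attempt cap and randomized delays.

```python
import random
import time

def call_with_retry(op, max_attempts=3, base_delay=0.1, max_delay=2.0,
                    retryable=(TimeoutError,), sleep=time.sleep):
    """Bounded retry with full jitter. `op` must be safe to repeat."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except retryable:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure, do not loop forever
            # Full jitter: sleep a random amount up to the exponential cap.
            cap = min(max_delay, base_delay * (2 ** (attempt - 1)))
            sleep(random.uniform(0, cap))

# Example: a dependency that fails twice with a transient fault, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = call_with_retry(flaky, sleep=lambda s: None)  # no real sleeping in the demo
```

Note what the sketch does not do: it never retries exceptions outside the declared `retryable` set, because "safe to repeat" is a domain decision, not a transport default.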
The important point is that these layers are not substitutes. They are concentric defenses. Retry can address a temporary TCP hiccup. It cannot determine whether an order has been legally committed. Reconciliation can detect a missing shipment event. It cannot protect a thread pool from collapse. Each layer must know its job.
A good architecture is clear about where truth lives. In DDD terms, each bounded context owns its aggregate consistency rules. Cross-context truth is never instantaneous and rarely complete. That is why reconciliation is not a shameful afterthought; it is the inevitable partner of distributed autonomy.
Architecture
A useful mental model is to treat requests and events differently but govern them with the same semantic discipline.
Synchronous path
For request-response interactions, the call chain should be short, explicit, and defensive. Set strict timeouts. Apply limited retries only for transient, safe-to-repeat operations. Use circuit breakers to fail fast when a dependency is unhealthy. Use bulkheads to isolate resource pools so one dependency cannot consume all threads or connections. If you provide fallback data, ensure it is meaningful to the domain rather than merely technically available.
For example, in a pricing query for a product page, a stale cache fallback may be perfectly acceptable. In a final checkout tax calculation, stale data may create legal and financial risk. Same downstream dependency, different semantics.
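A minimal circuit breaker illustrating that distinction might look like the following Python sketch. The class name, thresholds, and the injected `clock` are assumptions for testability; production systems would typically use a library such as Resilience4j rather than hand-rolling this. The point is the state machine: closed, open (fail fast or fall back), half-open (probe again).

```python
import time

class CircuitBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures,
    allows a probe again after `reset_after` seconds (half-open)."""
    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, op, fallback=None):
        if self.state == "open":
            if fallback is not None:
                return fallback()  # only valid where the domain tolerates it
            raise RuntimeError("circuit open: failing fast")
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result

# Demo with a controllable clock: two failures open the breaker.
now = [0.0]
breaker = CircuitBreaker(threshold=2, reset_after=30.0, clock=lambda: now[0])
def failing():
    raise TimeoutError("dependency down")

for _ in range(2):
    try:
        breaker.call(failing)
    except TimeoutError:
        pass
# Breaker is now open: the next call fails fast or uses the fallback.
stale = breaker.call(failing, fallback=lambda: "stale-cached-price")
```

Whether that `fallback` is acceptable is exactly the pricing-page-versus-tax-calculation question: the mechanism is generic, the permission to use it is domain policy.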
Asynchronous path
For event-driven interactions, durability and replay matter more than immediacy. Kafka is powerful here, especially for domain events, integration events, and stream processing. But the architecture must protect against common traps:
- Use the transactional outbox to avoid dual writes between database and Kafka.
- Design consumers to be idempotent, because duplicates happen.
- Separate business retries from infrastructure retries.
- Use dead-letter topics carefully; they are a parking lot, not a solution.
- Retain source events long enough to rebuild projections and support reconciliation.
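The first bullet, the transactional outbox, can be sketched in a few lines. This uses an in-memory SQLite database as a stand-in for the service's real store, and a `publish` callback as a stand-in for a Kafka producer; table and function names are hypothetical. The essential move is that the state change and the event record commit in one local transaction, and a separate relay publishes committed rows at least once.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, topic TEXT, payload TEXT, published INTEGER DEFAULT 0)""")

def place_order(order_id):
    # State change and event record commit atomically, or not at all.
    # No dual write: Kafka is never touched inside this transaction.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'ACCEPTED')", (order_id,))
        db.execute(
            "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "orders",
             json.dumps({"event": "OrderPlaced", "orderId": order_id})),
        )

def relay(publish):
    # A separate poller publishes committed rows. If publish succeeds but the
    # flag update fails, the row is republished: at-least-once by design.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # e.g. a Kafka producer send
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

published = []
place_order("o-1")
relay(lambda topic, payload: published.append((topic, payload)))
```

The at-least-once delivery of the relay is precisely why the second bullet, idempotent consumers, is non-negotiable.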
End-to-end control
The synchronous and asynchronous worlds must meet in a coherent process model. A service may accept a customer command synchronously, persist intent, emit an event asynchronously, and complete downstream steps over time. The user experience might be “Order received,” while actual fulfillment remains pending. This is not a technical compromise. It is the domain being honest.
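"Persist intent, complete over time" implies explicit process state. A bare-bones sketch, with hypothetical state names for an order intake flow, might look like this; the point is that transitions are enumerated and illegal ones rejected, so "Order received" and "order fulfilled" can never be silently conflated.

```python
from enum import Enum

class OrderState(Enum):
    RECEIVED = "received"          # customer command accepted synchronously
    STOCK_PENDING = "stock_pending"  # OrderPlaced emitted, awaiting reservation
    CONFIRMED = "confirmed"
    FAILED = "failed"

# Legal transitions make process state explicit and auditable.
TRANSITIONS = {
    OrderState.RECEIVED: {OrderState.STOCK_PENDING, OrderState.FAILED},
    OrderState.STOCK_PENDING: {OrderState.CONFIRMED, OrderState.FAILED},
    OrderState.CONFIRMED: set(),   # terminal
    OrderState.FAILED: set(),      # terminal
}

class OrderProcess:
    def __init__(self):
        self.state = OrderState.RECEIVED

    def advance(self, target):
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target

p = OrderProcess()
p.advance(OrderState.STOCK_PENDING)  # event emitted, downstream work pending
p.advance(OrderState.CONFIRMED)      # InventoryReserved arrived
```

In a real system this state would be persisted with the aggregate and exposed to the UI, which is what makes the "Order received" answer honest rather than evasive.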
The ordering matters. Transport retries should happen before a circuit breaker opens unnecessarily, but must be tightly bounded. Circuit breakers should stop repeated abuse of an unhealthy dependency. Once an action is accepted into a business workflow, recovery shifts from transport mechanics to process state and domain correctness.
Domain semantics discussion
This is where architecture becomes less mechanical and more valuable.
In domain-driven design, aggregates define transactional boundaries and invariants. Resilience choices must respect those boundaries. If PaymentAuthorization is an aggregate with the invariant “an authorization reference may only be captured once,” then retries around capture commands must include idempotency keys aligned to that aggregate identity. If InventoryReservation allows temporary overbooking with later reconciliation, the architecture can prefer availability during browsing and stricter controls during reservation.
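To make the PaymentAuthorization example concrete: a capture command can carry an idempotency key derived from the aggregate identity, so a transport-level retry is recognized rather than re-executed. The `PaymentGatewayStub` below is a stand-in for a real PSP client (most providers do accept such keys, but the exact API varies); the key format is an assumption.

```python
class PaymentGatewayStub:
    """Stand-in for a PSP client. Real providers accept an idempotency key
    so a retried capture replays the stored result instead of charging again."""
    def __init__(self):
        self.captures = {}

    def capture(self, idempotency_key, authorization_ref, amount):
        if idempotency_key in self.captures:
            return self.captures[idempotency_key]  # duplicate: replay, no new charge
        result = {"ref": authorization_ref, "amount": amount, "status": "CAPTURED"}
        self.captures[idempotency_key] = result
        return result

def capture_once(gateway, order_id, authorization_ref, amount):
    # Key aligned to the aggregate identity: at most one capture
    # per authorization reference, exactly the invariant stated above.
    key = f"capture:{order_id}:{authorization_ref}"
    return gateway.capture(key, authorization_ref, amount)

psp = PaymentGatewayStub()
first = capture_once(psp, "o-42", "auth-7", 9900)
second = capture_once(psp, "o-42", "auth-7", 9900)  # transport retry, same key
```

The retry mechanics stay generic; what makes them safe is that the key encodes the aggregate's "only once" invariant.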
Too many systems treat all errors as equal. They are not.
- “Could not reach recommendation service” may justify a fallback.
- “Could not persist order aggregate” should fail immediately.
- “Payment provider timed out after request sent” creates ambiguity and requires inquiry/reconciliation, not blind retry.
- “Shipment event missing for completed order” requires cross-system audit.
Resilience is domain policy expressed through technical mechanisms.
Migration Strategy
No serious enterprise adopts this stack in one release. You migrate toward it, usually while the business continues to trade and the old estate continues to surprise you.
The sensible path is a progressive strangler migration.
Step 1: Stabilize the edges
Start where failures are visible: API calls, downstream integrations, and key user journeys. Add explicit timeouts, remove infinite waits, constrain retries, and install circuit breakers around the noisiest dependencies. This does not fix correctness, but it stops the bleeding.
Step 2: Introduce anti-corruption layers
Legacy systems often have muddy semantics. Wrap them. Build anti-corruption layers that translate between legacy contracts and your emerging domain model. This is where DDD earns its keep. You are not just mapping fields; you are protecting language and meaning.
Step 3: Add outbox-based event publication
If the monolith or legacy service currently updates its database and “best-effort publishes” events, stop. Introduce the transactional outbox pattern so changes become reliably publishable to Kafka. This creates a trustworthy event stream without immediate decomposition.
Step 4: Carve out bounded contexts by business volatility
Do not start with whatever is easiest technically. Start with domains where independent change matters: pricing, customer communications, fraud screening, fulfillment orchestration. Extract services around cohesive business capabilities and clear ownership.
Step 5: Move from call chains to event choreography selectively
Some synchronous chains should remain synchronous. Others should become asynchronous to reduce coupling and absorb spikes. Use Kafka where temporal decoupling is useful, where replay matters, and where eventual consistency fits the domain. Do not eventify everything.
Step 6: Introduce reconciliation before you think you need it
Teams often postpone reconciliation because it feels unglamorous. That is backwards. As soon as truth is split across services, you need a way to compare intended state with actual state and repair drift. Build exception queues, audit logs, and re-drive tooling early.
Step 7: Retire brittle integrations gradually
As more bounded contexts emit trusted events and own their data, let consumers migrate off direct database reads and bespoke polling integrations. This is the true strangler move: replacing hidden coupling with explicit contracts.
The point is not to achieve purity. The point is to move the estate from hidden failure to managed failure.
Enterprise Example
Consider a global retailer modernizing order management. The original estate was familiar: a large commerce platform, an ERP for inventory and finance, a warehouse management system, and a payment service provider. The online checkout path was a maze of synchronous calls. During peak campaigns, one slow dependency—often tax or inventory—would stretch response times, trigger retries at multiple layers, and create duplicate order submissions.
The company decided to decompose around bounded contexts: Ordering, Payments, Inventory, Fulfillment, and Customer Notifications.
What changed
- Ordering became the intake point for customer intent.
- Payments owned authorization and capture semantics.
- Inventory owned reservation truth.
- Fulfillment tracked shipment lifecycle.
- Notifications consumed events and remained explicitly best-effort.
The architecture used synchronous calls only where immediate answers were essential. Checkout still needed payment authorization feedback. But once an order was accepted, downstream fulfillment moved onto Kafka-driven workflows.
OrderService persisted the order and wrote an outbox record in the same transaction. Kafka published OrderPlaced. InventoryService consumed the event and attempted reservation idempotently. PaymentService handled authorization with a PSP, using idempotency keys per order-payment attempt. FulfillmentService acted only when payment and inventory states reached valid milestones.
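InventoryService's "attempted reservation idempotently" deserves a sketch. This is a simplified in-memory version, with hypothetical event fields; in production the processed-event record and the reservation would be committed in the same database transaction, so dedup state cannot drift from the side effect.

```python
processed = set()   # in production: a table, committed with the state change
reserved = {}

def handle_inventory_event(event):
    """Idempotent consumer: dedupe on a stable event id before acting,
    because at-least-once delivery means duplicates will arrive."""
    if event["event_id"] in processed:
        return False  # duplicate delivery; side effect already applied
    reserved[event["order_id"]] = event["quantity"]
    processed.add(event["event_id"])
    return True

evt = {"event_id": "e-1", "order_id": "o-1", "quantity": 2}
first = handle_inventory_event(evt)
second = handle_inventory_event(evt)  # e.g. Kafka redelivery after a rebalance
```

The return value distinguishes "did work" from "recognized a duplicate," which is also useful as an observability signal for duplicate rates.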
The resilience stack mattered most during failure.
A PSP timeout after an authorization request was a classic ambiguity. Did the provider authorize or not? Blind retry risked double authorization. So the Payments bounded context did not simply retry the business command. It moved the payment into PENDING_INQUIRY, triggered an inquiry workflow, and reconciled with provider records. That is real resilience: not pretending ambiguity does not exist.
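The shape of that inquiry workflow can be sketched as two small steps; status names and record fields here are illustrative, not the retailer's actual model. The timeout handler refuses to guess and parks the payment; reconciliation then adopts the provider's record as the source of truth.

```python
def handle_authorization_timeout(payment):
    """Ambiguous outcome: the request was sent, the answer was lost.
    Do not retry the business command; open an inquiry instead."""
    payment["status"] = "PENDING_INQUIRY"
    return {"action": "inquire_with_provider", "payment_id": payment["id"]}

def reconcile_with_provider(payment, provider_record):
    # The provider's record decides what actually happened.
    if provider_record is None:
        payment["status"] = "FAILED"      # provably not authorized: safe to re-attempt
    else:
        payment["status"] = "AUTHORIZED"
        payment["authorization_ref"] = provider_record["ref"]
    return payment

p = {"id": "pay-1", "status": "REQUESTED"}
handle_authorization_timeout(p)
reconcile_with_provider(p, {"ref": "auth-99"})
```

Note that blind retry never appears: the only loop here is the inquiry, which is read-only against the provider and therefore safe to repeat.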
Similarly, if Kafka delivery lag delayed InventoryReserved, Ordering did not block the customer indefinitely. It acknowledged order receipt and exposed process state through the UI. Customers saw “Order received, confirming stock.” Honest systems build trust.
The team also implemented daily and near-real-time reconciliation:
- orders accepted but not paid,
- paid but not reserved,
- reserved but not fulfilled,
- shipped but not invoiced,
- refunded without release of inventory hold.
This uncovered defects no circuit breaker ever could.
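At its core, each of those reconciliation checks is a set difference between two services' views of the same invariant. A toy version of the "paid but not reserved" check, with hypothetical data shapes, looks like this:

```python
# Each service's exported view; in practice these come from snapshots,
# event streams, or read replicas, compared over an agreed time window.
orders = {"o-1": "PAID", "o-2": "PAID", "o-3": "ACCEPTED"}
reservations = {"o-1"}  # inventory service's view of held stock

def find_drift(orders, reservations):
    """Paid orders with no matching reservation: candidates for
    automated re-drive or a manual exception queue."""
    return sorted(
        oid for oid, status in orders.items()
        if status == "PAID" and oid not in reservations
    )

exceptions = find_drift(orders, reservations)
```

The hard part in production is not this query but the surrounding policy: comparison windows wide enough to ignore in-flight work, and a defined repair action for each class of drift.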
The results
Availability improved, but that was not the most interesting result. The more important change was semantic clarity. Order intake was separated from fulfillment completion. Payment ambiguity was explicitly modeled. Inventory and fulfillment stopped depending on synchronous chains they did not control. Operations gained re-drive and reconciliation tools. Peak events became manageable because overload in one bounded context no longer translated into panic everywhere else.
This is the enterprise lesson: resilience is less about fancy middleware and more about honest process boundaries.
Operational Considerations
A resilience architecture that exists only in diagrams is decoration. Operations makes it real.
Observability by layer
Instrument each resilience layer differently:
- retries and timeouts at client level,
- circuit breaker states and rejection counts,
- queue/topic lag and dead-letter volumes,
- saga timeout counts,
- reconciliation exceptions and repair success rates.
If all you have is a generic “error rate,” you will diagnose distributed failure by astrology.
Policy management
Retry budgets, breaker thresholds, and timeout defaults should be centrally governed but locally overrideable with discipline. A platform team can provide paved-road libraries. Product teams should still declare semantics: safe to retry, fallback allowed, maximum staleness tolerated.
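One way to make that declaration concrete is a per-dependency policy record that product teams fill in and the paved-road client enforces. The field names and example policies below are assumptions, not a standard schema; the value is forcing the semantic questions to be answered explicitly.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DependencyPolicy:
    """Semantics the owning team must declare; the platform enforces them."""
    name: str
    timeout_ms: int
    max_retries: int
    safe_to_retry: bool           # only idempotent operations may be retried
    fallback_allowed: bool
    max_staleness_s: Optional[int]  # None = staleness never acceptable

POLICIES = {
    "pricing-browse": DependencyPolicy(
        "pricing-browse", timeout_ms=300, max_retries=2,
        safe_to_retry=True, fallback_allowed=True, max_staleness_s=600),
    "payment-capture": DependencyPolicy(
        "payment-capture", timeout_ms=5000, max_retries=0,
        safe_to_retry=False, fallback_allowed=False, max_staleness_s=None),
}

def may_retry(policy_name):
    p = POLICIES[policy_name]
    return p.safe_to_retry and p.max_retries > 0
```

A client library that refuses to retry unless `safe_to_retry` is declared turns "we forgot to think about idempotency" from a silent default into a visible decision.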
Idempotency as a first-class concern
Commands that can be replayed must carry stable identities. Consumers must persist deduplication state where necessary. This is particularly important in Kafka consumers and payment integrations. Idempotency is one of those things everyone claims to have until an incident proves they meant “usually.”
Capacity isolation
Bulkheads are not optional in shared environments. Separate thread pools, connection pools, consumer groups, and rate limits by traffic class. A flood of promotional traffic should not starve settlement processing.
Reconciliation operations
Reconciliation needs product ownership, not just scripts. Define:
- source of truth for each invariant,
- comparison windows,
- tolerance thresholds,
- automated repair actions,
- manual case management paths,
- auditability of corrections.
This is where enterprises either become mature or remain optimistic.
Governance and domain contracts
In a DDD-oriented estate, domain events are contracts. Version them. Document semantics, not just schemas. OrderPlaced should mean a precise business fact, not “some service emitted a thing because code ran.”
Tradeoffs
Let us be candid: resilience layers add complexity.
They introduce more moving parts, more states, more operational policy, and more opportunities for teams to misunderstand semantics. Circuit breakers can hide outages behind graceful degradation while data quietly diverges. Reconciliation jobs can normalize broken upstream behavior and reduce pressure to fix root causes. Kafka improves decoupling but complicates tracing and consistency reasoning. Sagas avoid distributed transactions but create process state that must be managed for the long haul.
There is also a cultural tradeoff. Teams must accept that some business flows are eventually consistent and that “accepted” does not mean “completed.” That is an easy sentence to say and a hard organization to build around.
The biggest tradeoff is between simplicity of execution and robustness under failure. A synchronous, linear call chain is easy to understand on a whiteboard. It is also a brittle way to run a critical enterprise process across many independently deployed services. Layered resilience is messier. It wins because production is messy.
Failure Modes
Bad resilience design has recurring failure modes.
Retry storms
Multiple layers retrying the same failing request create geometric load amplification. Under stress, this can take down dependencies that might otherwise have recovered.
Duplicate side effects
Retries without idempotency lead to double charges, duplicate shipments, repeated emails, and phantom reservations. This is not a corner case. It is the default outcome of naive retry.
Silent data divergence
Systems “recover” operationally while business state drifts. Orders appear complete in one service and pending in another. Without reconciliation, these errors age into finance, compliance, and customer support issues.
Dead-letter graveyards
Poison messages accumulate in DLQs or dead-letter topics with no ownership or replay process. The architecture has preserved failure, not handled it.
Overeager circuit breaking
Poorly tuned breakers can open too quickly during brief latency spikes, causing self-inflicted outages and unnecessary fallback behavior.
Semantic fallbacks
Returning cached or default responses where correctness matters creates subtle corruption. The worst failures are the ones that look like success.
Orphaned workflows
Sagas and long-running processes can get stuck between states if timeout and compensation logic is incomplete. Human intervention paths must be designed, not wished into existence.
When Not To Use
This style of resilience architecture is not universal medicine.
Do not use the full stack when:
- the system is small and can remain a well-structured monolith,
- the domain does not justify distributed ownership,
- the team lacks operational maturity,
- consistency requirements are strict enough that a simpler centralized transaction model is better,
- traffic volumes and failure costs do not warrant the complexity,
- the business cannot tolerate eventual consistency but also cannot invest in explicit workflow and reconciliation handling.
There is no prize for building Kafka-backed, circuit-broken, saga-managed microservices to run a modest internal workflow that a modular monolith could handle elegantly.
Likewise, if the core issue is poor domain boundaries, resilience patterns will not save you. A badly decomposed system with a heroic retry stack is still badly decomposed.
Related Patterns
This resilience stack works alongside several related patterns:
- Timeouts and Retry with Exponential Backoff/Jitter
Basic transport controls for transient faults.
- Circuit Breaker
Prevents repeated calls to unhealthy dependencies.
- Bulkhead
Isolates capacity so one failure does not consume all resources.
- Rate Limiting and Load Shedding
Protects the system under overload.
- Transactional Outbox / Inbox
Avoids dual writes and supports reliable event handling.
- Idempotent Consumer
Critical in Kafka and other at-least-once messaging systems.
- Saga / Process Manager
Coordinates long-running business workflows across bounded contexts.
- Compensation
Reverses or offsets prior actions where possible.
- Anti-Corruption Layer
Protects domain language during migration from legacy systems.
- Strangler Fig Pattern
Incrementally replaces legacy functionality.
- CQRS and Materialized Views
Useful where read models need independent scaling or composition.
- Reconciliation and Ledger Patterns
Restore business truth when distributed operations drift.
The trap is to treat these as collectibles. They are tools. Start from domain semantics, then choose.
Summary
Microservice resilience is not a feature toggle. It is a layered discipline.
At the bottom, transport mechanisms like timeouts and retries handle transient communication noise. Above that, circuit breakers, bulkheads, and rate limiting prevent cascading collapse. Messaging patterns such as outbox and idempotent consumers make Kafka-based workflows dependable. Workflow patterns like sagas and compensations manage long-running business processes. And above all of them sits the layer many teams neglect until pain forces it into existence: reconciliation, the practice of restoring business truth when distributed autonomy leaves the system in disagreement.
That final layer is the mark of an enterprise architecture that has seen production.
Domain-driven design provides the compass. Bounded contexts define where consistency is local, where contracts are explicit, and where eventual consistency is acceptable. Migration should be progressive, usually through strangler patterns and anti-corruption layers, with outbox publication and reconciliation introduced early. The goal is not to eliminate failure. It is to contain technical failure, preserve semantic integrity, and make operational recovery deliberate rather than heroic.
If there is one line to keep, keep this one: retries are for uncertainty in transport; reconciliation is for uncertainty in business reality.
That is the stack. And in real enterprises, reality is the layer that always wins.
Frequently Asked Questions
What is a service mesh?
A service mesh is an infrastructure layer managing service-to-service communication. It provides mutual TLS, load balancing, circuit breaking, retries, and observability without each service implementing these capabilities. Istio and Linkerd are common implementations.
How do you document microservices architecture for governance?
Use ArchiMate Application Cooperation diagrams for the service landscape, UML Component diagrams for internal structure, UML Sequence diagrams for key flows, and UML Deployment diagrams for Kubernetes topology. All views can coexist in Sparx EA with full traceability.
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.