Architecture for Partial Failure in Distributed Systems

Distributed systems do not fail like neat machines in textbooks. They fail like cities in a storm. One neighborhood goes dark, traffic lights keep blinking in another district, the hospital generator kicks in, and somewhere a coffee shop still takes cash and pretends nothing happened. That is the real shape of enterprise software under stress: uneven, inconvenient, and deeply tied to business consequences.

This is why partial failure matters more than failure in the abstract. In a monolith, a crash is a crash. In a distributed system, the more common and more dangerous condition is that some things work, some things don’t, and the system keeps going anyway. Orders can be accepted while payments are delayed. Inventory can be reserved while shipping labels fail. A customer can see “confirmed” while three backend services are still arguing about what confirmed actually means.

Architects who ignore this usually end up with systems that are technically distributed but operationally brittle. They build microservices, Kafka backbones, event streams, API gateways, and cloud-native runtime controls—then discover that they merely moved the old coupling into the network. The hard part was never splitting code into services. The hard part was deciding which business capabilities must fail together, which must fail apart, and how the enterprise recovers when they do not agree.

That is where architecture for partial failure earns its keep.

This article lays out a practical approach: failure isolation rooted in domain-driven design, asynchronous boundaries where business semantics allow it, reconciliation where certainty is impossible, and a migration path that does not require betting the company on a grand rewrite. I am opinionated here on purpose. Many systems are over-synchronized, under-observed, and naively optimistic about consistency. The cure is not more infrastructure. The cure is better boundaries.

Context

Modern enterprises rarely start greenfield. They accumulate systems. Core platforms, SaaS products, packaged applications, bespoke services, old databases that still matter, and message brokers nobody dares switch off. The resulting landscape is less “architecture” and more “negotiated truce.”

Then comes the business pressure: scale faster, release independently, integrate acquisitions, support digital channels, expose APIs to partners, reduce operational risk. Microservices and event-driven architecture become attractive because they promise decoupling and independent change. Kafka enters the story because once you have more than a handful of bounded contexts exchanging events, a durable log becomes more useful than a garden of brittle point-to-point integrations.

But distribution changes the nature of failure. Network calls time out. Consumers lag. A partitioned service continues to process stale data. A dependency slows down and drags the caller with it. Events arrive twice. Or late. Or out of order. None of these are exceptional in the true sense. They are normal weather.

So the architectural question is not “How do we prevent failure?” You will not. The better question is “How do we ensure failure stays local, recoverable, and understandable in business terms?”

That last phrase matters. Technical isolation is not enough. The architecture has to reflect domain semantics. A payment authorization timeout is not the same as an order validation error. A delayed shipment notice is not the same as a lost cancellation command. If you model them all as generic service failures, your operations team will inherit a haunted house.

Problem

Partial failure appears when one part of a distributed workflow succeeds and another does not, while the system continues running. This is the ordinary case in a distributed enterprise.

Take an order flow. The customer submits an order. The ordering service validates it. Payment service authorizes the card. Inventory service reserves stock. Pricing records the final price. Fulfillment starts downstream processing. Notification service sends confirmation.

In a tightly coupled design, this often becomes a chain of synchronous calls. It looks simple on a whiteboard. In production, it behaves like wet cardboard. A slowdown in payment affects order throughput. Inventory retries create duplicate reservations. Notifications fail and support calls spike because customers assume the order failed. One bad dependency becomes a line of falling dominoes.

The root problem is not simply technical latency. It is misplaced certainty. Architects often design as if the system can know the final state of a distributed business process immediately and atomically. Most of the time, it cannot. Trying to force that illusion creates larger blast radii, deeper runtime coupling, and more expensive incidents.

The challenge is to preserve business correctness without demanding global agreement at the moment of action.

Forces

Several forces pull against each other here.

Business continuity versus immediate consistency.

The business often prefers degraded operation to total outage. A retailer would rather accept orders for later reconciliation than shut down checkout during a downstream failure. But some domains—securities trading, for example—cannot tolerate the same looseness. Not every inconsistency is acceptable.

Autonomy versus coordination.

Services should own their data and behavior. Yet many business transactions span multiple capabilities. Excessive coordination destroys service autonomy. Too little coordination creates chaos.

User experience versus honest state.

Users want fast confirmation. Operations wants truthful status. The architecture has to support states like accepted, pending payment, reservation failed, under review. If the domain language is weak, engineers fake certainty with misleading labels.

Throughput versus recovery complexity.

Asynchronous messaging, Kafka topics, retries, dead-letter queues, and background reconciliation improve resilience and scale. They also create delayed failure, duplicate processing, and operational complexity. You trade immediate pain for managed pain.

Migration speed versus architectural integrity.

Few enterprises can redesign end-to-end. You need a progressive strangler approach: isolate the next useful boundary, move behavior incrementally, and keep the old estate alive long enough to retire it safely.

These forces are not bugs in the architecture process. They are the work.

Solution

The solution is to design for failure isolation by domain boundary, then use asynchronous coordination and explicit reconciliation where business semantics permit.

There are a few core ideas.

First, define bounded contexts carefully. Domain-driven design is not decoration here; it is the map of your failure landscape. If Ordering, Payment, Inventory, and Fulfillment are separate bounded contexts, each should own its own state transitions and invariants. What must remain consistent inside a context can often be strongly consistent. What crosses contexts should usually be treated as a business conversation, not a single atomic transaction.

Second, distinguish between command success and business completion. “Order accepted” is not the same thing as “order fulfilled.” A command can be accepted into the system, persisted, and published as an event even while downstream outcomes remain unresolved. This sounds obvious. Many systems still fail to model it.
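One way to make that distinction concrete is an explicit state model. Here is a minimal Python sketch; the state names and legal transitions are illustrative, not prescriptive, and a real system would derive them from the domain:

```python
from enum import Enum, auto

class OrderState(Enum):
    """Business states an order can occupy; ACCEPTED is not FULFILLED."""
    SUBMITTED = auto()   # customer intent captured
    ACCEPTED = auto()    # locally validated and persisted; event published
    SECURED = auto()     # payment authorized and inventory reserved
    EXCEPTION = auto()   # needs automated recovery or manual review
    CONFIRMED = auto()   # fulfillment released

# Legal transitions make "accepted but unresolved" an explicit, inspectable state.
TRANSITIONS = {
    OrderState.SUBMITTED: {OrderState.ACCEPTED, OrderState.EXCEPTION},
    OrderState.ACCEPTED:  {OrderState.SECURED, OrderState.EXCEPTION},
    OrderState.SECURED:   {OrderState.CONFIRMED, OrderState.EXCEPTION},
    OrderState.EXCEPTION: {OrderState.ACCEPTED, OrderState.SECURED},
    OrderState.CONFIRMED: set(),
}

def advance(current: OrderState, target: OrderState) -> OrderState:
    """Reject transitions the business never defined, instead of faking certainty."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

The point is that ACCEPTED is a legitimate resting state, not a transient implementation detail to be hidden from users and operators.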

Third, use asynchronous messaging—Kafka is a strong fit in many enterprises—to decouple runtime dependencies between contexts. Kafka is not magic, but it is very good at absorbing unevenness. Producers can continue writing events while consumers recover. Multiple consumers can react independently. Replay supports recovery and new downstream capabilities. The log becomes part integration backbone, part operational truth source.
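The "absorbing unevenness" property is easy to see in miniature. The sketch below uses an in-memory list as a stand-in for a single Kafka topic partition; real clients add batching, multiple partitions, and durable offset commits, but the core offset model is the same: each consumer tracks its own position, so a slow consumer never blocks the producer or its peers, and rewinding an offset gives you replay.

```python
from dataclasses import dataclass, field

@dataclass
class Log:
    """Toy stand-in for a Kafka topic partition: an append-only record list."""
    records: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.records.append(event)
        return len(self.records) - 1          # offset of the new record

@dataclass
class Consumer:
    """Each consumer tracks its own offset independently."""
    offset: int = 0

    def poll(self, log: Log, max_records: int = 10) -> list:
        batch = log.records[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def rewind(self, offset: int = 0) -> None:
        self.offset = offset                  # replay from an earlier point

log = Log()
for i in range(5):
    log.append({"type": "OrderAccepted", "orderId": i})

fast, slow = Consumer(), Consumer()
fast.poll(log, max_records=5)   # fully caught up
slow.poll(log, max_records=2)   # lagging, but the producer is unaffected
```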

Fourth, embrace reconciliation as a first-class capability. Reconciliation is not an apology for bad architecture. It is the architecture. If you have independent services, eventual consistency, retries, and external dependencies, you need ways to detect mismatches, re-drive workflows, compensate side effects, and explain state to the business.

Finally, build explicit degradation paths. Decide what happens when payment is slow, when inventory is unavailable, when notifications are down, when Kafka consumers lag, when a third-party endpoint is returning success but not actually processing requests. Partial failure architecture is as much about controlled disappointment as it is about resilience.

Architecture

The architectural shape I recommend is straightforward but disciplined:

  • Bounded contexts aligned to business capabilities
  • Local ACID transactions inside each context
  • Domain events published after local commit
  • Kafka topics for inter-context communication
  • Sagas or process managers for long-running business workflows
  • Idempotent consumers
  • Reconciliation and compensating actions
  • Explicit business states visible to users and operators
  • Circuit breakers, timeouts, bulkheads, and backpressure at synchronous edges

Here is the high-level view.

Diagram 1: High-level architecture.

Notice what is absent: a big central transaction manager and a long chain of synchronous calls. That omission is the point. If Ordering must synchronously wait on Payment, Inventory, Fulfillment, and Notification before it can say anything useful, then Ordering does not really own order acceptance. It merely brokers a distributed lock disguised as business logic.

A better model is that Ordering owns the lifecycle of the order aggregate and emits an OrderAccepted event once it has validated and persisted the initial business fact. Downstream contexts then perform their work according to their own rules. The order progresses through states as events arrive.

This leads naturally to a saga-style orchestration or choreography model. In practice, most enterprises use a hybrid. Pure choreography can become hard to understand at scale; pure orchestration can become a central brain that knows too much. The right answer is often to keep local decisions in each domain service and add process managers for cross-cutting workflows that require visibility and timeout handling.

Diagram 2: Architecture for partial failure in distributed systems.

Now the important DDD point: these events must carry domain meaning, not technical noise. InventoryReserved is better than ReservationServiceCallSucceeded. PaymentAuthorizationTimedOut is better than ErrorCode=504. Domain semantics shape recovery. If the event language is technical, the business process becomes unreadable and reconciliation becomes guesswork.

Domain semantics and invariants

A distributed system needs a precise answer to two questions:

  1. What must be true immediately?
  2. What may become true later?

Inside Ordering, you might insist that an order cannot exist without at least one line item, a customer identity, and a pricing snapshot. Those are local invariants. Keep them transactional.

Across contexts, you usually loosen timing while preserving traceability. Payment authorization may happen later. Inventory reservation may happen later. Fraud review may block completion. Shipment creation definitely happens later. So the aggregate state must reflect pending and exceptional conditions honestly.

This is where many teams go wrong: they let technical architecture choose the state model. It should be the reverse. If the domain says an order can be accepted but not yet secured, then the architecture should support that state explicitly. If the domain says overselling is unacceptable, then inventory reservation semantics need stronger guarantees, perhaps even synchronous reservation inside a narrow high-value path. Architecture follows business risk.
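The split between local invariants and deferred cross-context facts can be sketched in a few lines. Class and field names here are hypothetical; the shape is what matters: invariants fail fast at construction, while cross-context outcomes start as honest pending states rather than fabricated certainty.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineItem:
    sku: str
    quantity: int

@dataclass
class Order:
    """Ordering aggregate: local invariants are enforced at construction;
    cross-context facts (payment, reservation) begin as honest 'pending'."""
    customer_id: str
    items: list
    pricing_snapshot: dict
    payment_status: str = "PENDING"       # resolved later by the Payment context
    reservation_status: str = "PENDING"   # resolved later by the Inventory context

    def __post_init__(self):
        # Local invariants: fail fast, inside one transactional boundary.
        if not self.customer_id:
            raise ValueError("order requires a customer identity")
        if not self.items:
            raise ValueError("order requires at least one line item")
        if not self.pricing_snapshot:
            raise ValueError("order requires a pricing snapshot")
```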

Failure isolation zones

Design services and infrastructure so one failing area does not consume the whole platform.

  • Bulkheads separate thread pools, connection pools, consumer groups, and compute quotas.
  • Timeouts prevent waiting forever.
  • Circuit breakers stop repeated calls into a failing dependency.
  • Queues and Kafka topics absorb temporary mismatch in processing rates.
  • Backpressure prevents downstream slowness from overwhelming producers.
  • Rate limits contain bad clients and accidental storms.
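To illustrate the synchronous-edge mechanics, here is a deliberately minimal circuit breaker. Production libraries add proper half-open probing, metrics, and thread safety; this sketch only shows the core idea: stop hammering a failing dependency and fail fast instead.

```python
import time

class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, rejects calls
    until a cool-down elapses, then allows one trial call (half-open)."""
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock                   # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None            # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip the breaker
            raise
        self.failures = 0                    # success resets the count
        return result
```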

Failure isolation is not only runtime mechanics. Data isolation matters too. Shared databases are partial failure anti-patterns because they couple services at the deepest layer. If a “separate” service must lock or query another service’s tables, your outage map just became much wider.

Migration Strategy

Nobody gets to this architecture by decree. Enterprises get there by careful strangling.

The right migration strategy is progressive, domain-led, and brutally practical.

Start by identifying a business capability with clear boundaries and painful coupling. Order notifications, payment processing adapters, or inventory availability are common first candidates. Then separate that capability behind an API or event boundary while the legacy system still runs the rest. This is not a purity exercise. It is controlled extraction.

A typical strangler path looks like this:

  1. Expose domain events from the legacy core. Even if the core remains monolithic, publish key business events such as OrderAccepted, PaymentCaptured, ShipmentDispatched. Use outbox patterns to avoid dual-write inconsistency.

  2. Build new downstream services as event consumers. Start with capabilities that are naturally asynchronous and low risk, such as notifications, analytics, customer communications, or status projection.

  3. Introduce a canonical workflow state outside the monolith where needed. Often a process manager or state projection helps expose a more resilient operating model without changing the whole transactional heart on day one.

  4. Peel away one decision at a time. Move inventory reservation or payment orchestration into new bounded contexts only when the event contract and reconciliation processes are mature enough.

  5. Retire synchronous dependencies gradually. Replace “legacy core calls adapter synchronously” with “core publishes event, new service reacts, status flows back.”

  6. Keep reconciliation in place throughout. During migration, inconsistency risk is higher because some logic lives in the old world and some in the new.
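The outbox pattern mentioned in the first step can be sketched with an in-memory SQLite database standing in for the legacy core's transactional store. Table and function names are illustrative; the essential move is that the state change and the event record commit in one local transaction, so neither can exist without the other.

```python
import json
import sqlite3

# In-memory SQLite stands in for the legacy core's transactional database.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def accept_order(order_id: str) -> None:
    """Order row and outbox row commit in ONE local transaction: no window
    exists where the state changed but the event was lost, or vice versa."""
    with db:  # single atomic transaction
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'ACCEPTED')",
                   (order_id,))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders", json.dumps({"type": "OrderAccepted",
                                          "orderId": order_id})))

def relay_outbox(publish) -> int:
    """A separate relay pushes unpublished rows to the broker. If it crashes
    mid-way, unpublished rows remain and are retried on the next run."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))   # may deliver more than once
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
    return len(rows)
```

Note that the relay gives at-least-once delivery: a crash between publish and the UPDATE re-sends the event on the next run, which is exactly why idempotent consumers matter downstream.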

Here is a simple strangler view.

Diagram 3: Strangler migration view.

A word on migration reasoning: do not begin with the most coupled, business-critical transactional core unless the current situation is already intolerable. Start where event contracts can stabilize, where user-visible value appears quickly, and where operational learning is affordable. Partial failure architecture is not just code change; it is organizational retraining. Teams must learn to operate with delayed outcomes, retries, and reconciliation dashboards.

Enterprise Example

Consider a large retailer running e-commerce across several countries. The original platform is a classic layered monolith: checkout, payment, stock allocation, promotions, shipment booking, and customer notifications all live in one application, backed by a large relational database. As traffic grew, the team split out services around payment gateways, inventory, and fulfillment. They also introduced Kafka as the event backbone.

On paper, the system became “microservices.” In practice, checkout still synchronously called payment and inventory before confirming the order. During holiday peaks, a third-party payment provider introduced intermittent latency. Checkout threads piled up. Circuit breakers opened too late. Inventory reservation retries flooded the database. Customer support saw duplicate orders, phantom stock shortages, and payment captures without visible order confirmation.

The fix was not to tune thread pools harder. The fix was to change the semantic contract.

The retailer redefined the order lifecycle:

  • OrderSubmitted meant customer intent captured.
  • OrderAccepted meant local validation complete and order recorded.
  • OrderSecured meant payment authorized and inventory reserved.
  • OrderException meant manual review or automated recovery required.
  • OrderConfirmed meant fulfillment released.

Ordering was changed to commit locally and publish OrderAccepted. Payment and Inventory consumed that event independently through Kafka. A lightweight process manager correlated outcomes and advanced the order state. The customer UI was updated to show “We’ve received your order and are confirming payment and stock” rather than pretending certainty. Notifications became state-driven rather than request-driven.
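A process manager of this kind can be very small. The sketch below uses the event and state names from the lifecycle above (everything else is hypothetical): it correlates Payment and Inventory outcomes per order and advances the order state accordingly.

```python
class OrderProcessManager:
    """Correlates Payment and Inventory outcomes per order and advances
    the order through PENDING_SECURITY -> ORDER_SECURED / ORDER_EXCEPTION."""
    def __init__(self):
        self.facts = {}    # order_id -> {"payment": bool|None, "inventory": bool|None}
        self.status = {}   # order_id -> current lifecycle state

    def handle(self, event: dict) -> str:
        oid = event["orderId"]
        facts = self.facts.setdefault(oid, {"payment": None, "inventory": None})
        self.status.setdefault(oid, "PENDING_SECURITY")

        # Domain events, not technical noise: the names carry business meaning.
        if event["type"] == "PaymentAuthorized":
            facts["payment"] = True
        elif event["type"] == "PaymentAuthorizationTimedOut":
            facts["payment"] = False
        elif event["type"] == "InventoryReserved":
            facts["inventory"] = True
        elif event["type"] == "InventoryReservationFailed":
            facts["inventory"] = False

        if facts["payment"] and facts["inventory"]:
            self.status[oid] = "ORDER_SECURED"
        elif False in facts.values():
            self.status[oid] = "ORDER_EXCEPTION"   # recovery or manual review
        return self.status[oid]
```

A real process manager would also persist its state, detect timeouts, and emit its own events; this only shows the correlation core.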

What happened?

  • Checkout stayed available during payment provider slowness.
  • Orders accumulated in PENDING_SECURITY rather than causing front-end outages.
  • Reconciliation jobs re-drove failed payment attempts or released stale inventory reservations.
  • Support could see exact state and reason codes.
  • The blast radius of payment outages shrank from “checkout failure” to “delayed confirmation.”

There were tradeoffs. Some customers no longer got instant final confirmation. The business had to accept a small pool of pending orders. Product teams had to improve messaging. Finance demanded tighter reconciliation between authorization, capture, and cancellation. But the enterprise moved from brittle certainty to resilient honesty. That is a good trade in most commerce platforms.

Operational Considerations

Partial failure architecture lives or dies in operations.

Observability

You need end-to-end traceability across commands, events, and state transitions. Correlation IDs are not optional. Every order, payment attempt, reservation, and saga instance needs a stable business identifier. Logs without business keys are just noise with timestamps.

Metrics should include:

  • event lag by consumer group
  • retry rates
  • dead-letter volume
  • state transition latency
  • reconciliation backlog
  • percentage of orders in pending/exception states
  • compensation frequency
  • duplicate detection rate

Dashboards should be organized by business flow, not just service health. “Payment service CPU” matters less than “orders pending payment beyond SLA.”

Reconciliation

Reconciliation deserves its own section because it is the safety net and often the neglected one.

A good reconciliation capability can:

  • compare intended versus actual business state
  • detect missing events
  • identify stuck sagas
  • re-drive idempotent processing
  • trigger compensation
  • raise human review tickets for irreducible ambiguity
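A reconciliation pass over two stores might look like the sketch below, where plain dicts stand in for the Ordering and Payment stores and the classification rules are illustrative: authorized-but-not-progressed orders get re-driven, SLA-breaching pending orders get flagged, and orphan authorizations go to human review.

```python
from datetime import datetime, timedelta, timezone

def reconcile(orders: dict, payments: dict, now=None,
              sla=timedelta(minutes=30)) -> list:
    """Compare intended vs actual business state across two stores and
    classify each mismatch for re-drive, escalation, or human review."""
    now = now or datetime.now(timezone.utc)
    findings = []
    for oid, order in orders.items():
        pay = payments.get(oid)
        if order["status"] == "PENDING_SECURITY" and pay and pay["status"] == "AUTHORIZED":
            findings.append((oid, "REDRIVE",
                             "payment authorized but order not progressed"))
        elif order["status"] == "PENDING_SECURITY" and now - order["updated_at"] > sla:
            findings.append((oid, "STUCK",
                             "pending beyond SLA; raise review ticket"))
    for oid, pay in payments.items():
        if oid not in orders and pay["status"] == "AUTHORIZED":
            findings.append((oid, "REVIEW",
                             "authorization with no known order"))
    return findings
```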

Typical examples:

  • payment authorized but no order progression event
  • inventory reserved but payment failed and reservation not released
  • shipment created twice
  • external PSP returned timeout but later processed authorization
  • Kafka consumer processed event but failed before writing acknowledgment

Reconciliation often needs multiple data sources: service stores, Kafka offsets, third-party reports, and operational snapshots. It is ugly work. It is also real enterprise architecture.

Data contracts and schema evolution

Events live longer than APIs. If you use Kafka, plan for schema versioning, compatibility rules, and contract governance. A broken event contract can create a distributed partial failure faster than any network issue.
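A conservative contract check makes the idea concrete. This sketch allows only additive changes with defaults; real schema registries implement richer per-mode compatibility rules (backward, forward, full), so treat this purely as an illustration of the governance principle.

```python
def safe_evolution(old_fields: dict, new_fields: dict) -> list:
    """Conservative contract check: a new event version may only ADD fields,
    and any added field must carry a default, so both old and new consumers
    keep working. Returns a list of violations (empty means safe)."""
    violations = []
    for name in old_fields:
        if name not in new_fields:
            violations.append(f"field removed: {name}")      # breaks old readers
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            violations.append(f"new field without default: {name}")
    return violations
```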

Idempotency

Retries are inevitable, so consumers must be idempotent. That usually means business keys, deduplication stores, natural idempotent commands where possible, and side-effect protections. “At least once” delivery without idempotent handling is just a duplicate generation service.
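The dedup idea in miniature, with an in-memory set standing in for a durable dedup store (key structure and names are illustrative): a stable business-level key turns at-least-once delivery into effectively-once side effects.

```python
processed = set()   # in production: a durable dedup store keyed by business ID
charges = []        # stand-in for the real side effect (e.g. a card charge)

def handle_payment_event(event: dict) -> bool:
    """Idempotent consumer: skip the side effect for duplicate deliveries,
    identified by a business key rather than a broker-level message ID."""
    key = (event["orderId"], event["attempt"])   # business-level dedup key
    if key in processed:
        return False          # duplicate delivery: side effect already done
    charges.append({"orderId": event["orderId"], "amount": event["amount"]})
    processed.add(key)
    return True
```

In a real consumer, the dedup record and the side effect should commit together (or the side effect itself should be naturally idempotent), otherwise a crash between the two reintroduces duplicates.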

Human operations

Some failures are not auto-resolvable. Build tools for support, operations, and finance to inspect state, replay safely, and understand outcomes in business language. If your architecture requires shell access and Kafka command-line tools during incidents, you have built a system for engineers, not an enterprise.

Tradeoffs

This style of architecture is strong medicine. It helps a lot, but it comes with costs.

Pros

  • Better resilience under dependency failure
  • Smaller failure blast radius
  • Independent scaling of services
  • Better support for long-running workflows
  • Improved auditability through event logs
  • Easier integration with new consumers and downstream capabilities

Cons

  • Higher design complexity
  • More operational moving parts
  • Eventual consistency and user-facing intermediate states
  • Need for reconciliation and compensation logic
  • Harder debugging when event flows sprawl
  • Greater demand for domain clarity

The biggest tradeoff is psychological. Many organizations are comfortable with synchronous request/response because it feels definitive. Asynchronous workflows force the business to acknowledge uncertainty. That is uncomfortable, but it is often more truthful.

Failure Modes

Let us be concrete. Partial failure architecture has its own failure modes.

Poison events

A malformed or semantically unexpected event repeatedly crashes consumers. If not isolated, lag grows and downstream state stalls.

Replay damage

Reprocessing old events without proper idempotency duplicates side effects—emails, charges, reservations, shipments.

Orphaned workflows

A saga waits forever because one event never arrives or correlation fails.

Silent divergence

Two services think different things happened, and nobody notices until financial close or a customer complaint.

Broker dependency concentration

Kafka improves decoupling between services, but it also becomes critical infrastructure. If operated poorly, it turns into a shared point of pain.

Compensation failure

Reversing a business action is often harder than performing it. Refund systems, reservation release, and legal audit constraints can all interfere.

Domain model erosion

Teams stop using meaningful domain events and drift into technical event spam. Once that happens, nobody can reason about business recovery.

These are not arguments against the approach. They are reminders that resilience patterns merely shift where discipline is required.

When Not To Use

Do not use this style everywhere.

If you have a small, cohesive application with a single team, modest scale, and no real need for independent deployment, a well-structured monolith is usually better. Monoliths avoid a great deal of distributed failure entirely. That is not backwardness; that is restraint.

Do not force asynchronous eventual consistency into domains that require immediate, global correctness and cannot tolerate ambiguity without severe business or regulatory consequences. Some financial ledgers, safety systems, and control platforms need different transaction and consistency models.

Do not introduce Kafka and a fleet of microservices simply because the organization wants modern architecture optics. If your teams cannot operate event-driven systems, manage schema governance, or support reconciliation, you will trade simple outages for sophisticated confusion.

And do not split domains before you understand them. Domain-driven design is a precondition here, not a finishing touch.

Related Patterns

Several patterns commonly sit alongside partial failure architecture:

  • Saga pattern for coordinating long-running workflows
  • Outbox pattern for reliable event publication from transactional systems
  • Inbox/deduplication pattern for idempotent consumers
  • CQRS projections for read models and operational views
  • Circuit breaker for synchronous dependency protection
  • Bulkhead isolation for resource containment
  • Dead-letter queue/topic for failed message handling
  • Compensating transactions for business rollback
  • Strangler Fig pattern for incremental migration from monoliths
  • Anti-corruption layer for legacy integration and domain protection

The important thing is not to collect patterns like badges. Patterns are useful only when they reinforce a coherent model of the domain and its failure boundaries.

Summary

Architecture for partial failure begins with a hard truth: in distributed systems, “working” is usually uneven. Some parts succeed, some lag, some fail, and the enterprise must continue operating without lying to itself.

The right response is not heroic infrastructure alone. It is better boundaries. Use domain-driven design to decide where invariants truly belong. Keep transactions local. Let bounded contexts communicate through meaningful events, often over Kafka when scale and integration demands justify it. Model business states honestly. Build reconciliation into the heart of the design. Migrate progressively with a strangler strategy, not a grand rewrite fantasy.

Most of all, accept that resilience is not the absence of failure. It is the ability to contain failure, explain it, and recover in business terms.

That is the architecture game in the real world. Not preventing every storm. Building a city that can keep going when one district floods.

Frequently Asked Questions

What is a service mesh?

A service mesh is an infrastructure layer managing service-to-service communication. It provides mutual TLS, load balancing, circuit breaking, retries, and observability without each service implementing these capabilities. Istio and Linkerd are common implementations.

How do you document microservices architecture for governance?

Use ArchiMate Application Cooperation diagrams for the service landscape, UML Component diagrams for internal structure, UML Sequence diagrams for key flows, and UML Deployment diagrams for Kubernetes topology. All views can coexist in Sparx EA with full traceability.

What is the difference between choreography and orchestration in microservices?

Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.