Async Request-Response Patterns in Distributed Systems

Distributed systems have a habit of punishing wishful thinking.

The first lie teams tell themselves is that a request is a conversation. It isn’t. In a single process, maybe. In a monolith, often. But once work crosses service boundaries, networks, queues, teams, and failure domains, a request becomes a promise wrapped in uncertainty. The caller wants an answer now. The system can only honestly say: I’ve accepted your intent; the rest depends on time, load, downstream health, and whether reality cooperates.

That is where async request-response patterns earn their keep.

Most organizations arrive here the same way. A synchronous API that looked clean in early diagrams starts buckling under real-world conditions: long-running workflows, third-party dependencies, burst traffic, expensive computations, batch integrations, compliance checks, document generation, payment settlement, inventory reservation, claims processing. Suddenly the neat little POST /do-the-thing endpoint is sitting on a timeout cliff. Clients retry. Duplicate work appears. Operators get paged. Business users complain that the system is “slow,” when what they really mean is that the architecture is pretending fast certainty in a world of slow uncertainty.

The answer is not simply “make it async.” That slogan is too cheap. The real work is choosing the right contract between caller and provider: callback, polling, event-driven completion, or some hybrid. Each option carries different semantics, operational costs, security implications, and failure modes. Each says something different about who owns state, who drives the conversation, and where reconciliation lives when things drift.

This is architecture, not plumbing. The pattern you choose shapes your domain model, your integration contracts, and your migration path. Get it wrong and you create a distributed guessing game. Get it right and you create a system that speaks honestly about time.

Context

Async request-response exists because business work often outlasts HTTP patience.

A customer submits a mortgage application. A claim enters fraud review. A payment instruction goes through sanctions screening. A warehouse receives an order that triggers stock checks, shipment planning, label generation, and carrier booking. None of these are truly “request in, response out” in the narrow RPC sense. They are business processes with waiting, branching, retries, compensations, and external dependencies.

Yet many systems still expose them through synchronous APIs because synchronous interactions are easier to explain, test, and demo. Developers like immediate answers. Product teams like simple flows. API gateways like crisp status codes. But the domain does not care about our aesthetic preference for immediacy.

In domain-driven design terms, this is the moment where a bounded context must state its real promise. Is it promising an outcome now? Or is it promising to accept a command and eventually publish the result of work within its own consistency boundary?

That distinction matters.

If the domain operation is “create transfer request,” the right immediate response may be acceptance with a tracking identifier. If the domain operation is “validate account format,” synchronous may still be right because the work is local, deterministic, and cheap. One of the recurring mistakes in enterprise integration is collapsing these two into the same interface style, as if all business actions deserve the same temporal contract.

They do not.

Problem

The core problem is simple to state and hard to solve cleanly: how do you let a caller initiate long-running or uncertain work and still give them a reliable way to learn the outcome?

A naive synchronous design creates familiar pain:

  • request timeouts
  • thread and connection exhaustion
  • retries that duplicate commands
  • poor user experience under load
  • tight coupling between caller latency and downstream processing
  • fragile orchestration across services
  • false assumptions about atomicity

The opposite naive design is just as bad: “fire and forget.” The caller submits work into a black hole and has no trustworthy way to know what happened. Support teams then invent spreadsheets, database queries, and operator backchannels to answer basic questions like: Did we receive it? Is it processing? Did it fail? Can I retry safely?

That is not architecture. That is institutionalized uncertainty.

So we need a pattern that preserves a few essential truths:

  1. Acceptance is not completion.
  2. The caller needs a durable correlation mechanism.
  3. State transitions must be observable.
  4. Duplicates and retries are inevitable.
  5. Reconciliation is part of the design, not an afterthought.

The choice between callback, polling, and event-driven completion is really a choice about how these truths are exposed.

Forces

Architecture lives in forces, not ideals. Async request-response patterns are shaped by competing pressures.

User experience versus system honesty

Users want immediate feedback. Systems dealing with real workflows often cannot give final answers immediately. Good design separates submission acknowledgement from final business outcome.

Domain semantics

This is where DDD helps. A command such as SubmitClaim is not the same as an event such as ClaimSubmitted, and neither is the same as a query like GetClaimStatus. If your async pattern blurs those semantics, consumers start depending on implementation accidents.

Reliability and idempotency

Clients will retry. Gateways will retry. Message brokers will redeliver. Humans will click twice. If the system cannot handle duplicate requests or duplicate completion notifications, it will eventually corrupt business state.

Ownership of state

Who keeps the source of truth for status? The provider? The caller? A workflow engine? An integration hub? Polling tends to centralize status ownership in the provider. Callback pushes status outward. Event-driven approaches spread awareness across consumers, often beautifully, sometimes dangerously.

Security and trust boundaries

Callbacks sound elegant until you discover that exposing webhook endpoints through enterprise firewalls, API management, partner networks, mutual TLS, secret rotation, and replay protection is not a small detail. Polling may be less glamorous, but sometimes it survives governance better.

Load profile

Polling can become a tax on the platform if thousands of clients ask “are we there yet?” every few seconds. Callbacks can reduce chatter but create fan-out pressure and delivery obligations. Event-driven completion scales well internally, but external consumers may not be event-native.

Operational visibility

A good async pattern gives support teams a narrative: received, accepted, processing, waiting on KYC, completed, failed validation, expired, cancelled. A bad one gives them timestamps and confusion.

Regulatory and audit requirements

In financial services, healthcare, telecom, and logistics, the system must often prove not just that it completed an operation, but how it moved through states. Async patterns must support audit trails, immutable events, and reconciliation.

Solution

There are three primary patterns worth discussing: polling, callback, and event-driven completion. In practice, many enterprises use combinations.

The common shape is this:

  1. Client submits a command.
  2. Provider validates enough to accept or reject immediately.
  3. If accepted, provider returns a correlation identifier and initial status.
  4. Work continues asynchronously.
  5. Completion is communicated via a polling endpoint, a callback, an event, or a combination of these.
  6. Final consistency is checked through reconciliation.

A sober architect insists on one more thing: model the operation as a resource or process instance, not merely an HTTP trick. The client is not waiting for a magical response; it is interacting with a business process that has identity and lifecycle.

A good API says:

  • POST /payment-requests
  • returns 202 Accepted
  • includes paymentRequestId
  • status starts as RECEIVED or PROCESSING
  • later moves to COMPLETED, REJECTED, FAILED, EXPIRED, and so on

That is better than hiding reality behind a socket that stays open until somebody gives up.
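To make that contract concrete, here is a framework-free Python sketch of acceptance versus completion. The function names, the in-memory store, and the response shapes are illustrative assumptions, not a prescribed implementation; a real service would sit behind an HTTP layer and a durable process store.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# In-memory stand-in for a durable process store (illustrative only).
_requests: dict = {}

@dataclass
class PaymentRequest:
    payment_request_id: str
    status: str = "RECEIVED"
    accepted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def submit_payment_request(command: dict) -> dict:
    """POST /payment-requests: accept or reject immediately, never block on work."""
    # Lightweight validation only: enough to accept or reject now.
    if "amount" not in command:
        return {"http_status": 400, "error": "amount is required"}
    request = PaymentRequest(payment_request_id=str(uuid.uuid4()))
    _requests[request.payment_request_id] = request
    # 202 Accepted: acceptance is not completion.
    return {
        "http_status": 202,
        "paymentRequestId": request.payment_request_id,
        "status": request.status,
    }

def get_payment_request_status(payment_request_id: str) -> dict:
    """GET /payment-requests/{id}: the authoritative lifecycle view."""
    request = _requests.get(payment_request_id)
    if request is None:
        return {"http_status": 404}
    return {"http_status": 200, "status": request.status}
```

The point of the sketch is the split: submission creates a process resource with identity, and status is a separate, always-answerable query.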

Pattern 1: Polling

Polling is the workhorse. It is not fashionable, but enterprise architecture is not a fashion show.

The caller submits a request, receives an identifier, then periodically asks for status.

Polling works well when:

  • clients cannot host callbacks
  • trust boundaries are awkward
  • status queries are cheap
  • completion latency is moderate
  • the provider wants tight control over what status means

Polling also plays nicely with DDD because it separates command submission from process query. The bounded context that owns the workflow remains the authority for lifecycle state.

But polling has a cost. At scale, it can produce absurd amounts of read traffic. If the status endpoint is backed by transactional tables and every mobile app checks every two seconds, you have accidentally built a denial-of-service feature.

The cure is usually straightforward:

  • backoff guidance
  • Retry-After headers
  • cache-friendly status resources
  • web/mobile UX that tolerates delayed refresh
  • event streaming internally, polling externally
  • terminal state detection so clients stop asking
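From the client side, most of these cures fit in one small loop. The following Python sketch combines exponential backoff, Retry-After honoring, and terminal state detection; the function names and defaults are assumptions, not a prescribed client library.

```python
import time

TERMINAL_STATES = {"COMPLETED", "REJECTED", "FAILED", "EXPIRED"}

def poll_until_terminal(fetch_status, max_attempts=10,
                        base_delay=0.01, max_delay=1.0, sleep=time.sleep):
    """Poll a status endpoint until the process reaches a terminal state.

    `fetch_status` returns (status, retry_after_seconds), where
    retry_after_seconds mirrors a Retry-After header (or None if absent).
    """
    delay = base_delay
    for _ in range(max_attempts):
        status, retry_after = fetch_status()
        if status in TERMINAL_STATES:
            return status  # terminal state detection: stop asking
        # Honor server-provided Retry-After, else back off exponentially.
        sleep(retry_after if retry_after is not None else delay)
        delay = min(delay * 2, max_delay)
    raise TimeoutError("request did not reach a terminal state")
```

A server that emits Retry-After guidance plus a client shaped like this avoids the "are we there yet?" traffic tax described above.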

Polling is often the safest first step in a migration because it introduces async semantics without forcing every consumer into webhook infrastructure.

Pattern 2: Callback

In the callback pattern, the caller provides a URL, and the provider pushes completion status to it when work finishes.

This reduces query chatter and can improve responsiveness. It also shifts complexity into outbound delivery, security, retry policy, and endpoint management.

Callback is attractive in partner ecosystems, B2B platforms, and internal service meshes where consumers truly need immediate completion notification.

But here is the line architects eventually learn: a callback is not a free response channel; it is a second distributed system.

You now own:

  • endpoint verification
  • payload signing
  • replay protection
  • duplicate callback handling
  • retries and dead-lettering
  • callback versioning
  • observability across both directions
  • support processes when the client insists “we never got it”

Callbacks also create semantic drift if used carelessly. A provider may think it is sending “completion,” while the consumer treats callback receipt as “business committed.” Those are not always the same thing, especially when the callback itself can fail after the provider has already finalized the outcome.

That is why callback systems still need a query endpoint or reconciliation API. The callback informs. The authoritative status resource confirms.
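Payload signing and replay protection are small enough to sketch. This hedged Python example uses an HMAC over a timestamp plus body, with a freshness window on the consumer side; the scheme is illustrative and not any specific vendor's webhook format.

```python
import hashlib
import hmac

def sign_callback(secret: bytes, payload: bytes, timestamp: int) -> str:
    """Provider side: sign timestamp + payload so the consumer can
    verify authenticity and bound the replay window."""
    message = str(timestamp).encode() + b"." + payload
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify_callback(secret: bytes, payload: bytes, timestamp: int,
                    signature: str, now: int, max_skew: int = 300) -> bool:
    """Consumer side: reject stale deliveries, then do a
    constant-time signature comparison."""
    if abs(now - timestamp) > max_skew:
        return False  # outside the replay window
    expected = sign_callback(secret, payload, timestamp)
    return hmac.compare_digest(expected, signature)
```

Note what the sketch does not solve: delivery retries, dead-lettering, and the consumer treating receipt as business commitment all still need explicit design.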

Pattern 3: Event-driven completion

Inside modern enterprises, especially those using Kafka and microservices, event-driven completion is often the best internal pattern. The requesting service submits a command. The processing context emits domain events as state changes occur. Consumers subscribe and react.

This is powerful because it aligns well with bounded contexts. The owning context processes a command and publishes facts about state transitions:

  • OrderRequested
  • InventoryReserved
  • OrderConfirmed
  • OrderRejected

Consumers can build local read models, trigger downstream processes, and decouple timing. Kafka shines here because it provides durable event streams, replay, partitioning, and broad fan-out.
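A status read model can be as simple as a fold over the per-order event stream. The event names come from the list above; the event-to-status mapping itself is an illustrative assumption.

```python
# Map domain events to the status they imply in the read model.
# The mapping is an illustrative assumption, not a fixed standard.
EVENT_TO_STATUS = {
    "OrderRequested": "PROCESSING",
    "InventoryReserved": "PROCESSING",
    "OrderConfirmed": "COMPLETED",
    "OrderRejected": "REJECTED",
}

def project_status(events) -> str:
    """Fold an ordered event stream into the current lifecycle status.
    Later events win; unknown event types are ignored rather than fatal,
    which makes the projection tolerant of schema additions."""
    status = "RECEIVED"
    for event in events:
        status = EVENT_TO_STATUS.get(event["type"], status)
    return status
```

Because Kafka supports replay, a projection like this can be rebuilt from the beginning of the stream whenever the read model drifts or its schema changes.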

But event-driven completion is not automatically a good external API. Partners and front-end clients usually do not want to subscribe to raw domain events. They want a stable status contract. So the common enterprise move is:

  • events internally
  • polling or callback externally

That is not compromise. That is sensible boundary design.

Architecture

The architecture should center on a process resource, status model, correlation strategy, and reconciliation path.

1. Model the process explicitly

Treat async work as a domain concept with identity. In DDD terms, this may be an aggregate or process manager depending on the complexity.

For example:

  • PaymentRequest
  • LoanApplication
  • DocumentGenerationJob
  • ClaimAssessment

This resource owns state transitions and business invariants around the process, even if the work itself is delegated.

2. Separate command, state, and events

A healthy architecture distinguishes:

  • command: request to perform work
  • state/resource: current lifecycle status
  • event: fact that something happened

That separation prevents consumers from reverse-engineering the system from transport behavior.

3. Use correlation IDs and idempotency keys

These are not optional. They are table stakes.

  • Correlation ID links logs, events, callbacks, and support workflows.
  • Idempotency key protects command submission against retries.

Without both, async systems become impossible to reason about under failure.
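Idempotent command handling can be sketched as check-then-record against a key store. In a real system the store must be durable and the check atomic (typically a unique constraint in the database), which this in-memory sketch deliberately glosses over.

```python
# In-memory stand-in for a durable idempotency store (illustrative only;
# a real system needs atomicity, e.g. a unique constraint on the key).
_results_by_key: dict = {}

def handle_command(idempotency_key: str, command: dict, execute) -> dict:
    """Execute a command at most once per idempotency key.
    A retry with the same key replays the original result instead of
    performing the business action again."""
    if idempotency_key in _results_by_key:
        return _results_by_key[idempotency_key]  # duplicate: replay result
    result = execute(command)
    _results_by_key[idempotency_key] = result
    return result
```

The correlation ID is a separate concern: it travels with the work through logs, events, and callbacks, while the idempotency key guards only the submission boundary.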

4. Define a lifecycle that means something

Statuses should reflect domain semantics, not infrastructure trivia.

Good:

  • RECEIVED
  • VALIDATING
  • UNDER_REVIEW
  • APPROVED
  • REJECTED
  • COMPLETED

Bad:

  • SENT_TO_KAFKA
  • THREAD_STARTED
  • RETRYING_HTTP

Infrastructure details belong in telemetry, not customer-facing status models.
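Encoding the lifecycle as an explicit transition table keeps illegal status changes out of the process store. The specific transitions below are an illustrative assumption layered over the statuses above.

```python
# Allowed lifecycle transitions; the table itself is an assumption
# for illustration, not a universal claims/payments lifecycle.
TRANSITIONS = {
    "RECEIVED": {"VALIDATING"},
    "VALIDATING": {"UNDER_REVIEW", "REJECTED", "COMPLETED"},
    "UNDER_REVIEW": {"APPROVED", "REJECTED"},
    "APPROVED": {"COMPLETED"},
    "REJECTED": set(),   # terminal
    "COMPLETED": set(),  # terminal
}

def transition(current: str, target: str) -> str:
    """Enforce that status changes follow the domain lifecycle,
    rather than letting any writer set any status."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```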

5. Support reconciliation

Sooner or later, state will drift between participants. A callback was missed. A downstream consumer lagged. A mobile app cached stale status. A partner claims a request vanished.

So design reconciliation from the start:

  • list requests by date/status
  • replay completion notifications
  • query authoritative state by correlation ID
  • compare event log with current projections
  • produce exception queues for manual review

Reconciliation is where mature systems prove they are serious. Everything else is optimism.
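The comparison step of a reconciliation job can be sketched as a diff between event-derived status and the current projection, emitting one exception record per mismatch for manual review. The field names and missing-side markers are illustrative.

```python
def reconcile(event_log_status: dict, projection_status: dict) -> list:
    """Compare status derived from the event log (authoritative)
    against the current projection; return exception records for
    every mismatch or one-sided entry."""
    exceptions = []
    all_ids = set(event_log_status) | set(projection_status)
    for request_id in sorted(all_ids):
        expected = event_log_status.get(request_id, "MISSING_FROM_EVENTS")
        actual = projection_status.get(request_id, "MISSING_FROM_PROJECTION")
        if expected != actual:
            exceptions.append(
                {"id": request_id, "expected": expected, "actual": actual}
            )
    return exceptions
```

In production this runs on a schedule, feeds an exception queue or dashboard, and is itself audited; the diff is the easy part, the operational follow-through is the discipline.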

Migration Strategy

Most enterprises are not building from a blank sheet. They are dragging a synchronous estate toward a more honest architecture while keeping the business running. This is where the strangler pattern matters.

Do not replace the old synchronous interface overnight. Wrap it, route around it, and progressively move work into an explicit async process model.

Phase 1: Introduce acceptance semantics

Keep the existing endpoint if necessary, but let it return a process ID and 202 Accepted for long-running work. Behind the scenes, create a process record and move execution onto a queue or Kafka topic.

Phase 2: Add status resource

Expose GET /requests/{id} or GET /requests/{id}/status. This creates a stable contract without forcing all consumers into callback support.

Phase 3: Emit domain events internally

Once the process lifecycle is explicit, publish meaningful events from the owning context. Downstream services can stop depending on synchronous chains and begin consuming state changes asynchronously.

Phase 4: Offer callback as an opt-in channel

For consumers that need push notifications, add callbacks. Do not make them the only source of truth. Keep the status endpoint authoritative.

Phase 5: Strangle synchronous orchestration

Remove long synchronous dependency chains. Replace direct waits with command acceptance and event-driven progression. Preserve legacy clients through an anti-corruption layer if necessary.

A migration diagram makes this clearer:

The key migration reasoning is this: introduce truthful temporal boundaries before you chase technical purity. Many teams waste time debating brokers and frameworks while their APIs still lie about completion.

Enterprise Example

Consider a global insurer processing first-notice-of-loss claims.

In the old world, the claims portal submitted a synchronous CreateClaim request. The backend attempted, in one call chain, to validate policy coverage, check duplicate claims, invoke fraud scoring, create a claim record, generate a case number, and notify downstream systems. It worked in demos. In production, fraud scoring sometimes took 20 seconds. The document service spiked under month-end load. Third-party policy validation occasionally timed out. Front-end users hit refresh, re-submitted, and created duplicates. Contact-center staff had no reliable view of in-flight requests.

The architecture was pretending that a business process was an RPC.

The insurer redesigned around an explicit ClaimSubmission process within the Claims bounded context.

  • POST /claim-submissions accepted the customer’s command.
  • The API performed lightweight validation only: schema, authorization, mandatory fields, idempotency.
  • It returned 202 Accepted with claimSubmissionId.
  • The Claims context published ClaimSubmissionReceived to Kafka.
  • Fraud, policy validation, duplicate detection, and document preparation subscribed or were orchestrated by the claims process manager.
  • A status projection fed GET /claim-submissions/{id} for the portal and call center.
  • Large broker partners could register signed callback endpoints for completion notifications.
  • Reconciliation jobs compared Kafka events, claims records, and notification logs each night, with an exception dashboard for mismatches.

What changed was not just performance. The domain became clearer.

The portal no longer assumed “create claim” meant “claim fully established in all systems.” It meant “claim submission accepted and tracked.” The final business outcome became explicit: approved, rejected, pending manual review, awaiting document verification.

That semantic clarity reduced duplicate creation, improved operator visibility, and made support conversations sane. It also exposed a truth the business had avoided: some claims genuinely require time. Good architecture did not eliminate that fact. It made it visible and manageable.

Operational Considerations

Async patterns move pressure around rather than making it disappear. Operations matter.

Status storage and projections

Status reads should be cheap. If the query endpoint requires joining half the enterprise, you have failed. Use a process store or projection optimized for status retrieval.

Observability

Track:

  • accepted requests
  • processing latency
  • completion latency
  • callback delivery success rate
  • polling volume
  • dead-letter counts
  • reconciliation exceptions

Distributed tracing helps, but business correlation IDs matter more to support teams than fancy spans.

Backpressure

Queues and Kafka absorb bursts, but only to a point. Monitor lag, partition hotspots, consumer throughput, and retry storms. Async does not excuse capacity planning.

SLA design

There are two SLAs now:

  • acknowledgment SLA: how quickly the request is accepted
  • completion SLA: how quickly the business outcome is finalized

Conflating them is a recurring executive mistake.

Security

For callbacks:

  • mutual TLS or signed payloads
  • nonce/timestamp validation
  • secret rotation
  • endpoint allowlists
  • replay defense

For polling:

  • authorization to process-specific resources
  • careful exposure of status details
  • rate limiting

Data retention

How long do you keep process records, event streams, callback logs, and reconciliation reports? The answer is usually “longer than the first estimate.”

Tradeoffs

There is no universally best pattern. There is only a best fit for a context.

Polling

Pros

  • simple mental model
  • easier through firewalls and governance
  • provider remains source of truth
  • easier consumer onboarding

Cons

  • inefficient at scale
  • delayed completion awareness
  • can create noisy traffic patterns
  • client retry behavior can be sloppy

Callback

Pros

  • responsive completion notification
  • less status polling traffic
  • useful for B2B integrations and platform ecosystems

Cons

  • much harder security model
  • duplicate and failed delivery handling required
  • consumer endpoint management is painful
  • still needs reconciliation and status query

Event-driven completion

Pros

  • excellent for internal microservices
  • scalable fan-out
  • decouples timing
  • supports replay and projections
  • aligns well with DDD and Kafka ecosystems

Cons

  • harder for external consumers
  • schema evolution discipline required
  • eventual consistency can confuse teams
  • debugging across many subscribers can be ugly

A useful rule of thumb:

  • external API: start with polling, add callback if justified
  • internal platform: use events, expose query models where needed

Failure Modes

This is where architectural maturity shows. Async systems fail in rich and creative ways.

Duplicate command submission

The client retries after a timeout, but the original was accepted. Without idempotency keys, duplicate business actions occur.

Lost callback perception

The provider sent the callback, but the consumer did not process it. Each side blames the other. Without delivery logs, signatures, and an authoritative status endpoint, support has no way out.

Stale status projections

The workflow completed, but the read model lags behind Kafka or message processing. Clients see PROCESSING while back-office teams see COMPLETED.

Poison messages and retry storms

Malformed payloads or persistent downstream failures can lead to endless retries, queue buildup, and cascading delays.

Semantic overexposure

Teams leak internal workflow steps into public status contracts. Consumers begin depending on them. Later refactoring becomes a breaking change.

Reconciliation gaps

Events and current state diverge with no mechanism to detect or repair mismatch. This is common in systems that celebrate event-driven design but forget accounting discipline.

Orphaned processes

The process record is created, but no worker picks up the task due to routing or deployment errors. Without watchdogs and timeout escalation, requests simply disappear into limbo.
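A watchdog for orphaned processes is mostly a query over non-terminal, stale records. A sketch, where the threshold and field names are assumptions:

```python
from datetime import datetime, timedelta

TERMINAL_STATES = {"COMPLETED", "REJECTED", "FAILED", "EXPIRED"}

def find_orphans(processes, now, max_age=timedelta(minutes=30)):
    """Flag non-terminal processes that have not progressed within
    max_age. Each process is a dict with 'id', 'status', 'updated_at';
    flagged IDs would feed retry, escalation, or an exception queue."""
    return [
        p["id"]
        for p in processes
        if p["status"] not in TERMINAL_STATES
        and now - p["updated_at"] > max_age
    ]
```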

A practical architect plans for all of these. Hope is not a control.

When Not To Use

Not every operation should be asynchronous.

Do not use async request-response when:

  • the work is fast, local, and deterministic
  • the user genuinely needs immediate validation and completion
  • added complexity outweighs the latency benefit
  • the domain does not tolerate ambiguous interim states
  • your team lacks operational maturity for retries, idempotency, and reconciliation

A synchronous API for checking account balance, validating a coupon code, or fetching customer preferences is perfectly fine. Turning everything into “event-driven async” is how teams create unnecessary complexity and call it modernization.

Also be cautious in domains where users interpret acceptance as commitment. If the business cannot tolerate that distinction, you may need a different design or stronger reservation semantics before acknowledging the request.

Async is not an upgrade badge. It is a response to specific forces.

Related Patterns

Several adjacent patterns often appear alongside async request-response.

Saga / process manager

Useful when the request triggers a long-running, multi-step workflow across bounded contexts with compensations.

Outbox pattern

Essential when publishing events from a service that also updates a database. It reduces the “state changed but event not published” gap.
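A minimal outbox can be sketched with one local database transaction plus a relay. The schema and function names are illustrative assumptions; the essential property is that the state change and the pending event are written atomically.

```python
import json
import sqlite3

# One local database: the state change and the outgoing event are
# written in the same transaction, then a relay publishes pending rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def confirm_order(order_id: str) -> None:
    with conn:  # single transaction: state + event, or neither
        conn.execute("INSERT OR REPLACE INTO orders VALUES (?, ?)",
                     (order_id, "CONFIRMED"))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"type": "OrderConfirmed",
                                  "id": order_id}),))

def relay(publish) -> None:
    """Publish pending outbox rows (e.g. to a broker), then mark them.
    Delivery is at-least-once: a crash between publish and mark means
    a redelivery, so consumers must deduplicate."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))
        conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?",
                     (seq,))
    conn.commit()
```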

CQRS

Often paired with async flows: commands initiate work, query models expose status and outcomes.

Anti-corruption layer

Helpful during migration when legacy synchronous systems must coexist with new async semantics.

Dead-letter queue

A practical necessity for messages or callbacks that cannot be processed after retries.

Request acknowledgment resource

A focused pattern where the initial response creates a trackable process resource rather than pretending to return final business data.

Summary

Async request-response patterns are really about architectural honesty. They admit that in distributed systems, time matters, uncertainty matters, and completion is often a journey rather than a moment.

Polling, callback, and event-driven completion each solve the same problem from a different angle. Polling favors control and simplicity. Callback favors push-based responsiveness. Events favor internal decoupling and scalable propagation. Mature enterprises often combine them: event-driven inside, polling as the default external contract, callback as an optional acceleration lane.

The hard part is not choosing a transport. The hard part is defining domain semantics that survive failure:

  • what does acceptance mean?
  • who owns status?
  • how do clients correlate outcomes?
  • what happens on retries and duplicates?
  • how is reconciliation performed when systems disagree?

Those questions belong at the center of the design.

If you remember one thing, make it this: an async API is not a delayed synchronous API. It is a different promise. Model it explicitly, migrate toward it progressively, and build reconciliation into the foundation. Because in distributed systems, the truth does not arrive all at once. Your architecture should know how to live with that.

Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.