Most distributed systems do not fail because teams chose REST instead of Kafka, or gRPC instead of JSON over HTTP. They fail because architects pretend the business runs in one tempo.
It doesn’t.
Some moments demand an immediate answer. “Did my payment go through?” “Can I reserve this seat?” “Is this customer eligible?” These are conversational moments. The system is being spoken to, and it must reply like a competent adult.
Other moments are different. “Ship the order.” “Notify the warehouse.” “Update loyalty points.” “Generate invoices for the month.” These are consequential moments. They trigger a chain of work that crosses teams, systems, regulations, and time. The business doesn’t need all of that in 200 milliseconds. It needs it to happen reliably, observably, and eventually.
That is the heart of hybrid sync/async workflows in microservices: one business process, two tempos.
Architects who force everything into synchronous request-response build brittle systems with excellent demos and terrible resilience. Architects who force everything into event-driven asynchronous choreography build systems that are wonderfully decoupled and maddeningly hard to reason about. The real enterprise answer is usually messier and better: use synchronous interaction where the domain needs immediate intent confirmation, and asynchronous flow where the domain needs durable progress across boundaries.
This is not a compromise. It is design.
And like all good design, it starts with domain semantics, not transport protocols.
Context
Microservices introduced a useful discipline: split systems around business capabilities, not technical layers. Domain-driven design sharpened that discipline by giving us bounded contexts, aggregate boundaries, domain events, and a language that matches the business rather than the database schema.
But once teams decompose a monolith, a new problem appears. Business workflows don’t respect service boundaries.
A customer places an order. That seemingly simple act touches pricing, inventory, payment, fraud, order management, shipment, notifications, finance, and sometimes loyalty or subscription systems. In a monolith, this often lived inside one transaction, one codebase, and one illusion of consistency. In microservices, the same business flow becomes distributed. The transaction is gone. Time enters the model. Partial completion becomes normal.
That is where hybrid workflows matter.
A hybrid workflow combines:
- Synchronous interactions for immediate validation, command acceptance, and user-facing decisions
- Asynchronous interactions for propagation, side effects, long-running steps, and cross-context integration
In practice, the user might submit an order through an API and receive an immediate “Order Accepted” while downstream processing unfolds through Kafka topics and service-local state transitions. The trick is deciding where the line goes.
This line should not be drawn by technical fashion. It should be drawn by domain meaning.
If “payment authorization” is a prerequisite for order acceptance, that likely belongs on the synchronous path. If “award loyalty points” is a consequence of successful fulfillment, that is usually asynchronous. If “inventory reservation” must happen before promising a ship date, then perhaps it stays synchronous for some products and asynchronous for others. The domain tells you what is conversational and what is consequential.
That distinction matters more than most architecture diagrams admit.
Problem
Teams often choose one interaction style and apply it everywhere.
The all-sync team builds request chains like this: API Gateway calls Order Service, which calls Inventory, which calls Payment, which calls Fraud, which calls Customer Profile, which calls Promotion, and so on. It looks neat in a slide deck. In production it turns into tail latency, cascading retries, timeout storms, and distributed blame.
The all-async team goes the other way. Every command becomes an event, every service subscribes to something, and now a basic business question—“Why is this order still pending?”—requires forensic analysis across logs, topics, offsets, dead-letter queues, and dashboards. The workflow is decoupled, but the intent is obscured.
Both approaches ignore an uncomfortable truth: enterprise workflows need both immediacy and durability.
The core problem is this:
How do you design microservice workflows that preserve domain semantics for the caller while allowing reliable, scalable, cross-service progress over time?
That problem gets sharper under real enterprise conditions:
- legacy systems that only expose synchronous APIs
- downstream platforms with variable latency
- compliance requirements around audit and reconciliation
- team boundaries across bounded contexts
- eventual consistency that must still be explainable to the business
- customer experience expectations that punish ambiguity
A checkout flow is the common example, but the pattern is broader. Claims processing. Loan origination. Telecom provisioning. Returns management. Employee onboarding. Healthcare referrals. In each case, one business action begins as a conversation and ends as a distributed process.
If you treat the whole thing as one synchronous exchange, you create fragility.
If you treat the whole thing as asynchronous diffusion, you create opacity.
The architecture must do something harder: preserve a coherent business story across both modes.
Forces
Several forces pull the design in different directions.
1. User experience wants immediacy
Users and calling systems need quick feedback. They want to know whether a request was understood, accepted, rejected, or requires more input. This favors synchronous APIs.
But immediate feedback is not the same as complete processing. Confusing those two is one of the oldest mistakes in distributed architecture.
2. Reliability wants decoupling
Long-running and cross-context work should survive retries, node crashes, deployments, and downstream outages. That favors asynchronous messaging with durable brokers such as Kafka.
Synchronous calls are conversations. Async messages are commitments.
3. Domain integrity wants clear boundaries
Domain-driven design reminds us that not all consistency is equal. Invariants within an aggregate usually need strong consistency. Coordination across bounded contexts usually does not. That means some steps must happen “now,” while others can happen “next.”
4. Operations wants observability
A workflow split across sync and async channels can become invisible unless correlation, state transitions, and business milestones are modeled explicitly. Hybrid architectures increase the need for traceability.
5. Governance wants auditability
Enterprises care about who decided what, when, and based on which facts. Event streams help, but only if events are meaningful and state machines are explicit. Random integration events are not an audit strategy.
6. Changeability wants loose coupling
Business workflows evolve. New fraud checks appear. New fulfillment partners are added. New regulations require extra approvals. Async boundaries are often where evolution becomes affordable.
7. Failure wants acknowledgment
The architecture must assume partial failure:
- synchronous timeout but downstream success
- accepted command but failed event publication
- duplicate message delivery
- out-of-order processing
- stale reads during user follow-up
- compensations that fail themselves
A good hybrid design is not one that avoids these problems. It is one that makes them survivable.
Solution
The practical answer is to split the workflow into a synchronous intent phase and an asynchronous completion phase, with explicit domain state connecting them.
Here is the opinionated version:
- Use a synchronous API to capture intent and perform only the checks required to decide whether the command can be accepted.
- Persist business state locally in the owning service.
- Publish durable domain or integration events, typically through the transactional outbox pattern.
- Let downstream services react asynchronously, each within its own bounded context.
- Model progress as state transitions, not as hidden side effects.
- Reconcile periodically because distributed systems always leak.
In other words, the synchronous path says, “We have accepted responsibility.”
The asynchronous path says, “We are now fulfilling that responsibility.”
That distinction is gold.
A common pattern is to let the owning service act as the workflow initiator. It receives a synchronous command such as PlaceOrder, validates local rules, maybe performs one or two critical synchronous checks, stores the order in a state like PENDING_CONFIRMATION or ACCEPTED, and emits an OrderPlaced event. Payment, inventory, fraud, and fulfillment then process in parallel or sequence depending on domain rules. As events arrive, the order state advances.
This is often implemented as orchestration, choreography, or a blend:
- Orchestration when one service or workflow engine owns the process logic
- Choreography when services react to events without a central conductor
- Hybrid when the core business state lives in one service, but surrounding capabilities respond independently
In enterprise settings, pure choreography is overrated. It works until nobody knows who really owns the process. If the workflow has a meaningful domain identity—Order, Claim, Application, Provisioning Request—then that entity should usually have a clear owner. That owner does not need to execute every step, but it should own the business narrative.
A reference hybrid flow
Notice what this design does not do. It does not try to complete shipping before returning to the caller. It also does not reduce the whole experience to “fire an event and hope.” It captures intent synchronously, then moves the heavy lifting into asynchronous progression.
That is the pattern in one sentence.
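The two phases can be sketched in a few lines. This is a minimal in-memory illustration, not a reference implementation: `OrderService`, `place_order`, `on_event`, the `REJECTED` and `COMPLETED` status strings, and the list standing in for a durable outbox are all hypothetical names chosen for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Order:
    order_id: str
    status: str
    milestones: list[str] = field(default_factory=list)


class OrderService:
    """Hypothetical workflow anchor: accepts intent now, completes later."""

    def __init__(self, authorize_payment: Callable[[str, int], bool]):
        # Hard prerequisite, called synchronously during acceptance.
        self._authorize_payment = authorize_payment
        self.orders: dict[str, Order] = {}
        # Stand-in for a durable outbox table (assumption for this sketch).
        self.outbox: list[dict] = []

    def place_order(self, customer_id: str, amount_cents: int) -> Order:
        """Synchronous intent phase: decide only whether the command is accepted."""
        order_id = str(uuid.uuid4())
        accepted = amount_cents > 0 and self._authorize_payment(customer_id, amount_cents)
        order = Order(order_id, "ACCEPTED" if accepted else "REJECTED")
        self.orders[order_id] = order
        if accepted:
            self.outbox.append({"type": "OrderPlaced", "order_id": order_id})
        return order

    def on_event(self, event: dict) -> None:
        """Asynchronous completion phase: milestones arrive later and advance state."""
        order = self.orders[event["order_id"]]
        order.milestones.append(event["type"])
        if event["type"] == "OrderShipped":
            order.status = "COMPLETED"
```

The caller gets an answer from `place_order` as soon as the prerequisites are checked. Shipping, notifications, and the rest only ever reach the order through `on_event`.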
Architecture
A sound hybrid sync/async architecture has a few non-negotiable elements.
Domain ownership and bounded contexts
Start with bounded contexts, not services. Payment authorization belongs to the Payment context. Inventory reservation belongs to Inventory. Order lifecycle belongs to Order Management. Do not centralize business logic just because the workflow spans these contexts.
But do identify a workflow anchor—the domain object that gives the process coherence. In retail that is usually the order. In lending, the application. In insurance, the claim. In telecom, the provisioning request.
That anchor owns the externally visible lifecycle.
State machine over hidden process
Long-running workflows should be modeled as state transitions. If there is no explicit lifecycle, there is no architecture—only optimism.
This matters because domain semantics live in the transitions:
- what does “submitted” mean?
- when is an order “accepted” versus merely “received”?
- can it still be canceled in InFulfillment?
- who is allowed to move it from Failed to Cancelled?
Those are business questions disguised as architecture decisions.
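One way to make the lifecycle explicit is a transition table that code must consult before changing state. The sketch below is illustrative only: the state names echo the ones used in this section, but which transitions are legal is a business decision, and the `ALLOWED` table here is an assumption, not a recommendation.

```python
from enum import Enum


class OrderState(Enum):
    SUBMITTED = "Submitted"
    ACCEPTED = "Accepted"
    IN_FULFILLMENT = "InFulfillment"
    SHIPPED = "Shipped"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Hypothetical policy: in this example an order can no longer be
# cancelled once it is in fulfillment.
ALLOWED = {
    OrderState.SUBMITTED: {OrderState.ACCEPTED, OrderState.FAILED, OrderState.CANCELLED},
    OrderState.ACCEPTED: {OrderState.IN_FULFILLMENT, OrderState.FAILED, OrderState.CANCELLED},
    OrderState.IN_FULFILLMENT: {OrderState.SHIPPED, OrderState.FAILED},
    OrderState.FAILED: {OrderState.CANCELLED},
    OrderState.SHIPPED: set(),
    OrderState.CANCELLED: set(),
}


class IllegalTransition(Exception):
    """Raised when a requested move is not in the transition table."""


def transition(current: OrderState, target: OrderState) -> OrderState:
    """Return the new state, or raise if the business rules forbid the move."""
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

Putting the table in one place means the answer to "can it still be canceled?" is something the business can read and argue about, rather than something scattered across event handlers.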
Transactional outbox and Kafka
If a service updates its database and emits an event, those two actions must not drift apart. Otherwise the classic failure appears: order saved, event lost. Or event published, save rolled back.
This is why the transactional outbox remains one of the most valuable patterns in microservices. Write the domain state and the outbound event record in the same local transaction. A relay process then publishes the event to Kafka. Consumers process idempotently.
Kafka fits well here because it offers durable event streaming, consumer groups, replay, and partitioning. It is not magic, but it is a good backbone for asynchronous workflow progression, especially where multiple downstream consumers need the same business signal.
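A minimal sketch of the outbox idea, using SQLite from the Python standard library so it runs anywhere. The table layout, the function names, and the `publish` callback (standing in for a real Kafka producer) are assumptions made for the example.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def accept_order(conn: sqlite3.Connection, order_id: str) -> None:
    # One local transaction: both rows commit together or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'ACCEPTED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id})),
        )


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending rows in insertion order; return how many were pending.

    A row is marked published only after publish() returns. If publish()
    raises, the exception propagates and the row stays pending for the next run.
    """
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Note the delivery guarantee this gives: if the relay crashes after publishing but before marking the row, the event is published again on the next run. That is at-least-once delivery, which is why consumers need to be idempotent.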
Use domain events carefully:
- OrderPlaced
- PaymentAuthorized
- InventoryReserved
- OrderShipped
Avoid vague technical mush:
- OrderUpdated
- StatusChanged
- EntityProcessed
The event name should carry business meaning.
Synchronous path discipline
The synchronous path should be thin and intentional.
Use it for:
- command acceptance
- local invariant checks
- a small number of hard prerequisites
- immediate user-facing decisions
Do not use it for:
- fan-out to six dependencies
- optional side effects
- expensive enrichment
- everything that “might as well happen now”
A good rule: if the caller needs the answer to continue the conversation, consider sync. If the business needs the work to happen eventually but the caller doesn’t need proof right now, consider async.
Reconciliation as a first-class concern
Reconciliation is where grown-up architecture begins.
No matter how elegant your event design is, there will be mismatches:
- payment authorized but order stuck in submitted
- order canceled but fulfillment already started
- shipment confirmed but notification never sent
- duplicate inventory reservation due to consumer retry
You need scheduled or event-triggered reconciliation that compares system-of-record states and repairs drift. This may involve:
- replaying missed events
- issuing compensating commands
- escalating to manual operations
- rebuilding read models
- fixing orphaned workflow instances
Reconciliation is not an admission of failure. It is the cost of running distributed systems honestly.
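A reconciliation pass can be as simple as comparing two sets and deciding what to do with the difference. The sketch below assumes the caller supplies the accepted orders, the set of order IDs that already have a fulfillment milestone, and two callbacks (`replay` and `escalate`). All names and the 30-minute grace period are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AcceptedOrder:
    order_id: str
    accepted_at: datetime


def find_stuck_orders(accepted, allocated_ids, now, grace=timedelta(minutes=30)):
    """Return accepted orders with no fulfillment milestone after the grace period."""
    return [
        order for order in accepted
        if order.order_id not in allocated_ids and now - order.accepted_at > grace
    ]


def reconcile(accepted, allocated_ids, now, replay, escalate):
    """Try replaying the missing event first; escalate to operations if that fails.

    replay(order_id) should return True when the repair succeeded.
    """
    repaired, escalated = [], []
    for order in find_stuck_orders(accepted, allocated_ids, now):
        if replay(order.order_id):
            repaired.append(order.order_id)
        else:
            escalate(order.order_id)
            escalated.append(order.order_id)
    return repaired, escalated
```

The grace period matters: without it, every order that was accepted a few seconds ago looks stuck, and the repair job starts fighting the normal flow.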
Read models and status queries
Users do not care whether your workflow is sync or async. They care whether the answer is understandable.
That means the owning service should expose a clean status API or query model:
- current state
- last meaningful milestone
- pending actions
- failure reason if known
- timestamps and correlation identifiers where appropriate
Without that, support teams end up reading Kafka offsets to answer customer questions. That is not architecture. That is institutional surrender.
Reference component view
This is the architecture in its simplest enterprise form: one service owns the business narrative, Kafka carries consequential events, downstream bounded contexts act independently, and the anchor service keeps the externally visible state coherent.
Migration Strategy
Most enterprises do not get to start clean. They inherit a monolith, shared database habits, brittle ESB flows, and a few hundred “temporary” integrations old enough to vote.
So the migration to hybrid workflows has to be incremental. This is where the progressive strangler migration earns its keep.
Step 1: Identify the workflow anchor
Pick one end-to-end process with a clear business identity. Order, claim, application, policy change. Create a service that can own the lifecycle, even if parts of the work still happen in the monolith.
Do not begin by extracting generic utilities. Begin where the domain has a story.
Step 2: Keep synchronous ingress stable
Maintain the existing API or channel behavior if possible. Let incoming commands hit the new anchor service, which may still call the monolith for some operations. Preserve user experience first.
This reduces business risk. Migration should change the plumbing before it changes the customer contract.
Step 3: Introduce explicit states
Even if the monolith still does most of the work, have the new service model workflow states explicitly. SUBMITTED, ACCEPTED, FAILED_VALIDATION, IN_PROGRESS, COMPLETED. This creates a domain backbone before the decomposition is finished.
Step 4: Add outbox and event publication
When key milestones occur, publish durable events. At first these may simply mirror monolith outcomes. That is fine. The event stream becomes the seam for future extraction.
Step 5: Strangle one downstream capability at a time
Move a capability from synchronous in-process execution to asynchronous external handling. Inventory is often a good candidate. Notifications are easy but low value. Payment is high value but high risk. Pick according to business leverage and operational maturity.
Step 6: Introduce reconciliation before full decoupling
This is non-negotiable. Before removing old synchronous protections, create comparison jobs and repair flows. Otherwise migration succeeds in architecture review and fails in month-end finance.
Step 7: Retire hidden dependencies
As more steps move to events, remove synchronous coupling aggressively. Hybrid does not mean “keep every old call forever.” It means choose sync intentionally.
A migration truth worth remembering: the first architecture is often uglier than the final one. That is acceptable. Strangler migration is not about elegance on day one. It is about reducing risk while improving structure.
Enterprise Example
Consider a global retailer modernizing order processing across e-commerce, stores, and marketplace channels.
In the legacy world, the order management monolith handled everything in one giant transaction-shaped fantasy. It called pricing, promotion, inventory, payment, tax, and fulfillment adapters synchronously. During peak periods, one slow fraud service could stall checkout. During outages, support teams manually reconciled “ghost orders” where payment was taken but the order never appeared cleanly downstream.
The retailer moved to a hybrid workflow architecture.
What stayed synchronous
For checkout, the customer still needed immediate answers on:
- whether the cart could be submitted
- whether payment authorization succeeded
- whether limited inventory could be reserved for scarce items
So the new Order Service kept a synchronous command path for PlaceOrder. It performed local validation, called Payment Authorization synchronously, and for selected high-demand SKUs called Inventory synchronously to confirm reservation. If those prerequisites passed, the order was persisted as ACCEPTED and the customer received confirmation immediately.
What became asynchronous
Everything else flowed through Kafka:
- tax finalization
- warehouse allocation
- fulfillment initiation
- shipment creation
- customer notifications
- loyalty points
- finance posting
- marketplace partner routing
The Order Service remained the workflow anchor. It consumed milestone events and updated the order lifecycle visible to channels and support tools.
Domain semantics made the difference
The retailer learned a useful lesson: “accepted” did not mean “fully committed in every downstream system.” It meant something more precise:
> We have validated the order, secured payment authorization, confirmed mandatory availability, and accepted responsibility to fulfill or compensate.
That sentence became the semantic contract for the business. Once that was clear, the architecture choices became easier.
Reconciliation saved the program
A warehouse management system sometimes acknowledged OrderAllocated late or not at all due to an adapter issue. Without reconciliation, orders appeared stuck. The team added a reconciliation service that compared accepted orders against fulfillment milestones every 15 minutes, replayed missing events where possible, and raised operational cases otherwise.
That one capability prevented a common enterprise tragedy: a technically modern platform with financially dangerous blind spots.
What improved
- checkout latency dropped because optional downstream calls left the sync path
- order acceptance became more resilient during warehouse and notification outages
- support teams got a coherent status model
- new consumers subscribed to Kafka events without touching checkout
- marketplace integration became significantly easier
What got harder
- eventual consistency required training for product and operations teams
- duplicate events exposed weak idempotency in some downstream services
- state transition design became a genuine business governance topic
- teams had to stop treating Kafka as “just another queue”
That last point matters. Event streams are not plumbing. They are part of the enterprise operating model.
Operational Considerations
Hybrid workflows are operationally richer than simple CRUD services. Plan accordingly.
Correlation and tracing
Every command, event, and state transition needs correlation identifiers. Not because distributed tracing is fashionable, but because support and audit require a timeline. You should be able to answer:
- which request created this workflow?
- which events were emitted?
- which consumers processed them?
- what state transitions occurred?
- what failed and what was retried?
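One common way to make those questions answerable is to wrap every message in an envelope that carries a correlation ID (shared across the whole workflow) and a causation ID (the message that triggered this one). The `Envelope` shape and helper names below are assumptions for illustration, not a standard.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    event_type: str
    payload: dict
    correlation_id: str            # shared by every message in one workflow
    causation_id: Optional[str]    # message_id of the message that caused this one
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def start_workflow(event_type: str, payload: dict) -> Envelope:
    """First message of a workflow: mint a new correlation ID."""
    return Envelope(event_type, payload, correlation_id=str(uuid.uuid4()), causation_id=None)


def caused_by(parent: Envelope, event_type: str, payload: dict) -> Envelope:
    """Follow-up message: inherit the correlation ID and point back at the parent."""
    return Envelope(event_type, payload, parent.correlation_id, parent.message_id)


def timeline(log: list[Envelope], correlation_id: str) -> list[str]:
    """Event types belonging to one workflow, in the order they appear in the log."""
    return [e.event_type for e in log if e.correlation_id == correlation_id]
```

With both IDs present, support can reconstruct the timeline for one order and audit can see which message led to which.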
Idempotency
Kafka consumers will see duplicates eventually. APIs may receive retries. Compensations may be reissued.
So:
- commands should support idempotency keys where clients may retry
- consumers should track processed message identities or enforce natural idempotency
- state transitions should be safe against reprocessing
If your design depends on exactly-once behavior end to end, your design depends on a fairy tale.
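Tracking processed message identities can look like the following sketch. `LoyaltyConsumer` and its fields are hypothetical, and the in-memory set stands in for what would normally be a table updated in the same transaction as the side effect.

```python
class LoyaltyConsumer:
    """Hypothetical consumer that awards points at most once per message."""

    def __init__(self):
        # Assumption: in a real service this would be durable storage,
        # written in the same transaction as the points update.
        self.processed_ids: set[str] = set()
        self.points: dict[str, int] = {}

    def handle(self, message_id: str, customer_id: str, points: int) -> bool:
        """Apply the side effect once; return False for a duplicate delivery."""
        if message_id in self.processed_ids:
            return False  # acknowledge the duplicate without repeating the effect
        self.points[customer_id] = self.points.get(customer_id, 0) + points
        self.processed_ids.add(message_id)
        return True
```

The broker can then redeliver as often as it likes. The second delivery of the same message changes nothing.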
Backpressure and retry policy
Not all failures deserve immediate retry. Some need delay, quarantine, or human intervention. A retry storm is just an outage with extra confidence.
Define:
- retryable vs non-retryable errors
- exponential backoff
- dead-letter or parking topics
- circuit breaking on synchronous dependencies
- rate limiting during downstream degradation
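The first three items in that list can be sketched together. The exception classes, the `park` callback (standing in for a dead-letter or parking topic), and the default delays are assumptions for the example. A production version would likely add jitter to the delays.

```python
import time


class NonRetryableError(Exception):
    """For example a validation failure: retrying cannot help."""


class RetryableError(Exception):
    """For example a timeout or a temporary downstream outage."""


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 30.0):
    """Yield the delay before each retry: exponential growth up to a cap."""
    for attempt in range(max_attempts - 1):
        yield min(cap, base * 2 ** attempt)


def process_with_retry(handler, message, park, max_attempts=5, sleep=time.sleep):
    """Return True if handled; otherwise park the message and return False."""
    delays = backoff_delays(max_attempts)
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except NonRetryableError as exc:
            park(message, f"non-retryable: {exc}")
            return False
        except RetryableError as exc:
            if attempt == max_attempts:
                park(message, f"retries exhausted: {exc}")
                return False
            sleep(next(delays))
```

The `sleep` parameter is injectable so tests can run without waiting. Parking the message instead of retrying forever is what keeps one poison message from blocking everything behind it.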
Schema evolution
Events live longer than HTTP payloads. Version them thoughtfully. Prefer additive changes. Maintain compatibility. Use schema governance. Do not casually rename business fields that downstream finance systems rely on.
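On the consumer side, additive changes are easiest to absorb with a tolerant reader: require the fields that have always existed, default the ones added later, and ignore anything unknown. The field names below (`total_cents`, `channel`) are invented for the example and are not taken from any real schema.

```python
def read_order_placed(raw: dict) -> dict:
    """Tolerant reader for a hypothetical OrderPlaced payload."""
    return {
        # Present since the first version: required, so a missing key raises.
        "order_id": raw["order_id"],
        "total_cents": raw["total_cents"],
        # Added in a later version: optional, with a default for older producers.
        "channel": raw.get("channel", "unknown"),
        # Any other keys in raw are ignored rather than rejected.
    }
```

A consumer written this way keeps working when a producer adds a field, and keeps working when an older producer has not added it yet.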
Read-your-own-write expectations
One hard user experience issue in hybrid systems is the gap between command acceptance and query consistency. If a user places an order and immediately refreshes the page, what should they see?
Options include:
- reading from the workflow anchor’s primary state
- returning the just-accepted status in the command response
- using a short-lived cache of recent writes
- making status semantics explicit: “Accepted, processing continues”
Clarity beats false precision.
Tradeoffs
Hybrid workflows are not free. They are a disciplined compromise.
Benefits
- faster and more resilient user-facing interactions
- better decoupling across bounded contexts
- scalable downstream processing
- clearer ownership of workflow state
- easier evolution of side effects and consumers
- improved recovery through replay and reconciliation
Costs
- more moving parts
- eventual consistency to explain and manage
- harder observability than simple sync APIs
- more sophisticated testing
- governance needed for event contracts and state models
- operational burden around retries, dead letters, and reconciliation
Architectural tradeoff at the center
The central tradeoff is simple:
**Synchronous design optimizes immediate certainty.
Asynchronous design optimizes durable progress.
Hybrid design accepts less of each to get enough of both.**
That is why it works.
Failure Modes
Hybrid workflows fail in specific, predictable ways. Good architects name these early.
1. Accepted but not published
Order state is stored, but the event never reaches Kafka.
Mitigation: transactional outbox, publication monitoring, replay tooling.
2. Published but not processed
The event sits in Kafka, but a consumer is down, lagging, or poison-message blocked.
Mitigation: consumer lag alerts, DLQ strategy, partition ownership monitoring.
3. Duplicate consumption
A retry or rebalance causes the same event to be processed twice.
Mitigation: idempotent consumers, deduplication keys, safe state transitions.
4. Out-of-order events
OrderShipped arrives before FulfillmentStarted, or partitioning doesn’t preserve required ordering.
Mitigation: partition by business key, version state transitions, reject impossible transitions, reconcile later.
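Two of those mitigations can be sketched briefly. The first function shows the idea behind partitioning by business key: a stable hash sends every event for one order to the same partition. It uses CRC32 purely for illustration and is not the hash a Kafka client applies by default. The second part assumes producers stamp each event with a per-order `version` number, which is a hypothetical convention. It holds back events that arrive early rather than rejecting them outright, which is one variation on the mitigation above.

```python
import zlib


def partition_for(order_id: str, num_partitions: int) -> int:
    """Stable hash so all events for one order land on the same partition."""
    return zlib.crc32(order_id.encode("utf-8")) % num_partitions


class OrderTimeline:
    """Applies versioned events in order; defers gaps, ignores stale duplicates."""

    def __init__(self):
        self.version = 0
        self.last_event = None
        self.deferred = {}  # version -> event waiting for its predecessor

    def apply(self, event: dict) -> str:
        version = event["version"]
        if version <= self.version:
            return "ignored"   # already applied: stale or duplicate
        if version > self.version + 1:
            self.deferred[version] = event
            return "deferred"  # arrived early: hold until the gap is filled
        self._advance(event)
        # Drain any deferred events that are now next in line.
        while self.version + 1 in self.deferred:
            self._advance(self.deferred.pop(self.version + 1))
        return "applied"

    def _advance(self, event: dict) -> None:
        self.version = event["version"]
        self.last_event = event["type"]
```

Events that stay deferred for too long are exactly what a reconciliation job should look for.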
5. Sync timeout with ambiguous result
The caller times out on payment authorization, but the payment service later completes.
Mitigation: idempotency, query-after-timeout patterns, compensating checks before retrying.
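The query-after-timeout idea can be shown with a small test double. `FakePaymentGateway` is invented for the example and simulates the awkward case: the authorization succeeds, but the caller sees a timeout. The `authorize` and `lookup` methods and the idempotency key are assumptions about what a payment API might offer, not a description of any real provider.

```python
class PaymentTimeout(Exception):
    """The caller gave up waiting; the outcome on the other side is unknown."""


class FakePaymentGateway:
    """Test double: the first call times out AFTER the authorization was recorded."""

    def __init__(self):
        self.authorizations = {}
        self._fail_next = True

    def authorize(self, idempotency_key: str, amount_cents: int) -> dict:
        # The same key returns the stored result instead of authorizing twice.
        if idempotency_key not in self.authorizations:
            self.authorizations[idempotency_key] = {
                "status": "AUTHORIZED",
                "amount_cents": amount_cents,
            }
        if self._fail_next:
            self._fail_next = False
            raise PaymentTimeout()
        return self.authorizations[idempotency_key]

    def lookup(self, idempotency_key: str):
        return self.authorizations.get(idempotency_key)


def authorize_safely(gateway, idempotency_key: str, amount_cents: int) -> dict:
    """On timeout, ask what happened before deciding whether to retry."""
    try:
        return gateway.authorize(idempotency_key, amount_cents)
    except PaymentTimeout:
        existing = gateway.lookup(idempotency_key)
        if existing is not None:
            return existing
        # Nothing recorded: retrying with the same key cannot double-authorize.
        return gateway.authorize(idempotency_key, amount_cents)
```

The key point is that a timeout is treated as "unknown", not as "failed". The caller checks first and retries only with the same key.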
6. Semantic drift
Teams publish events whose names stay stable while meanings quietly change.
Mitigation: domain event governance, schema review, bounded context ownership.
7. Reconciliation becomes the real system
If too many workflows rely on nightly repair jobs, the architecture is lying.
Mitigation: use reconciliation as safety net, not primary control path.
That last failure mode is common in large enterprises. Once reconciliation carries the business, your “real-time microservices platform” is mostly theater.
When Not To Use
Hybrid workflows are powerful, but not universal.
Do not use this style when:
The process is truly local
If one bounded context can complete the work atomically and quickly, keep it simple. A service talking to itself through Kafka is architecture cosplay.
The domain requires strict immediate consistency across all steps
Some workflows cannot tolerate deferred coordination. Certain financial trades, core ledger posting, or safety-critical control decisions may require stronger consistency models and tighter transactional guarantees.
The organization cannot operate asynchronous systems
If the teams lack monitoring, event governance, idempotency discipline, and operational maturity, hybrid workflows can make things worse. Async without operational literacy is distributed chaos.
The business cannot tolerate semantic ambiguity
If stakeholders insist that “accepted” must mean “everything everywhere completed,” then either change the language or keep more work synchronous. Domain semantics cannot be hand-waved away.
The scale and change rate do not justify the complexity
For modest internal systems with few integrations and stable workflows, a modular monolith is often the better answer. There is no prize for introducing eventual consistency where none is needed.
Sometimes the smartest microservices decision is not to use microservices at all.
Related Patterns
Hybrid workflows often sit alongside several related patterns:
- Saga pattern: coordinates distributed business transactions through choreography or orchestration
- Transactional outbox: ensures local state and outbound events remain consistent
- CQRS: separates command handling from query/read models, useful for workflow status views
- Strangler fig pattern: supports incremental migration from monolith to microservices
- Process manager / orchestrator: centralizes complex long-running workflow logic
- Compensation pattern: reverses or offsets prior actions when downstream steps fail
- Event sourcing: sometimes useful, but not required; often overused where simple state plus events is enough
A useful caution: these patterns work best when serving domain clarity. Used mechanically, they become architecture wallpaper.
Summary
Hybrid sync/async workflows exist because business processes live in more than one time horizon.
Some decisions must be made now. Some consequences should unfold over time. The job of architecture is not to erase that distinction but to model it honestly.
The best designs start with domain-driven design:
- identify bounded contexts
- find the workflow anchor
- define meaningful states
- separate intent acceptance from downstream completion
- publish business events durably
- reconcile the inevitable drift
Use synchronous calls for immediate intent and essential prerequisites. Use Kafka and asynchronous messaging for durable cross-context progress. Keep the externally visible lifecycle coherent. Build reconciliation before you need it. Treat event semantics as part of the domain model, not integration residue.
And above all, remember this:
**A workflow is not synchronous or asynchronous.
A workflow is a business story told across time.**
Good architecture makes that story understandable, reliable, and changeable.
Bad architecture makes it fast on a slide and mysterious in production.
Choose accordingly.
Frequently Asked Questions
What is a service mesh?
A service mesh is an infrastructure layer managing service-to-service communication. It provides mutual TLS, load balancing, circuit breaking, retries, and observability without each service implementing these capabilities. Istio and Linkerd are common implementations.
How do you document microservices architecture for governance?
Use ArchiMate Application Cooperation diagrams for the service landscape, UML Component diagrams for internal structure, UML Sequence diagrams for key flows, and UML Deployment diagrams for Kubernetes topology. All views can coexist in Sparx EA with full traceability.
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.