Distributed systems rarely fail because teams cannot draw the boxes. They fail because nobody can explain who is really in charge when the business process goes sideways.
That is the heart of the saga discussion.
On a whiteboard, both saga orchestration and event choreography look elegant. Services stay autonomous. Databases remain private. Events flow. Compensations clean up the mess. Everyone smiles. Then the first production incident lands at 2:13 a.m., and the elegance evaporates. An order is half-approved, inventory is reserved twice, payment has been captured once, refunded once, then captured again, and support is reading Kafka topics like tea leaves.
This is why the orchestration-versus-choreography choice matters. It is not a stylistic preference. It shapes how the enterprise expresses business intent, how teams reason about failure, how compliance is demonstrated, and how expensive recovery becomes when reality refuses to follow the sequence diagram.
My bias is simple: start from domain semantics, not middleware fashion. A saga is not just a technical pattern for eventual consistency. It is an operational model for long-running business transactions. If you treat it as plumbing, you will build an elegant accident.
This article goes deep on that choice: central orchestrator versus event choreography, where each fits, what breaks, how to migrate, how Kafka changes the texture of the design, and when not to use either approach.
Context
Modern enterprises split systems for good reasons. Teams want independent deployment. Different domains evolve at different speeds. Scalability profiles diverge. Regulatory boundaries matter. Data ownership matters even more.
The monolith solved one problem very well: transaction coordination. A single ACID transaction could update order, payment, inventory, and shipping in one shot. That came at the price of tight coupling, synchronized release trains, and a steadily growing change tax.
Microservices flipped the trade. We gained autonomy and lost easy coordination.
That loss is not theoretical. In almost every serious enterprise landscape, there are business processes that cross multiple bounded contexts:
- Order placement across sales, payment, inventory, pricing, and fulfillment
- Insurance claim handling across intake, fraud, policy, payout, and customer notification
- Loan origination across application, underwriting, compliance, document verification, and funding
- Subscription lifecycle across billing, entitlement, CRM, tax, and analytics
These are not single-step transactions. They are workflows with decisions, timeouts, retries, partial success, and human intervention. In domain-driven design terms, they span multiple aggregates and often multiple bounded contexts. You should expect inconsistency during execution. The real question is how you manage it.
That is where sagas enter.
A saga breaks a long business transaction into local transactions. Each step commits within its service. If a later step fails, the system executes compensating actions to move the business back toward an acceptable state. Not always the original state. That distinction matters. Compensation is a business concept, not a database rollback.
There are two dominant ways to coordinate a saga:
- Orchestration: a central process manager or orchestrator decides what happens next.
- Choreography: services react to events and infer the next step themselves.
Both can work. Both can fail spectacularly.
Problem
The problem sounds simple: coordinate a distributed business process without a distributed transaction.
The real problem is nastier:
- How do you preserve business meaning across autonomous services?
- How do you ensure the process remains understandable as it evolves?
- How do you recover from partial execution and duplicated messages?
- How do you audit intent, not just transport?
- How do you change the process without turning every service into a historian of every other service’s events?
The naive version of event-driven architecture says: “Just publish events and subscribe.” That is like saying urban planning is just roads and cars. Technically true. Operationally useless.
A saga needs more than message delivery. It needs coordination semantics:
- what step started,
- what state was reached,
- what deadline expired,
- what can be compensated,
- what cannot be undone,
- who decides the next action,
- and how exceptions are reconciled.
If these semantics are not explicit, they leak into code across half a dozen services. Then the architecture diagram says “loosely coupled,” while the incident report says “nobody knew where the process logic lived.”
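One way to keep these semantics explicit is to hold them in a single record per saga instance. The sketch below is illustrative only: the `SagaStep` and `SagaInstance` types and all field names are assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class StepStatus(Enum):
    PENDING = "pending"
    STARTED = "started"
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATED = "compensated"


@dataclass
class SagaStep:
    name: str                                 # what step started
    status: StepStatus = StepStatus.PENDING   # what state was reached
    deadline: Optional[datetime] = None       # what deadline applies
    compensation: Optional[str] = None        # offsetting business action; None means it cannot be undone


@dataclass
class SagaInstance:
    saga_id: str
    owner: str                                # who decides the next action
    steps: List[SagaStep] = field(default_factory=list)
    exceptions: List[str] = field(default_factory=list)  # items awaiting reconciliation

    def expired_steps(self, now: datetime) -> List[SagaStep]:
        """Steps that were started and have passed their deadline."""
        return [s for s in self.steps
                if s.status is StepStatus.STARTED
                and s.deadline is not None and s.deadline <= now]

    def irreversible_steps(self) -> List[SagaStep]:
        """Completed steps that have no compensating action."""
        return [s for s in self.steps
                if s.status is StepStatus.COMPLETED and s.compensation is None]
```

Whether this record lives in an orchestrator's state store or is rebuilt from events is exactly the choice the rest of this article discusses.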
Forces
This is not a pattern contest. It is a tension map. The right choice depends on the forces acting on your system.
1. Business process visibility
Some enterprises need a clear, inspectable process model. Regulated industries, customer support-heavy operations, and high-value order flows need to answer basic questions fast:
- Where is this case?
- Why did it stop?
- What happens next?
- Can we intervene manually?
Orchestration makes this easier. Choreography can do it, but usually by rebuilding the process view from events after the fact.
2. Service autonomy
Teams often want services to remain independent and react only to domain events they care about. Choreography fits this instinct. It avoids a central coordinator becoming the brain of the platform.
But autonomy is not free. If every service must understand more and more event history to act correctly, autonomy decays into distributed entanglement.
3. Domain complexity
Simple event chains can thrive with choreography. Complex, branching, policy-heavy workflows usually benefit from orchestration. Once the business process includes timers, approvals, parallel branches, retries, and conditional compensations, “just react to events” becomes a polite way to hide a workflow engine in service code.
4. Auditability and compliance
A central orchestrator creates a natural audit trail: command issued, step completed, compensation triggered, timeout reached. Choreography can produce the same evidence, but with more correlation work and often more ambiguity.
5. Throughput and scalability
Choreography distributes decision-making. That can scale well operationally. Orchestration introduces a central coordination point, which must be designed for availability and horizontal scale.
This is usually solvable. The bigger issue is not raw throughput. It is whether the orchestrator becomes a bottleneck for change.
6. Team topology
If one team owns the cross-domain process end to end, orchestration is often a good fit. If domains are strongly independent and event contracts are stable, choreography can align well with team boundaries.
7. Failure handling maturity
Choreography demands more operational discipline:
- idempotent consumers
- durable event publication
- correlation identifiers
- replay strategy
- dead-letter handling
- reconciliation jobs
- business observability
Without those, choreography is not architecture. It is distributed wishful thinking.
Solution
The decision should start with a blunt question:
Is the business process itself a first-class domain concept?
If yes, lean toward orchestration.
If no, and the interactions are truly domain reactions rather than centrally governed steps, choreography may be the cleaner fit.
Saga orchestration
In orchestration, a central component—often called a process manager, workflow engine, or orchestrator—holds the state of the saga and sends commands to participating services.
The orchestrator knows the intended path:
- Create order
- Reserve inventory
- Authorize payment
- Arrange shipment
- Confirm order
If something fails, it triggers compensations:
- Release inventory
- Void payment authorization
- Cancel shipment request
- Mark order failed
This creates an explicit model of the business process. It also creates a place where policy can live. That is a strength when the process matters.
The biggest misconception is that orchestration destroys service autonomy. It does not have to. A good orchestrator coordinates process flow; it does not own internal domain rules of every service. It says what business action is requested, not how the service must implement it.
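As a rough illustration of the orchestrated flow, here is a minimal in-process sketch. The `OrderSagaOrchestrator` class is hypothetical, the steps are plain callables standing in for commands sent to services, and it assumes compensations themselves never fail, which a real orchestrator cannot assume.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Each step pairs a forward action with its compensating business action.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


@dataclass
class OrderSagaOrchestrator:
    steps: List[Step]
    log: List[str] = field(default_factory=list)  # audit trail of decisions

    def run(self) -> str:
        completed: List[Step] = []
        for step in self.steps:
            name, action, _ = step
            try:
                action()
                self.log.append(f"completed:{name}")
                completed.append(step)
            except Exception as exc:
                self.log.append(f"failed:{name}:{exc}")
                # Compensate the steps that already committed, newest first.
                for done_name, _, compensate in reversed(completed):
                    compensate()
                    self.log.append(f"compensated:{done_name}")
                return "FAILED"
        return "CONFIRMED"


def fail_shipment() -> None:
    raise RuntimeError("carrier unavailable")


def noop() -> None:
    pass


saga = OrderSagaOrchestrator(steps=[
    ("ReserveInventory", noop, noop),
    ("AuthorizePayment", noop, noop),
    ("ArrangeShipment", fail_shipment, noop),
])
outcome = saga.run()
```

Note that the orchestrator only names the business action it requests. How inventory is reserved or a payment is voided stays inside the owning service.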
Saga choreography
In choreography, services publish domain events and other services react to them. No central coordinator tells participants what to do next.
An order service emits OrderPlaced. Inventory reserves stock and emits InventoryReserved. Payment reacts to that and emits PaymentAuthorized. Shipping reacts and emits ShipmentCreated. The process emerges from event collaboration.
This can be elegant when each step is a natural domain reaction. It is less elegant when every participant must know which combination of prior events implies that it should act. Then you get hidden workflows spread across consumers.
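The event chain above can be sketched with an in-memory bus. This is a toy stand-in for a broker, not Kafka client code: the `InMemoryBus` class is an assumption for illustration, and each lambda plays the role of one service.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Event = Tuple[str, dict]  # (event name, payload)


class InMemoryBus:
    """Toy broker: delivers events to subscribers in publish order."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.queue: Deque[Event] = deque()
        self.history: List[str] = []

    def subscribe(self, name: str, handler: Callable[[dict], None]) -> None:
        self.handlers[name].append(handler)

    def publish(self, name: str, payload: dict) -> None:
        self.queue.append((name, payload))

    def drain(self) -> None:
        while self.queue:
            name, payload = self.queue.popleft()
            self.history.append(name)
            for handler in self.handlers[name]:
                handler(payload)


bus = InMemoryBus()
# Each service knows only the event it reacts to and the event it emits.
bus.subscribe("OrderPlaced", lambda e: bus.publish("InventoryReserved", e))
bus.subscribe("InventoryReserved", lambda e: bus.publish("PaymentAuthorized", e))
bus.subscribe("PaymentAuthorized", lambda e: bus.publish("ShipmentCreated", e))

bus.publish("OrderPlaced", {"order_id": "o-1"})
bus.drain()
```

Nothing in this code states the overall sequence. It exists only as the sum of three subscriptions, which is both the appeal and the risk.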
My rule of thumb:
- Use choreography for domain event propagation and loosely coupled reactions.
- Use orchestration for business process coordination with explicit milestones, deadlines, and compensations.
That is not dogma. It is scar tissue.
Architecture
Let us compare the architectures in practical terms.
Orchestration architecture
A typical orchestrated saga includes:
- Saga orchestrator / workflow engine
- Participating services with local transactions
- Command and event channels
- Saga state store
- Timeout and retry handling
- Compensation logic
- Operational dashboard
- Correlation and tracing
The orchestrator maintains the state machine of the process. It issues commands, consumes outcome events, records progress, and decides next actions. This can be built with custom code, but enterprises often use workflow engines such as Temporal, Camunda, Conductor, or a BPM/workflow platform. The choice depends on how explicit and durable you need the process state to be.
This is where domain-driven design matters. The orchestrator should align with a process-level domain concept, not become a dumping ground for every integration concern. A “Fulfillment Saga” is credible. A “Generic Event Router That Knows Everything” is not.
Choreography architecture
A choreography-based saga usually rests on:
- Event broker, often Kafka
- Event-producing services
- Event-consuming services
- Outbox pattern for reliable publication
- Idempotent handlers
- Correlation metadata
- Read model or process projection for observability
- Reconciliation services/jobs
Kafka is often central here because it offers durable event streams, replay, partitioning, and consumer groups. But Kafka does not solve process semantics. It solves transport and durability very well. Architects who confuse those layers usually discover the gap during failure recovery.
In choreography, no one service necessarily holds the full process view. So enterprises often add a process projection or saga monitor that listens to events and materializes a read model of business progress. That is useful, but it is worth saying plainly: once you build a service whose job is to infer process state from events, you are edging back toward orchestration in all but name.
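A process projection of this kind can be as small as a fold over the event stream. The sketch below assumes events arrive in order per order ID, and the state names in `STATE_BY_EVENT` are invented for illustration.

```python
from typing import Dict, Iterable, Tuple

# Assumed mapping from event name to the process state it implies.
STATE_BY_EVENT = {
    "OrderPlaced": "AWAITING_INVENTORY",
    "InventoryReserved": "AWAITING_PAYMENT",
    "PaymentAuthorized": "AWAITING_SHIPMENT",
    "ShipmentCreated": "CONFIRMED",
    "InventoryReservationRejected": "FAILED",
}


def project(events: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Fold (order_id, event_name) pairs into a current-state view per order."""
    view: Dict[str, str] = {}
    for order_id, event_name in events:
        state = STATE_BY_EVENT.get(event_name)
        if state is not None:  # ignore events the projection does not track
            view[order_id] = state
    return view
```

The mapping table is the giveaway: it encodes the process model that no individual service admits to owning.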
Domain semantics discussion
This is where many designs wobble.
The event names and command names must reflect business meaning:
- ReserveInventory
- InventoryReservationRejected
- PaymentAuthorizationExpired
- ShipmentCreationDeferred
Not technical mush like:
- ProcessStepCompleted
- EntityUpdated
- RecordChanged
A saga lives or dies by semantics. Compensation especially requires domain precision. “Undo” is rarely correct. If payment was captured, the compensation is not “delete payment.” It may be IssueRefund, which has accounting, timing, and customer communication consequences. If inventory was allocated to a scarce item, releasing it may trigger downstream backorder logic.
Distributed systems punish vague language because vague language becomes vague failure handling.
Migration Strategy
Nobody should rewrite a core order-to-cash platform in one swing just to get “proper sagas.” That is theater, not architecture.
The enterprise path is usually a progressive strangler migration.
Start by identifying a cross-service business process currently coordinated by:
- a monolith transaction,
- brittle point-to-point calls,
- batch reconciliation,
- or human back-office intervention.
Then peel it out gradually.
Step 1: Make the process visible
Before changing coordination, instrument the current process. Add correlation IDs. Emit domain events. Build a process view. Understand actual failure paths, not imagined ones.
Step 2: Stabilize publication with outbox
If services emit events, use the transactional outbox pattern so local state changes and event publication stay consistent. This is especially important with Kafka. Without an outbox or equivalent mechanism, you get the classic dual-write problem: the database commit succeeds but the event is never published, or the event is published for a change that never committed.
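A minimal sketch of the outbox idea, using SQLite from the Python standard library so it runs anywhere. The table layout, the `place_order` and `relay` functions, and the `send` callback (a stand-in for a broker producer) are all assumptions for illustration.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    # One local transaction covers both the state change and the pending event.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"type": "OrderPlaced", "order_id": order_id})),
        )


def relay(conn: sqlite3.Connection, send: Callable[[str, str], None]) -> int:
    """Publish pending rows in order; mark a row only after send succeeds."""
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        send(topic, payload)  # may raise; the row then stays pending and is retried
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

If the relay crashes between `send` and the update, the same row is sent again on the next run. The outbox gives at-least-once publication, which is why consumers still need to be idempotent.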
Step 3: Introduce a process boundary
Define the saga around a real domain concept:
- order fulfillment
- claims adjudication
- loan approval
- subscription activation
Name the states. Name the compensations. Name terminal and non-terminal outcomes. If the business cannot agree on these, you are not ready to automate the saga.
Step 4: Start with orchestration at the seam
A common migration move is to introduce an orchestrator around existing services without changing all internals immediately. The orchestrator can call legacy APIs, invoke newer microservices through commands, and produce a consistent process model while the estate evolves.
This works well because orchestration gives the migration a center of gravity. It lets you modernize incrementally while preserving process control.
Step 5: Push domain reactions outward where appropriate
Not every step needs orchestrator control forever. Some interactions are better expressed as domain event reactions. For example, analytics, customer notifications, or loyalty updates may subscribe to business events without being part of the core transaction path.
This leads to a healthy hybrid:
- orchestration for critical process control,
- choreography for side effects and downstream reactions.
Step 6: Add reconciliation deliberately
This is the part architects skip in the slide deck and operations pays for later.
A distributed saga needs reconciliation:
- detect stuck instances,
- compare authoritative records,
- repair process drift,
- trigger manual review when compensation is impossible.
Reconciliation is not an admission of failure. It is the mature acknowledgment that at enterprise scale, messages get delayed, contracts evolve, operators make mistakes, and external systems lie.
A good migration plan includes:
- periodic consistency scans,
- business-key-based reconciliation,
- re-drive tooling,
- poison message review,
- and clear ownership for manual exception handling.
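A periodic consistency scan can start very small. The sketch below compares saga state against what the payment service reports. The state names, the `find_drift` function, and the two finding labels are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

TERMINAL = {"CONFIRMED", "FAILED"}


def find_drift(
    saga_states: Dict[str, Tuple[str, datetime]],  # order_id -> (state, entered_at)
    payment_truth: Dict[str, str],                 # order_id -> status per payment service
    now: datetime,
    max_age: timedelta,
) -> List[Tuple[str, str]]:
    """Return (order_id, finding) pairs that need repair or manual review."""
    findings: List[Tuple[str, str]] = []
    for order_id, (state, entered_at) in sorted(saga_states.items()):
        if state not in TERMINAL and now - entered_at > max_age:
            findings.append((order_id, "stuck"))
        if state == "FAILED" and payment_truth.get(order_id) == "CAPTURED":
            findings.append((order_id, "failed-but-paid"))
    return findings
```

The scan is keyed by the business identifier rather than by message offsets, so it still works when the message trail itself is incomplete.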
Enterprise Example
Consider a global retailer modernizing order fulfillment.
The legacy world had a monolithic commerce platform handling order creation, payment authorization, stock allocation, and shipment booking in one transactional boundary—at least in theory. In practice, warehouse management and payment gateways were already external, so much of the “transaction” was little more than optimistic sequencing with nightly repair jobs.
The retailer moved to microservices and Kafka:
- Order Service
- Inventory Service
- Payment Service
- Shipping Service
- Customer Communication Service
- Fraud Service
The first design used pure event choreography. OrderPlaced triggered inventory reservation and fraud checks. InventoryReserved and FraudCleared independently influenced payment. PaymentAuthorized triggered shipping. Notification listened to everything.
It worked in lower environments. Of course it did.
In production, three issues emerged.
First, hidden process logic.
Payment needed to know not just that inventory was reserved, but also whether fraud had cleared and whether the order was split-ship eligible. That logic spread across event consumers. Small policy changes required touching multiple services.
Second, weak supportability.
Customer service needed to answer, “Why is my order pending?” There was no authoritative process state. Teams reconstructed order progress from Kafka topics and service logs. Support escalations became archaeology.
Third, compensation chaos.
When shipment creation failed after payment authorization, the system sometimes released inventory before refund logic completed. In high-demand scenarios, another customer could buy the same item before the original customer had been refunded. Technically consistent eventually; commercially infuriating immediately.
The retailer shifted to a central Fulfillment Saga Orchestrator for the critical path:
- Create order
- Run fraud screen
- Reserve inventory
- Authorize payment
- Create shipment
- Confirm order
Compensations became explicit:
- If shipment creation fails, void payment if still authorized, otherwise issue refund.
- Release inventory only after financial reversal reaches an acceptable state.
- Move the order into ManualReview if any irreversible step completed but compensation partially failed.
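These rules translate into ordered decision logic. The sketch below is one possible reading, not the retailer's actual implementation: the function name, the status strings, and the choice to keep the inventory hold while an order waits in manual review are assumptions.

```python
from typing import Callable, List


def compensate_failed_shipment(
    payment_status: str,
    void_payment: Callable[[], bool],   # returns True when the void succeeded
    issue_refund: Callable[[], bool],   # returns True when the refund succeeded
) -> List[str]:
    """Return the ordered compensating actions for one order."""
    actions: List[str] = []
    if payment_status == "AUTHORIZED":
        actions.append("VoidPayment")
        reversed_ok = void_payment()
    elif payment_status == "CAPTURED":
        actions.append("IssueRefund")
        reversed_ok = issue_refund()
    else:
        reversed_ok = True  # nothing financial to reverse

    if reversed_ok:
        # Inventory is released only after the financial reversal succeeded.
        actions.append("ReleaseInventory")
        actions.append("MarkOrderFailed")
    else:
        # Assumption: keep the inventory hold until a human decides.
        actions.append("MoveToManualReview")
    return actions
```

The ordering is the point: the stock does not go back on sale while the original customer's money is still in limbo.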
Kafka remained important, but not as the brain. It became the durable event backbone for state change publication, integration, and downstream consumers. Notification, analytics, recommendation, and loyalty remained choreographed listeners. The core revenue process became orchestrated.
The result was not ideological purity. It was operational sanity.
That is the enterprise lesson. The best architecture often uses both styles, but assigns them different jobs.
Operational Considerations
This is where designs earn their keep.
Idempotency
Every command handler and event consumer should tolerate duplicates. Kafka consumers can reprocess. Networks retry. Operators replay messages. If a compensation runs twice, it must not create a second refund.
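A minimal sketch of an idempotent refund handler follows. The `RefundHandler` class is hypothetical, and it keeps processed message IDs in memory. In production the deduplication record and the business effect would need to commit in the same local transaction.

```python
from typing import Dict, Set


class RefundHandler:
    def __init__(self) -> None:
        self.processed: Set[str] = set()   # stand-in for a durable table keyed by message ID
        self.refunds: Dict[str, int] = {}  # order_id -> total refunded, in cents

    def handle(self, message_id: str, order_id: str, amount_cents: int) -> bool:
        """Apply the refund once; return False when the message is a duplicate."""
        if message_id in self.processed:
            return False
        self.refunds[order_id] = self.refunds.get(order_id, 0) + amount_cents
        self.processed.add(message_id)
        return True
```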
Correlation and causation
Use business correlation IDs consistently:
- order ID
- claim ID
- loan application ID
Also track causation IDs so you know which command or event triggered the next action. This matters for debugging and audit.
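One lightweight way to carry both identifiers is a message envelope. The `Message` type and its field names below are assumptions for illustration, not a standard.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Message:
    name: str
    correlation_id: str                 # business key, for example the order ID
    causation_id: Optional[str] = None  # ID of the message that triggered this one
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def caused(self, name: str) -> "Message":
        """Create a follow-up message that inherits the correlation ID."""
        return Message(
            name=name,
            correlation_id=self.correlation_id,
            causation_id=self.message_id,
        )


placed = Message("OrderPlaced", correlation_id="order-42")
reserved = placed.caused("InventoryReserved")
```

Following `causation_id` links backwards reconstructs the chain of triggers for a single order, while `correlation_id` groups everything that belongs to it.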
Timeouts and deadlines
Long-running sagas need timers:
- payment authorization expires,
- inventory reservation hold expires,
- customer confirmation window closes,
- fraud review SLA breaches.
Timeouts are business events. Model them explicitly.
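Modelling a timeout explicitly can mean turning an expired deadline into a named event. In this sketch only `PaymentAuthorizationExpired` comes from the naming examples earlier. The other event names, the state names, and the `due_timeouts` function are invented for illustration.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Assumed mapping: waiting state -> event to emit when its deadline passes.
TIMEOUT_EVENT = {
    "AWAITING_PAYMENT": "PaymentAuthorizationExpired",
    "INVENTORY_HELD": "InventoryReservationExpired",
    "AWAITING_CUSTOMER": "CustomerConfirmationWindowClosed",
    "FRAUD_REVIEW": "FraudReviewSlaBreached",
}


def due_timeouts(
    waiting: Dict[str, Tuple[str, datetime]],  # saga_id -> (state, deadline)
    now: datetime,
) -> List[Tuple[str, str]]:
    """Return (saga_id, timeout_event) for every saga whose deadline has passed."""
    return [
        (saga_id, TIMEOUT_EVENT[state])
        for saga_id, (state, deadline) in sorted(waiting.items())
        if state in TIMEOUT_EVENT and deadline <= now
    ]
```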
Observability
Distributed tracing is useful but insufficient. You need business observability:
- current saga state,
- age in state,
- compensation status,
- retry count,
- manual intervention queue,
- terminal outcome rate.
Backpressure and throughput
An orchestrator must scale, but the harder issue is downstream capacity. If shipping is degraded, the saga should not blindly flood retries. Introduce circuit breakers, controlled retry policies, and queue-based smoothing.
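A controlled retry policy and a circuit breaker can be sketched in a few lines. The class below is a simplified illustration with an injected clock, not a production library, and the thresholds and delays are arbitrary example values.

```python
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff in seconds, capped so retries never wait unboundedly."""
    return min(cap, base * 2 ** attempt)


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_after: float,
                 clock: Callable[[], float]) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True when a call to the downstream service may be attempted."""
        if self.opened_at is None:
            return True
        # After the cool-down, let calls through again to probe for recovery.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed probe
```

While the breaker is open, the saga would typically stay in its waiting state instead of sending more commands to the degraded service.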
Data retention and replay
Kafka replay is powerful, but replaying a saga stream without guardrails can retrigger historical actions. Keep command side effects separate from event history reprocessing. Design replays for rebuilding projections, not accidentally reshipping orders.
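One way to enforce that separation is to keep the projection update and the side-effecting reaction in different functions, and to make the rebuild path call only the former. The function names and the `ship` callback below are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

Event = Tuple[str, str]  # (order_id, event_name)


def project(view: Dict[str, str], event: Event) -> None:
    """Pure read-model update: safe to repeat any number of times."""
    order_id, name = event
    view[order_id] = name


def react(event: Event, ship: Callable[[str], None]) -> None:
    """Side effect: must run only on the live path."""
    order_id, name = event
    if name == "PaymentAuthorized":
        ship(order_id)


def handle_live(view: Dict[str, str], event: Event, ship: Callable[[str], None]) -> None:
    project(view, event)
    react(event, ship)


def rebuild(history: Iterable[Event]) -> Dict[str, str]:
    """Replay history into a fresh view without triggering any reactions."""
    view: Dict[str, str] = {}
    for event in history:
        project(view, event)
    return view
```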
Reconciliation
It deserves repeating. Reconciliation is the safety net:
- compare saga state against service truth,
- identify orphaned reservations,
- identify paid-but-unshipped orders,
- identify shipped-but-unconfirmed orders,
- and repair via automated or manual actions.
In large enterprises, reconciliation is not a patch. It is a permanent capability.
Tradeoffs
There is no winner in the abstract.
Why choose orchestration
- Clear process ownership
- Better visibility and auditability
- Easier reasoning about branching, deadlines, and compensations
- Faster support diagnostics
- Better fit for regulated or high-value workflows
Costs of orchestration
- Central coordinator can become a change bottleneck
- Risk of process logic over-centralization
- Added platform component and operational dependency
- Teams may feel less autonomous if the orchestrator leaks into service internals
Why choose choreography
- Strong decoupling for event-driven reactions
- Good fit for simple, emergent collaboration
- Natural alignment with Kafka-centric architectures
- Avoids a single process engine becoming the center of everything
Costs of choreography
- Process logic becomes implicit and scattered
- Harder debugging and supportability
- More difficult compensation coordination
- Audit and compliance need extra work
- Greater risk of cyclic event dependencies and semantic drift
The tradeoff is not centralized versus decentralized. The tradeoff is explicit coordination versus emergent coordination. Enterprises should be cautious about emergent coordination in revenue-critical paths. Emergence is charming in bird flocks. Less so in payment disputes.
Failure Modes
This is the section most articles soften. I will not.
1. The “god orchestrator”
The orchestrator starts coordinating one business process, then slowly absorbs business rules from every domain. Soon it knows pricing policy, fraud thresholds, warehouse allocation strategy, and customer segmentation.
That is not orchestration. That is a distributed monolith with better branding.
Prevention: keep the orchestrator focused on process sequencing and policy that truly belongs to the end-to-end workflow.
2. Event soup
In choreography, services emit poorly named events and consume each other’s internal lifecycle signals. Dependencies multiply. Nobody can say which events are stable domain contracts and which are implementation leakage.
Prevention: define event contracts with domain language and bounded context discipline.
3. Broken compensation assumptions
Teams assume every action can be undone. It cannot. Emails cannot be unread. Shipments cannot be unshipped once handed to a carrier. Settled funds may not reverse instantly.
Prevention: model compensations as real business actions with constraints and costs.
4. Inconsistent publication
A service updates its database but fails before publishing the event. Or publishes before commit and then rolls back. Now the saga state is wrong.
Prevention: outbox pattern, transactional messaging mechanisms, or equivalent durable publication design.
5. Replay disasters
An event replay rebuilds a projection but accidentally re-executes command side effects. Refunds are issued again. Shipments duplicate. Chaos follows.
Prevention: strict separation of projection rebuild and side-effect execution, plus idempotent handlers.
6. Stuck sagas
A participant never responds, a timeout is missing, or a message is parked in a dead-letter queue without ownership. The saga hangs indefinitely.
Prevention: explicit deadlines, monitoring, and operational runbooks.
7. Semantic version drift
A service evolves event meaning without careful versioning. Downstream consumers misinterpret the event. The saga path forks silently.
Prevention: contract governance, compatibility strategy, and consumer-aware evolution.
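One small piece of a compatibility strategy is for consumers to refuse to guess. The sketch below assumes events carry `type` and `version` fields. The registry contents and the `triage` function are hypothetical.

```python
from typing import Dict, Set

# Assumed registry: event type -> schema versions this consumer understands.
SUPPORTED_VERSIONS: Dict[str, Set[int]] = {
    "PaymentAuthorized": {1, 2},
    "InventoryReserved": {1},
}


def triage(event: dict) -> str:
    """Decide whether to process, ignore, or park an incoming event."""
    versions = SUPPORTED_VERSIONS.get(event.get("type", ""))
    if versions is None:
        return "ignore"   # not a contract this consumer subscribes to
    if event.get("version") not in versions:
        return "park"     # unknown or missing version: route to review, do not guess
    return "process"
```

Parking an unrecognized version turns a silent fork into a visible queue that someone owns.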
When Not To Use
Not every distributed interaction needs a saga.
Do not use saga orchestration or choreography when:
A simple synchronous transaction is enough
If the process lives inside one service boundary and one database, use a local transaction. Do not light up Kafka and workflow engines to impress the architecture review board.
The domain can tolerate asynchronous eventual consistency without compensation
For many downstream processes—analytics, search indexing, recommendations—a failed update can simply be retried or rebuilt. No need for saga semantics.
The interaction is really just RPC dressed as events
If one service calls another immediately and expects a quick answer, orchestration may still be useful, but pure event choreography may only add latency and confusion.
The team lacks operational maturity
If you do not yet have idempotency, tracing, outbox, DLQ handling, schema governance, and reconciliation discipline, then a broad choreography strategy is a gamble. Start smaller. Prefer explicit orchestration in the critical path.
The workflow is human-centric and document-heavy
In some enterprise cases, a BPM-style workflow or case management platform is simply a better fit than a custom saga model. Not every long-running process is best implemented as event choreography or code-level orchestration.
Related Patterns
A saga rarely stands alone. It usually depends on adjacent patterns.
Transactional Outbox
Ensures local data changes and event publication remain consistent. Essential in Kafka-heavy microservice systems.
Inbox / Idempotent Consumer
Prevents duplicate processing from causing duplicate business effects.
Process Manager
Often the implementation style behind orchestration. Manages long-running state transitions and commands.
Event Sourcing
Can pair with sagas, but do not assume they belong together. Event sourcing gives a history of aggregate state changes; saga coordination spans multiple aggregates and services.
CQRS
Useful for creating process projections and operational dashboards, especially in choreographed systems.
Strangler Fig Pattern
A practical migration strategy for introducing orchestrated or choreographed coordination around legacy systems incrementally.
Reconciliation Batch / Repair Workflow
The enterprise safety net. Detects and resolves mismatches after the fact.
Try-Confirm-Cancel
A narrower coordination pattern than a full saga, often useful where resources can be tentatively reserved and either confirmed or canceled.
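The tentative-reservation idea can be illustrated with a small in-memory ledger. The `StockLedger` class and its method names are assumptions for this sketch.

```python
from typing import Dict


class StockLedger:
    def __init__(self, available: int) -> None:
        self.available = available
        self.held: Dict[str, int] = {}  # reservation_id -> quantity on hold

    def try_reserve(self, reservation_id: str, qty: int) -> bool:
        """Try: place a tentative hold if enough stock is available."""
        if qty > self.available:
            return False
        self.available -= qty
        self.held[reservation_id] = qty
        return True

    def confirm(self, reservation_id: str) -> None:
        """Confirm: the held stock leaves for good."""
        self.held.pop(reservation_id, None)

    def cancel(self, reservation_id: str) -> None:
        """Cancel: return the held stock; calling it twice has no further effect."""
        self.available += self.held.pop(reservation_id, 0)
```

Because the resource owner supports the tentative state natively, cancel is a cheap release rather than a compensating business action with its own consequences.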
Summary
The orchestration-versus-choreography debate is often framed as central control versus distributed elegance. That framing is too shallow.
The real design question is this: where should business process knowledge live, and how explicit should it be?
If the process is mission-critical, policy-heavy, auditable, and expensive to get wrong, orchestration is usually the safer and clearer choice. It gives the enterprise a source of truth for intent, progress, compensation, and intervention. It works particularly well in order management, claims, payments, and regulated workflows.
If the interaction is mostly a set of independent domain reactions, choreography can be excellent. Kafka-based event streams, stable domain events, and autonomous consumers create flexible architectures for downstream propagation and decoupled behavior.
Most large enterprises end up with a hybrid:
- orchestration for the core business transaction path
- choreography for side effects, derived reactions, and broader event-driven integration
That hybrid is not compromise. It is maturity.
The deeper lesson is domain-driven design. Model the saga around business semantics, bounded contexts, compensations, deadlines, and reconciliation. Name things in the language of the business. Decide deliberately where process ownership sits. And never confuse message transport with coordination.
A distributed system is not a dance just because you use events. Sometimes it needs a conductor. Sometimes it needs a jazz ensemble. The architectural craft is knowing which room you are in before the music starts.
Frequently Asked Questions
What is a service mesh?
A service mesh is an infrastructure layer managing service-to-service communication. It provides mutual TLS, load balancing, circuit breaking, retries, and observability without each service implementing these capabilities. Istio and Linkerd are common implementations.
How do you document microservices architecture for governance?
Use ArchiMate Application Cooperation diagrams for the service landscape, UML Component diagrams for internal structure, UML Sequence diagrams for key flows, and UML Deployment diagrams for Kubernetes topology. All views can coexist in Sparx EA with full traceability.
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.