Modeling Long-Running Processes with Sagas

Distributed systems do not fail all at once. They fray.

A payment gets authorized, but inventory never reserves. A shipment is created, but the customer profile update times out. A cancellation request arrives halfway through fulfillment and nobody can say, with confidence, what “current state” even means. This is the everyday mess of long-running business processes in modern enterprises: not dramatic collapse, but a thousand paper cuts across service boundaries.

That is where sagas earn their keep.

Not because they are fashionable. Not because microservices somehow require them. And certainly not because they offer some magical replacement for transactions. They matter because businesses run on processes that outlive a single request, cross bounded contexts, and need to survive partial failure without pretending the world is atomic.

A saga is, at heart, an explicit model of a business journey that unfolds over time. It accepts a hard truth: in distributed systems, consistency is often negotiated rather than imposed. If you treat a long-running process as if it were a single database transaction, the system will eventually embarrass you. If you model it as a sequence of meaningful domain steps, each with its own completion and compensation semantics, you stand a chance of building something that works under real load, with real organizations, and real failure.

That last point matters. Sagas are not just a technical pattern. They are a domain pattern wearing technical clothes.

Context

Most enterprises do not start with sagas. They start with an application that owns everything: order entry, payments, inventory, shipping, invoicing, customer notifications. Then scale, organizational pressure, and delivery speed start to pull that monolith apart. Teams split around business capabilities. Separate services emerge. Kafka appears. Databases multiply. What used to be a local transaction becomes a cross-system process.

Suddenly, “place order” is no longer a method call. It is a business conversation.

One service validates the order. Another reserves stock. A payment service obtains authorization. A fulfillment service creates a shipment. A customer communications service sends a confirmation. Finance records the receivable. Fraud may intervene. Cancellation can happen at almost any moment. The process may take milliseconds in the happy path or hours when human review enters the picture.

This is exactly the kind of environment where domain-driven design becomes practical rather than philosophical. Once the estate is split into bounded contexts, each context owns its own model, data, and invariants. Inventory cares about reservation windows and stock positions. Payments cares about authorization, capture, reversal, and settlement. Fulfillment cares about pick, pack, and ship. Trying to coordinate all of that with one global transaction is not architecture. It is denial.

Sagas let us coordinate across those bounded contexts while preserving local autonomy.

Problem

The core problem is simple to state and painful to solve:

How do you execute a business process that spans multiple services and databases, where each step may succeed or fail independently, without distributed locking and without losing business meaning?

Traditional ACID transactions solve this inside a single database boundary. Two-phase commit tries to solve it across systems, but in most enterprise environments it is brittle, expensive, and operationally unpopular. It couples availability to coordination. It also assumes all participants can and should join the same transaction protocol, which is more theory than reality once SaaS platforms, event streams, legacy systems, and third-party APIs enter the picture.

So we need another model.

The saga model breaks a long-running process into a sequence of local transactions. Each local transaction updates one service’s state and publishes an event or sends a command that triggers the next step. If something goes wrong, the saga does not roll back in the database sense. It executes compensating actions where possible and reconciliation where necessary.

That distinction is crucial. Compensation is not rollback. You cannot “unship” a parcel that left the warehouse with the same elegance as undoing an INSERT. You can initiate a return, issue a refund, notify operations, and post an accounting correction. The business world has scars. Sagas reflect that reality.
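
To make the mechanics concrete, here is a minimal in-process sketch of that model: steps run in order, and a failure triggers the compensations of completed steps in reverse. The names Step and run_saga are hypothetical, every exception is treated as a business failure for brevity, and a real saga would persist its progress between steps rather than hold it in memory.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]                       # one local transaction
    compensate: Optional[Callable[[dict], None]] = None  # business-level undo, if any exists


def run_saga(steps: List[Step], ctx: dict) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["failed_step"] = step.name
            ctx["reason"] = str(exc)
            for done in reversed(completed):
                if done.compensate is not None:
                    done.compensate(ctx)
                else:
                    # No compensation exists: hand the step to reconciliation.
                    ctx.setdefault("needs_reconciliation", []).append(done.name)
            return "COMPENSATED"
    return "COMPLETED"


log: List[str] = []


def reserve(ctx: dict) -> None:
    log.append("reserve_inventory")


def release(ctx: dict) -> None:
    log.append("release_inventory")


def authorize(ctx: dict) -> None:
    raise RuntimeError("payment declined")


steps = [
    Step("reserve_inventory", reserve, release),
    Step("authorize_payment", authorize),
]
ctx: dict = {}
outcome = run_saga(steps, ctx)
print(outcome, log, ctx["failed_step"])
```

Note that the compensation is a new forward action (release the reservation), not a restored snapshot, and that a step without a compensation is surfaced for reconciliation instead of being silently skipped.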

Forces

Several competing forces shape any saga design.

Domain integrity versus service autonomy

Each bounded context wants to preserve its own invariants. Payments must never capture more than authorized. Inventory must not oversell. Fulfillment should not ship canceled orders. Yet no single service can own the full process. Saga design is the art of coordinating these local truths without inventing a fake global one.

Responsiveness versus certainty

Users expect quick feedback. The business wants correctness. A saga often means the system can acknowledge a request before every downstream step is complete. That improves responsiveness and resilience, but it creates an intermediate state that product teams must understand and expose clearly: Order Pending Confirmation is not a bug. It is a business state.

Availability versus consistency

This is the old distributed systems bargain in enterprise dress. If the inventory service is slow, do we block order intake or accept the order and resolve later? Some businesses choose hard reservation before confirmation. Others accept oversell risk and rely on backorder or substitution. This is not a technical choice alone. It is a policy choice encoded in architecture.

Orchestration versus choreography

Should one component direct the process step by step, or should services react to events and infer what to do next? Orchestration provides visibility and explicit flow. Choreography reduces central control and can align well with autonomous teams. Both can go wrong in familiar ways: orchestration becomes a god service; choreography becomes an invisible pinball machine.

Technical failure versus business failure

A timeout is not the same as a declined payment. A duplicate message is not the same as a canceled order. Good saga models separate infrastructure concerns from domain outcomes. If you blur them, operators get false alarms and business users get nonsense states.
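
One way to keep the two apart in code is to let technical errors drive retries while only domain outcomes leave the function. The sketch below is an illustration under assumptions: GatewayTimeout, Outcome, and authorize are hypothetical names, and the retry is only safe if the payment provider honors the idempotency key.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    AUTHORIZED = "PaymentAuthorized"
    DECLINED = "PaymentDeclined"
    PENDING = "PaymentAuthorizationPending"


class GatewayTimeout(Exception):
    """Technical failure: we do not know what the provider did."""


def authorize(call_gateway: Callable[[str], bool],
              idempotency_key: str,
              max_attempts: int = 3) -> Outcome:
    for _ in range(max_attempts):
        try:
            approved = call_gateway(idempotency_key)
        except GatewayTimeout:
            # Infrastructure concern: retry with the SAME key so a retry
            # cannot create a second authorization (assumed provider behavior).
            continue
        # Business outcome: a decline is a valid answer, not an error.
        return Outcome.AUTHORIZED if approved else Outcome.DECLINED
    # Still uncertain after all attempts: report a pending state, not a failure.
    return Outcome.PENDING


attempts = {"n": 0}


def flaky_gateway(key: str) -> bool:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise GatewayTimeout()
    return True


result = authorize(flaky_gateway, "order-784931-auth")
print(result.value, attempts["n"])
```

The caller never sees a timeout. It sees an authorization, a decline, or an honest "we do not know yet".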

Solution

A saga models a long-running process as a series of domain-relevant steps, each executed in a local transaction, with explicit transitions for success, failure, timeout, retry, and compensation.

This sounds tidy on paper. In practice, the winning move is to anchor the saga in business language, not transport mechanics.

Instead of saying:

service A emits event X
service B consumes and emits Y
service C retries Z

say:

order is accepted
inventory is reserved
payment is authorized
fulfillment is instructed
order is confirmed

The first list is implementation. The second list is architecture.

A saga has a timeline, not just a workflow. Some actions happen immediately; some can wait; some expire; some are retried; some need human intervention. This is why the pattern is especially effective for long-running processes. It gives time a first-class role in the model.

Here is a simplified order saga:

[Diagram 1: simplified order saga, happy path]

And when things go wrong:

[Diagram 2: order saga with failure and compensation paths]

Notice what is happening here. We are not restoring a pristine pre-transaction snapshot. We are progressing the business process into a new, valid outcome after failure. That is how grown-up enterprise systems behave.

Architecture

There are two dominant saga styles: orchestration and choreography.

Orchestrated saga

An orchestrator holds the process state and issues commands to participants. It knows the timeline, expected responses, timeout windows, and compensation paths.

This is usually my default recommendation for complex enterprise processes. Not because choreography is wrong, but because explicit process state is gold when compliance, support, and operability matter. The larger the organization, the more someone eventually asks, “Where exactly is order 784931?” An orchestrator gives you one place to answer.

The orchestrator need not be a heavyweight BPM suite. In many cases it is a dedicated process manager service with a durable state store, command/event handlers, timeout scheduling, and audit history.

Key responsibilities include:

storing saga state
correlating messages
enforcing sequencing rules
handling retries and timeouts
triggering compensations
surfacing process status to users and operators
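
A rough sketch of those responsibilities, assuming an in-memory state store and a list standing in for the command bus. OrderOrchestrator, the state names, and the command names are all hypothetical; retries, timeouts, and durable storage are left out to keep the shape visible.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SagaInstance:
    saga_id: str
    state: str = "STARTED"
    history: List[str] = field(default_factory=list)


class OrderOrchestrator:
    # (current state, incoming event) -> (next state, command to send or None)
    FLOW: Dict[Tuple[str, str], Tuple[str, Optional[str]]] = {
        ("STARTED", "OrderAccepted"): ("RESERVING_INVENTORY", "ReserveInventory"),
        ("RESERVING_INVENTORY", "InventoryReserved"): ("AUTHORIZING_PAYMENT", "AuthorizePayment"),
        ("RESERVING_INVENTORY", "InventoryUnavailable"): ("REJECTED", None),
        ("AUTHORIZING_PAYMENT", "PaymentAuthorized"): ("INSTRUCTING_FULFILLMENT", "CreateFulfillmentRequest"),
        ("AUTHORIZING_PAYMENT", "PaymentDeclined"): ("RELEASING_INVENTORY", "ReleaseInventory"),
        ("RELEASING_INVENTORY", "InventoryReleased"): ("REJECTED", None),
        ("INSTRUCTING_FULFILLMENT", "FulfillmentRequested"): ("CONFIRMED", None),
    }

    def __init__(self) -> None:
        self.sagas: Dict[str, SagaInstance] = {}  # stand-in for a durable state store
        self.sent: List[Tuple[str, str]] = []     # stand-in for a command bus

    def handle(self, saga_id: str, event: str) -> str:
        # Correlate the event to its saga instance by ID.
        saga = self.sagas.setdefault(saga_id, SagaInstance(saga_id))
        key = (saga.state, event)
        if key not in self.FLOW:
            # Out-of-sequence or duplicate event: record it, do not act on it.
            saga.history.append(f"ignored {event} in {saga.state}")
            return saga.state
        saga.state, command = self.FLOW[key]
        saga.history.append(f"{event} -> {saga.state}")
        if command is not None:
            self.sent.append((saga_id, command))
        return saga.state


orch = OrderOrchestrator()
for event in ["OrderAccepted", "InventoryReserved", "PaymentDeclined", "InventoryReleased"]:
    orch.handle("order-784931", event)
print(orch.sagas["order-784931"].state)
print(orch.sent)
```

The transition table is the useful part: sequencing rules and compensation paths sit in one readable place, and the history answers the "where is this order" question.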

Choreographed saga

In choreography, services emit domain events and other services react. There is no central conductor. The process emerges from collaboration.

This can be elegant for simpler, high-volume flows where the sequence is stable and responsibilities are cleanly partitioned. Kafka is often the backbone here: one service publishes OrderPlaced, another emits InventoryReserved, another reacts with PaymentAuthorized, and so on.
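
As a toy illustration of that flow, the sketch below wires three handlers to a synchronous in-process bus. EventBus and the handler functions are hypothetical stand-ins; with Kafka each handler would be a separate consumer in its own service, and delivery would be asynchronous.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class EventBus:
    """Synchronous in-process stand-in for a broker, for illustration only."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.log: List[str] = []

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in list(self._handlers[event_type]):
            handler(payload)


bus = EventBus()


def inventory_service(order: dict) -> None:
    # Reacts to OrderPlaced; knows nothing about payments.
    bus.publish("InventoryReserved", order)


def payment_service(order: dict) -> None:
    # Reacts to InventoryReserved; knows nothing about order confirmation.
    bus.publish("PaymentAuthorized", order)


def order_service(order: dict) -> None:
    order["status"] = "CONFIRMED"


bus.subscribe("OrderPlaced", inventory_service)
bus.subscribe("InventoryReserved", payment_service)
bus.subscribe("PaymentAuthorized", order_service)

order = {"order_id": "o-1", "status": "PENDING"}
bus.publish("OrderPlaced", order)
print(bus.log, order["status"])
```

Notice that the overall sequence exists only in the subscriptions. No single piece of code states the flow, which is both the appeal and the risk.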

But choreography has a habit of aging badly when the process becomes more nuanced. Add timeout handling, partial fulfillment, fraud review, and region-specific compliance, and soon nobody can draw the flow without opening six codebases and three topic subscriptions. Event-driven architecture is powerful. It is also a wonderful place to hide complexity.

My rule of thumb is blunt: if the business process has meaningful branching, deadlines, operator visibility needs, or compensation complexity, lean toward orchestration.

Kafka and saga design

Kafka fits naturally into saga architectures, especially as an event backbone between bounded contexts. It gives durable messaging, replay, partitioning, and decoupling. It does not solve process semantics for you.

That is worth repeating. Kafka is a pipe, not a policy.

To use Kafka well in a saga:

publish domain events, not technical noise
include stable correlation identifiers
make consumers idempotent
design for duplicates and reordering
distinguish command topics from event topics if needed
retain enough history for replay and investigation
use an outbox pattern to avoid dual-write inconsistencies

The outbox pattern is particularly important. If a service updates its local database and publishes to Kafka in separate steps, one can succeed while the other fails. Then the saga state drifts from reality. Writing the event into an outbox table within the same local transaction, then asynchronously publishing it, is often the practical answer.
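
A minimal sketch of the idea using Python's built-in sqlite3 module, with an in-memory database standing in for the service's own store. The table layout, accept_order, and relay_once are assumptions for illustration, and the publish callback stands in for a real Kafka producer.

```python
import json
import sqlite3
from typing import Callable

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "topic TEXT NOT NULL, payload TEXT NOT NULL, "
    "published INTEGER NOT NULL DEFAULT 0)"
)


def accept_order(order_id: str) -> None:
    event = {"type": "OrderAccepted", "order_id": order_id}
    # One local transaction: the state change and the event commit together
    # or not at all ("with conn" commits on success, rolls back on error).
    with conn:
        conn.execute(
            "INSERT INTO orders (order_id, status) VALUES (?, ?)",
            (order_id, "ACCEPTED"),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps(event)),
        )


def relay_once(publish: Callable[[str, str], None]) -> int:
    """Publish pending outbox rows in order; returns how many were pending."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


accept_order("o-1")
published = []
relay_once(lambda topic, payload: published.append((topic, payload)))
print(published)
```

If the relay crashes after publishing but before marking the row, the event is sent again on the next pass. The outbox gives at-least-once publication, which is why it is paired with idempotent consumers.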

Domain semantics first

This is the point many teams miss.

A saga should model domain milestones, not just integration steps. “Inventory reservation expired” is a domain event. “HTTP 504 from inventory endpoint” is not. The latter may influence retry logic, but it should not leak into the core language of the process.

This is straight DDD thinking. Ubiquitous language matters because sagas sit across contexts. If you do not get the words right, teams implement conflicting semantics under the same labels. “Confirmed,” “authorized,” “booked,” and “submitted” are not synonyms once auditors and customers get involved.

Migration Strategy

Nobody sensible replaces a working monolith with a fully distributed saga platform in one move. That is not modernization. That is self-harm.

The practical path is progressive strangler migration.

Start by identifying one long-running process that already spans awkward module boundaries or external systems. Orders are classic. Claims processing, customer onboarding, loan origination, and returns management are also good candidates. Then carve out one bounded context at a time, preserving behavior while gradually externalizing coordination.

A typical progression looks like this:

1. Model the process explicitly inside the monolith

Before splitting services, make the process visible. Introduce states, transitions, timeouts, and compensation semantics in one codebase. If you cannot model it clearly in the monolith, distributing it will only make confusion travel faster.

2. Extract one participant service

Inventory or payments is often first. Keep the process coordinator in the monolith initially. Replace internal calls with commands/events to the new service.

3. Introduce durable messaging

Add Kafka or another broker for asynchronous communication where response times and resilience demand it. Use correlation IDs from day one.

4. Add an outbox and idempotent consumers

This is the plumbing that stops distributed systems from lying to you.

5. Move orchestration into a dedicated service

Only once the process logic is stable and external interactions are understood should you extract the orchestrator. At that point the monolith becomes one participant among others.

6. Strangle remaining dependencies

Fulfillment, notification, finance, fraud, customer preferences: pull them out as bounded contexts when the business case is real.

This migration path works because it respects both technical and organizational gravity. Teams can learn the domain semantics and operational behavior incrementally. Architecture should absorb change, not stage a coup.

Reconciliation during migration

Reconciliation deserves its own section because enterprises live on it, even when architects pretend they do not.

In the early migration stages, some systems will remain synchronous, some event-driven, and some batch-fed. There will be drift. Messages will arrive late. Downstream systems will process duplicates. Legacy platforms may only expose nightly exports. The answer is not to demand perfection. The answer is to build reconciliation as a first-class capability.

That means:

authoritative process state with audit history
periodic scans for incomplete or inconsistent sagas
reports of orphaned reservations, uncaptured authorizations, and unfulfilled shipments
compensating actions or manual work queues
replay support from Kafka or event stores
business-owned exception handling policies

Reconciliation is what turns sagas from elegant diagrams into enterprise-grade systems. Every architect loves the happy path. Operations pays for the other paths.

Enterprise Example

Consider a global retailer modernizing order processing across e-commerce, stores, and third-party marketplaces.

Originally, one commerce platform handled everything in a relational database. As channels expanded, the retailer split capabilities into bounded contexts:

Order Management
Inventory Availability
Payments
Fulfillment
Customer Communications
Finance Posting

Inventory availability was moved first because stock logic differed sharply by region and channel. Payments followed due to PCI isolation needs. Kafka became the backbone for events across regions. A dedicated order saga orchestrator was introduced only after the first two service extractions revealed how much implicit process logic had been hiding in the old application.

The core order saga looked roughly like this:

accept customer order
reserve inventory in relevant fulfillment node
authorize payment
create fulfillment request
confirm order
send notification
post finance event

But the real value came from modeling the exceptions:

if inventory reservation fails, reject or backorder based on product policy
if payment authorization fails, release reservation
if fulfillment creation times out, hold order in review state and retry
if a cancellation arrives before shipment, issue release and void authorization
if shipment already occurred, trigger return workflow and refund path

A specific failure mode exposed the importance of domain semantics. In one country, payment providers returned asynchronous authorization results several seconds after the initial request. Early designs treated this as a technical timeout and retried, producing duplicate authorizations. The corrected model introduced a distinct business state: Payment Authorization Pending. Once that existed, the whole process became cleaner. Fulfillment would not proceed until the payment state settled, and the UI could inform customers honestly instead of faking an instant result.

That is a good example of architecture improving once the domain language gets serious.

Operationally, the retailer also implemented a reconciliation service that scanned for:

inventory reserved but no payment outcome after 15 minutes
payment authorized but no fulfillment request after 5 minutes
shipment created for canceled orders
finance posting missing after confirmation

None of those conditions are exotic. They are what a distributed order pipeline looks like on Tuesday.
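
The four checks above could be expressed as a small scan over a per-order read model. This is a sketch under assumptions: OrderView and its fields are hypothetical, the 15 and 5 minute windows come from the list above, and the finance check has no grace period only because the text does not state one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class OrderView:
    order_id: str
    inventory_reserved_at: Optional[datetime] = None
    payment_outcome_known: bool = False
    payment_authorized_at: Optional[datetime] = None
    fulfillment_requested: bool = False
    canceled: bool = False
    shipment_created: bool = False
    confirmed: bool = False
    finance_posted: bool = False


def find_anomalies(o: OrderView, now: datetime) -> List[str]:
    issues: List[str] = []
    if (o.inventory_reserved_at is not None and not o.payment_outcome_known
            and now - o.inventory_reserved_at > timedelta(minutes=15)):
        issues.append("inventory_reserved_no_payment_outcome")
    if (o.payment_authorized_at is not None and not o.fulfillment_requested
            and now - o.payment_authorized_at > timedelta(minutes=5)):
        issues.append("payment_authorized_no_fulfillment_request")
    if o.canceled and o.shipment_created:
        issues.append("shipment_created_for_canceled_order")
    # A real scan would likely allow a grace period here; none is assumed.
    if o.confirmed and not o.finance_posted:
        issues.append("finance_posting_missing")
    return issues


now = datetime(2024, 1, 1, 12, 0)
stuck = OrderView("o-1", inventory_reserved_at=datetime(2024, 1, 1, 11, 40))
fresh = OrderView("o-2", inventory_reserved_at=datetime(2024, 1, 1, 11, 50))
print(find_anomalies(stuck, now), find_anomalies(fresh, now))
```

What happens to each finding (automatic compensation, retry, or a manual work queue) is a business policy decision, which is why the scan only reports.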

Operational Considerations

A saga architecture lives or dies in operations.

Observability

You need end-to-end traceability by saga instance, not just per service. Every command, event, state transition, retry, timeout, and compensation should be correlated. A support analyst should be able to look up an order and see a coherent timeline.

At minimum, capture:

saga ID
business key such as order number
current state
transition history
last successful step
pending step
retry count
timeout deadlines
compensation actions taken

Logs alone are not enough. Build a queryable process view.
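
As one possible shape for that process view, the record below carries the fields listed above. SagaRecord and its method names are hypothetical; in practice this would be a table or document in the orchestrator's store rather than an in-memory object.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class SagaRecord:
    saga_id: str
    business_key: str                      # e.g. the order number
    current_state: str
    transition_history: List[Tuple[datetime, str, str]] = field(default_factory=list)
    last_successful_step: Optional[str] = None
    pending_step: Optional[str] = None
    retry_count: int = 0
    timeout_deadline: Optional[datetime] = None
    compensations_taken: List[str] = field(default_factory=list)

    def advance(self, new_state: str, completed_step: str,
                pending_step: Optional[str], at: datetime) -> None:
        """Record a transition and reset the retry counter for the next step."""
        self.transition_history.append((at, self.current_state, new_state))
        self.current_state = new_state
        self.last_successful_step = completed_step
        self.pending_step = pending_step
        self.retry_count = 0

    def timeline(self) -> List[str]:
        """Human-readable history for a support analyst."""
        return [f"{at.isoformat()} {old} -> {new}"
                for at, old, new in self.transition_history]


record = SagaRecord("saga-1", "784931", "ACCEPTED")
record.advance("INVENTORY_RESERVED", "reserve_inventory", "authorize_payment",
               datetime(2024, 1, 1, 12, 0))
print(record.timeline())
```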

Idempotency

Messages will be delivered more than once. Consumers must tolerate duplicates. Commands should carry unique identifiers; handlers should detect prior processing; side effects should be safe to repeat or explicitly guarded.

Without idempotency, retries become a source of corruption.
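
A minimal sketch of a duplicate-tolerant handler, assuming each command carries a unique command_id. ReservationHandler is a hypothetical name, and the in-memory set stands in for a processed-messages table that a real service would update in the same local transaction as the side effect.

```python
from typing import Dict, Set


class ReservationHandler:
    def __init__(self) -> None:
        self.processed: Set[str] = set()   # IDs of commands already applied
        self.reserved: Dict[str, int] = {}

    def handle(self, command_id: str, sku: str, quantity: int) -> bool:
        """Apply the reservation once; return False for a duplicate delivery."""
        if command_id in self.processed:
            return False
        self.reserved[sku] = self.reserved.get(sku, 0) + quantity
        self.processed.add(command_id)
        return True


handler = ReservationHandler()
first = handler.handle("cmd-1", "sku-42", 2)
second = handler.handle("cmd-1", "sku-42", 2)   # redelivery of the same command
print(first, second, handler.reserved)
```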

Timeouts and deadlines

Long-running processes require clocks. Inventory reservations expire. Payment authorizations have validity windows. Customer confirmation emails can be delayed, but compliance notifications may have legal deadlines. The saga must model these explicitly.
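
One simple way to model this is to schedule a timeout event whenever a step starts and to poll for the ones that have come due. The sketch below assumes a single process and an in-memory heap; DeadlineScheduler and the event names are hypothetical, and a real implementation would persist deadlines so they survive restarts.

```python
import heapq
from datetime import datetime, timedelta
from typing import List, Tuple


class DeadlineScheduler:
    def __init__(self) -> None:
        self._heap: List[Tuple[datetime, str, str]] = []

    def schedule(self, saga_id: str, timeout_event: str, due_at: datetime) -> None:
        heapq.heappush(self._heap, (due_at, saga_id, timeout_event))

    def pop_due(self, now: datetime) -> List[Tuple[str, str]]:
        """Return (saga_id, timeout_event) pairs whose deadline has passed."""
        due: List[Tuple[str, str]] = []
        while self._heap and self._heap[0][0] <= now:
            _, saga_id, event = heapq.heappop(self._heap)
            due.append((saga_id, event))
        return due


start = datetime(2024, 1, 1, 12, 0)
scheduler = DeadlineScheduler()
scheduler.schedule("saga-1", "InventoryReservationExpired", start + timedelta(minutes=15))
scheduler.schedule("saga-2", "PaymentAuthorizationExpired", start + timedelta(days=7))

due_now = scheduler.pop_due(start + timedelta(minutes=20))
print(due_now)
```

The timeout surfaces as a domain event fed back into the saga, so expiry is handled by the same transition logic as any other outcome.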

Human intervention

Some failures need a person. Fraud review, manual stock override, customer service resolution, and partner outage workarounds all happen in real enterprises. A mature saga platform supports suspended states, operator actions, and resumed flows. If your architecture assumes zero human intervention, it was designed for a demo.

Versioning

Process definitions evolve. New steps are added. Policies change by market. Existing in-flight sagas do not disappear just because a team deployed new code. Version the saga definition or transition logic carefully. Running instances must continue coherently under the rules they started with, or have a safe migration path.
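
A small sketch of the "continue under the rules they started with" option: each instance pins the definition version it was created with. The step names, including fraud_check, and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Optional

# Version 2 adds a fraud check; version 1 stays available for in-flight sagas.
DEFINITIONS: Dict[int, List[str]] = {
    1: ["reserve_inventory", "authorize_payment", "create_fulfillment_request"],
    2: ["reserve_inventory", "fraud_check", "authorize_payment", "create_fulfillment_request"],
}
CURRENT_VERSION = 2


def start_saga(saga_id: str) -> dict:
    # New instances pin whatever version is current at creation time.
    return {"saga_id": saga_id, "definition_version": CURRENT_VERSION, "completed": 0}


def next_step(saga: dict) -> Optional[str]:
    steps = DEFINITIONS[saga["definition_version"]]
    index = saga["completed"]
    return steps[index] if index < len(steps) else None


in_flight = {"saga_id": "s-1", "definition_version": 1, "completed": 1}
new_saga = start_saga("s-2")
new_saga["completed"] = 1

print(next_step(in_flight), next_step(new_saga))
```

Both sagas have finished their first step, yet they proceed differently: the older instance goes straight to payment, the newer one to the fraud check.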

Tradeoffs

Sagas solve one class of problem by embracing another.

They reduce the need for distributed transactions, but increase process complexity. They improve autonomy between services, but require stronger discipline in domain modeling. They scale operationally better than global locking, but demand better observability, retries, compensation design, and reconciliation.

There is no free lunch here. There is only a different bill.

Orchestration centralizes logic and visibility, but can create a dependency bottleneck if every process change goes through one team. Choreography supports looser coupling, but can dissolve into distributed guesswork. Asynchronous messaging improves resilience, but complicates user experience and state communication. Compensation is often possible, but never as clean as rollback.

The right question is not whether these tradeoffs exist. It is whether they fit the business.

Failure Modes

Sagas fail in predictable ways.

The fake rollback fantasy

Teams assume every action can be compensated. It cannot. Some actions are irreversible, expensive, or only partially reversible. Shipping, settlement, legal notifications, and third-party side effects all have real-world consequences. Model that honestly.

Anemic domain events

If events are just technical breadcrumbs, the process becomes fragile and unreadable. Step3Completed is not a domain concept. It is a cry for help.

Central orchestrator as a god object

A poorly designed orchestrator accumulates all business logic for all domains. Then every team depends on it, and bounded contexts become implementation details. The orchestrator should coordinate, not own everyone else’s model.

No reconciliation path

This is the classic enterprise mistake: elegant eventing, no cleanup. Drift accumulates until operations invents spreadsheets and manual fixes. At that point the real system is the spreadsheet.

Misunderstood ordering guarantees

Kafka preserves order within a partition, not across the universe. If the design quietly depends on global sequencing, it will fail under scale.
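
The usual way to work with that guarantee is to key every message by the saga's business identifier, so all events for one order land in one partition. The sketch below imitates key-based partitioning with a CRC32 hash purely for illustration; it is not the hash Kafka clients use, and partition_for is a hypothetical helper.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic: the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 6
partitions: DefaultDict[int, List[Tuple[str, str]]] = defaultdict(list)

events = [
    ("order-1", "OrderPlaced"),
    ("order-2", "OrderPlaced"),
    ("order-1", "InventoryReserved"),
    ("order-1", "PaymentAuthorized"),
]
for order_id, event in events:
    partitions[partition_for(order_id, NUM_PARTITIONS)].append((order_id, event))

# Per-order order is preserved inside the partition that owns the key.
order_1_partition = partitions[partition_for("order-1", NUM_PARTITIONS)]
order_1_events = [event for oid, event in order_1_partition if oid == "order-1"]
print(order_1_events)
```

Nothing here says anything about the relative order of order-1 and order-2 events, and nothing in Kafka does either once they sit in different partitions.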

Timeout confusion

A timeout does not mean failure. It means uncertainty. Good designs preserve that distinction and move into pending or review states when necessary.

When Not To Use

Sagas are not a universal answer.

Do not use them when a single transactional boundary is sufficient and likely to remain so. A well-structured modular monolith with one database can handle many business processes more simply and more reliably than a constellation of services tied together by wishful events.

Do not use them for short, tightly coupled operations where immediate consistency is mandatory and all data lives in one place. A local transaction is the right tool there.

Do not introduce sagas just because Kafka has arrived in the platform roadmap. Messaging infrastructure does not create a business need for distributed process management.

And do not use sagas where the business cannot define acceptable intermediate states or compensation policies. If the organization insists on pretending every cross-system action is instant and atomic, the architecture conversation is not mature enough yet.

Related Patterns

Several patterns commonly sit beside sagas.

Outbox pattern for reliable event publication alongside local state changes
Inbox pattern for idempotent message handling
Process manager as an implementation style for orchestrated sagas
Event sourcing where a full event history helps reconstruct process state, though it is not required
CQRS for exposing process read models to users and operators
TCC (Try-Confirm-Cancel) in cases where reservation semantics are stronger and participants support explicit provisional actions
Strangler fig pattern for incremental migration from a monolith or legacy workflow engine
Dead letter handling and replay for operational recovery in event-driven environments

These patterns are complements, not substitutes. You can have event-driven microservices without sagas. You can have a saga without event sourcing. The architecture should fit the process, not the conference talk.

Summary

Sagas are a disciplined way to model long-running business processes in distributed systems. They work because they align technology with a basic enterprise truth: meaningful work happens over time, across boundaries, under uncertainty.

The pattern is most effective when grounded in domain-driven design. Bounded contexts define local authority. Domain events express real business milestones. Compensation reflects business recovery, not technical fantasy. Reconciliation acknowledges that distributed systems drift and must be brought back into line.

Use orchestration when visibility, branching, and control matter. Use choreography carefully where flows are simpler and autonomy is high. Use Kafka as transport and history, not as a replacement for process thinking. Migrate progressively with a strangler approach, making the process explicit before distributing it. Build operational tooling as if failure is normal, because it is.

In the end, sagas are less about handling failure than about respecting reality. A long-running process is not a transaction stretched thin. It is a story with chapters, pauses, reversals, and consequences. Good architecture does not hide that story. It gives it structure.

Frequently Asked Questions

What is CQRS?

Command Query Responsibility Segregation separates read and write models. Commands mutate state; queries read from a separate optimised read model. This enables independent scaling of reads and writes and allows different consistency models for each side.

What is the Saga pattern?

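A Saga manages long-running transactions across multiple services without distributed ACID transactions. Each step is a local transaction that publishes an event or triggers the next command; if a step fails, compensating transactions undo the business effect of earlier steps where possible, rather than rolling them back in the database sense. Choreography-based sagas coordinate through events; orchestration-based sagas use a central coordinator.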

What is the outbox pattern?

The transactional outbox pattern solves the dual-write problem: a database update and a message publication must either both happen or neither. The service writes its state change and a row in an outbox table in a single local transaction; a relay process then reads the outbox and publishes to the message broker.