Distributed transactions are the kind of thing architects inherit the way coastal towns inherit storms. Nobody asks for them. They arrive with growth, with acquisitions, with teams splitting systems into services, with a well-meaning push toward “autonomy” that quietly dissolves the old guarantees of a single database commit.
In the monolith, life was vulgar but simple: open transaction, update records, commit, go home. In microservices, that trick stops working. The data you need lives in different stores, behind different APIs, owned by different teams with different release calendars and different tolerances for operational pain. The old answer—two-phase commit, that stern bureaucrat of distributed systems—promised order through coordination. In practice, it often delivered latency, lock contention, infrastructure coupling, and a system that could fail in ways both impressive and expensive.
So the modern question is not “How do I preserve the illusion of one ACID transaction across everything?” The better question is: How do I model business consistency in a world where technical consistency is inevitably fragmented?
That is where architects earn their keep.
This article walks through how to handle distributed transactions in microservices without 2PC, using domain-driven design, asynchronous messaging, compensation, reconciliation, and progressive migration patterns. The saga pattern is part of this story, but not the whole of it. The interesting architecture lives one level deeper—in the semantics of the business process, in the design of bounded contexts, in the operational discipline around Kafka and event-driven systems, and in the blunt acceptance that some things will fail halfway through.
Because they will.
Context
Microservices changed the shape of transaction boundaries.
Once each business capability had its own database, “save everything together” stopped being a technical primitive and became an architectural aspiration. Order management owns orders. Billing owns invoices. Inventory owns stock reservations. Shipping owns fulfillment. Customer service owns returns. Each context has its own model, its own lifecycle, and often its own storage technology. That is healthy. It is also the end of pretending that a single commit can span the enterprise.
This is where domain-driven design matters. Not as a ceremony. Not as a workshop with sticky notes that vanish into a Confluence page. But as a way to decide what must be consistent now, what may be consistent later, and which language belongs to which model.
If your “Order” means one thing in sales, another in warehouse operations, and another in finance, then your transaction problem is not only technical. It is semantic. You cannot coordinate what you have not defined.
Many distributed transaction failures are really modeling failures in disguise. Teams try to stitch together one giant cross-service workflow because the business process looks linear on a PowerPoint slide. But underneath, the actual business has stages, commitments, reservations, reversals, tolerances, and exceptions. A customer order is not “all or nothing” in the same way a database row update is. It may be accepted pending payment. Payment may be authorized but not captured. Stock may be reserved but later released. Shipment may be split. Refund may be partial.
That is not weakness. That is the domain speaking.
Problem
The core problem is straightforward:
How do multiple microservices complete a business transaction without a shared ACID transaction coordinator?
The traditional answer, two-phase commit (2PC), works by adding a coordinator that asks all participants to prepare and then commit. It sounds clean. Architects love clean diagrams. Operators do not love running them.
In a real enterprise microservices estate, 2PC introduces several pathologies:
- Tight runtime coupling between services and infrastructure
- Increased latency because every participant must synchronously coordinate
- Reduced availability due to blocking behavior
- Difficult failure recovery when the coordinator or participant crashes mid-flight
- Poor fit with Kafka, event streaming, NoSQL databases, cloud-native platforms, and polyglot persistence
- Organizational friction because independent teams must conform to one transaction protocol
And perhaps the biggest problem: 2PC treats business operations like low-level resource locking.
That is often the wrong abstraction.
A distributed business transaction is rarely about “commit every row or roll back every row.” It is usually about moving through a sequence of meaningful domain states while preserving business invariants strongly where needed and eventually where acceptable.
The difference matters.
Forces
Architectural decisions here are shaped by competing forces, and pretending otherwise leads to brittle designs.
1. Business consistency vs technical consistency
Some invariants are non-negotiable. You do not want to charge twice. You do not want to ship an order that was cancelled. You do not want to promise inventory you never had.
But not every invariant needs synchronous coordination. “Order visible in customer history within 200 milliseconds” is usually not the same class of requirement as “payment authorization must exist before shipment release.”
2. Autonomy vs coordination
Microservices promise team autonomy. Distributed transactions demand coordination. The more strongly you coordinate at runtime, the less autonomy you really have. The architecture must choose where coupling belongs: at design time, at event contract boundaries, or at runtime through synchronous orchestration.
3. Availability vs immediacy
If a downstream service is unavailable, should the whole business process stop? Sometimes yes. Often no. Mature enterprises value graceful degradation. That means accepting eventual consistency and designing ways to reconcile.
4. Throughput vs control
Kafka, event-driven architecture, and asynchronous processing can produce massive throughput and resilience. They also make process visibility harder. You trade immediate certainty for decoupling and scalability.
5. Simplicity vs correctness over time
A synchronous HTTP call chain is easy to draw and easy to demo. It is also a fine way to create cascading failures. An event-driven transactional workflow is harder to reason about initially, but often more robust at scale.
6. Domain truth vs integration fiction
The hardest force is semantic. If your architecture forces all services to share one transaction notion, you flatten the domain into a lie. Good distributed transaction design respects bounded contexts. One service confirms an order. Another authorizes funds. Another reserves stock. Those are related actions, not one magical operation.
Solution
The practical alternative to 2PC is not one pattern but a family of patterns built around local transactions, reliable messaging, compensation, and reconciliation.
The headline concept is familiar: Saga.
But calling it “use saga” is like answering “How should we build a city?” with “roads.” True, but not enough.
A robust architecture for distributed transactions without 2PC usually combines:
- Local ACID transaction within each service
- Transactional outbox to publish events reliably after local commit
- Kafka or similar event broker for asynchronous propagation
- Process coordination, either orchestration or choreography
- Compensating actions where business reversal is possible
- Idempotent consumers to handle duplicate delivery
- State machine thinking for long-running business processes
- Reconciliation mechanisms for drift, missed messages, and partial failure
- Domain-specific invariants split into immediate and eventual guarantees
That is the real shape of the solution.
The key idea
Each service performs its own local transaction and emits a domain event that reflects what happened in its bounded context. Other services react and perform their own local transactions. If something fails later, the system does not “roll back” globally. It moves forward with compensating actions or marks the process for manual or automated reconciliation.
This is not a technical compromise. It is a business process architecture.
Why this works
Because most enterprise workflows are naturally long-running and stateful already. The system merely stops pretending otherwise.
An order pipeline, claims process, loan approval, telecom provisioning flow, or insurance endorsement process is not a single indivisible action in the business. It is a sequence of commitments under uncertainty. Distributed transaction architecture should reflect that.
Architecture
There are two main saga styles: orchestration and choreography. Both can work. Both can fail spectacularly if used carelessly.
Orchestrated flow
An orchestrator explicitly drives the process: create order, request payment authorization, request inventory reservation, then confirm or compensate.
Orchestration gives you visibility and control. It is usually the better fit when the workflow is important, regulated, long-running, or difficult to reason about through pure event reactions.
My bias is simple: if the business process has a name, KPIs, and operational dashboards, it probably deserves orchestration.
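To make the orchestrated shape concrete, here is a minimal sketch of a saga runner: execute steps in order, and on failure compensate the completed steps in reverse. The step names and stub actions are illustrative, not from any particular workflow engine, and a real orchestrator would also persist saga state between steps.

```python
# Minimal orchestration sketch. Steps run forward; on failure, completed
# steps are compensated in reverse order. Names are illustrative.

class SagaStep:
    def __init__(self, name, action, compensation=None):
        self.name = name
        self.action = action              # callable performing the step
        self.compensation = compensation  # callable undoing it, if reversible

def run_saga(steps):
    completed = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception as failure:
            # Compensate in reverse order. Steps without a compensation
            # are irreversible and must be escalated, not "rolled back".
            for done in reversed(completed):
                if done.compensation:
                    done.compensation()
            return ("COMPENSATED", step.name, str(failure))
    return ("CONFIRMED", None, None)

# Usage with stub actions standing in for real service calls:
log = []

def reserve_inventory():
    raise RuntimeError("no stock at preferred warehouse")

steps = [
    SagaStep("authorize_payment",
             action=lambda: log.append("auth"),
             compensation=lambda: log.append("void_auth")),
    SagaStep("reserve_inventory",
             action=reserve_inventory,
             compensation=lambda: log.append("release_reservation")),
]
status, failed_at, reason = run_saga(steps)
```

The payoff of this shape is that the process state (`status`, `failed_at`) lives in one place, which is exactly the visibility argument for orchestration.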
Choreographed flow
In choreography, services react to events and trigger the next step themselves. There is no central coordinator.
Choreography is more decoupled and can feel more “microservice-pure.” It also tends to produce invisible process logic spread across multiple services. Over time, this becomes archaeology. Teams discover business flow by reading code, topic subscriptions, and tribal memory.
That is not architecture. That is folklore.
Transactional outbox
Without a transactional outbox, your event-driven transaction design is built on sand.
The local database change and the message publication must be tied together, not as a distributed transaction, but by writing the event to an outbox table in the same local transaction as the business state change. A relay then publishes that event to Kafka.
This matters because there are only two truly dangerous milliseconds in distributed systems: the ones after your database commits and before your event is published. That tiny gap can bankrupt correctness.
The outbox pattern closes it.
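A sketch of the pattern, with SQLite standing in for the service's own database and a list standing in for the broker. The table and event names are illustrative; in production the relay would be something like a polling publisher or a CDC tool reading the outbox table.

```python
import json
import sqlite3

# Transactional outbox sketch: the business row and the outbox row are
# written in ONE local transaction, so there is no gap between "state
# committed" and "event recorded".
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)""")

def place_order(order_id):
    with db:  # sqlite3 connection context: commit both inserts or neither
        db.execute("INSERT INTO orders VALUES (?, 'PENDING_VALIDATION')",
                   (order_id,))
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                   ("OrderPlaced", json.dumps({"orderId": order_id})))

def relay_unpublished():
    # A separate relay reads committed outbox rows, hands them to the
    # broker (here just a list), and marks them published.
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0"
    ).fetchall()
    delivered = []
    for row_id, event_type, payload in rows:
        delivered.append((event_type, json.loads(payload)))  # e.g. Kafka produce
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
    return delivered

place_order("ord-42")
events = relay_unpublished()
```

Note the delivery guarantee this gives you is at-least-once, not exactly-once: if the relay crashes between producing and marking the row published, the event is sent again, which is one more reason consumers must be idempotent.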
Domain semantics and state modeling
Domain-driven design provides the language for these transactions. The process should not be modeled as “Step 1, Step 2, Step 3” detached from meaning. It should be modeled in domain terms:
- Order Submitted
- Payment Authorized
- Inventory Reserved
- Fulfillment Released
- Order Cancelled
- Refund Issued
Those states are not cosmetic. They are the control surface for distributed consistency.
You should classify transitions into:
- Irreversible actions: shipment dispatched, external bank transfer settled
- Compensatable actions: stock reservation release, payment authorization void
- Retryable actions: transient broker failure, timeout to downstream dependency
- Reconcilable states: order pending too long, payment captured with missing order confirmation
This is where good architecture departs from simplistic saga tutorials. Compensation is not rollback. A refund is not the inverse of a payment in accounting terms. Releasing stock after a failed payment is not equivalent to never reserving it. The domain semantics are asymmetric, and your model must respect that.
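One way to make that classification operational is a small dispatch table that maps each transition to its class and picks a handling strategy accordingly, rather than attempting one uniform "rollback". The transition names and strategies below are illustrative assumptions, not a framework API.

```python
# Illustrative classification of saga transitions. The failure handler
# selects a strategy by class instead of pretending everything can be
# rolled back.
IRREVERSIBLE = "irreversible"
COMPENSATABLE = "compensatable"
RETRYABLE = "retryable"
RECONCILABLE = "reconcilable"

TRANSITIONS = {
    "shipment_dispatched":    IRREVERSIBLE,
    "payment_authorized":     COMPENSATABLE,  # void the authorization
    "inventory_reserved":     COMPENSATABLE,  # release the reservation
    "broker_publish_timeout": RETRYABLE,
    "order_pending_past_sla": RECONCILABLE,
}

def handle_failure(action):
    kind = TRANSITIONS[action]
    if kind == RETRYABLE:
        return "retry_with_backoff"
    if kind == COMPENSATABLE:
        return "run_compensating_action"
    if kind == RECONCILABLE:
        return "flag_for_reconciliation"
    # Irreversible: the outside world already changed, so the answer is
    # a corrective business workflow, not a pretend rollback.
    return "escalate_to_corrective_workflow"
```

The value of writing this down explicitly is that the asymmetry becomes reviewable: the team can argue about whether "payment captured" really belongs in the compensatable bucket before an incident forces the question.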
Migration Strategy
Most enterprises are not starting from a greenfield Kafka-based microservices landscape. They are dragging decades of transaction assumptions out of monoliths, ESBs, shared databases, and brittle integration layers. So the migration question is not optional.
The right migration strategy is usually progressive strangler, not replacement.
Start by finding the real transaction boundary
In the monolith, one database transaction may wrap several business capabilities. That does not mean it should continue to do so. Break it apart by bounded context, not by table ownership alone.
For example, an old ERP module may update order, stock, invoice, and shipment tables in one transaction. In the target architecture, that likely becomes separate services with event-based coordination. But before splitting code, split meaning:
- Which invariants must stay synchronous?
- Which actions are commitments versus requests?
- Which failures can be compensated?
- Which require human review?
Introduce domain events inside the monolith first
A smart migration often begins by creating explicit domain events and process state inside the existing system before any physical service split. This is less glamorous than announcing “we now have microservices,” but far more effective.
The monolith can write to an outbox and publish OrderCreated, PaymentRequested, StockReserved events. This surfaces process semantics early and lets downstream capabilities be strangled out one by one.
Use anti-corruption layers
Legacy systems rarely speak in domain language cleanly. Use anti-corruption layers to translate between old schemas, integration protocols, and new bounded contexts. Otherwise, the legacy model leaks everywhere and poisons the new architecture.
Migrate in slices of business capability
A progressive strangler migration may look like this:
- Monolith still owns end-to-end order fulfillment.
- Introduce outbox and Kafka topics for order lifecycle events.
- Extract Payment Service behind anti-corruption layer.
- Extract Inventory reservation next.
- Move process coordination to orchestrator or workflow service.
- Retire monolith transaction path gradually.
- Add reconciliation and operational dashboards before full cutover.
The sequence matters. Teams often extract services before they have operational observability, idempotency, or replay strategy. That is how distributed transaction design turns into distributed confusion.
Enterprise Example
Consider a global retailer processing online orders across regions, warehouses, and payment providers.
In the old architecture, a central commerce platform ran a single relational database transaction to create the order, decrement available stock, register a payment instruction, and trigger fulfillment. This worked until scale, region-specific payment gateways, and warehouse autonomy made the model unmanageable. The system became slow during peaks, impossible to evolve independently, and deeply coupled to a primary database cluster nobody wanted to touch.
The target architecture introduced these bounded contexts:
- Order Management
- Payments
- Inventory
- Fulfillment
- Customer Notification
Kafka became the event backbone. Each service owned its state and published domain events through a transactional outbox.
The distributed business transaction looked roughly like this:
- Order Management accepts the customer order and sets status to PENDING_VALIDATION.
- It emits OrderPlaced.
- Payments consumes OrderPlaced and requests authorization from regional PSPs.
- If successful, Payments emits PaymentAuthorized; otherwise PaymentDeclined.
- Inventory consumes PaymentAuthorized and attempts reservation at the optimal warehouse.
- If reservation succeeds, Inventory emits InventoryReserved.
- Fulfillment consumes InventoryReserved and creates pick/pack work.
- Order Management consumes the relevant events and marks the order CONFIRMED.
- If any step fails, compensating actions begin:
- void payment authorization if inventory cannot be reserved
- release reservation if downstream fulfillment rejects
- notify customer and customer service systems
What made this architecture successful was not just the event flow. It was the discipline around the awkward cases.
Reconciliation was first-class
A nightly and near-real-time reconciliation process compared:
- authorized payments with missing order confirmation
- reserved stock with cancelled orders
- shipment requests without confirmed payment capture
- stale PENDING_VALIDATION orders older than SLA
This mattered because no matter how clean your event architecture is, brokers have incidents, consumers get stuck, partner APIs time out, and teams deploy bugs on Fridays despite every warning ever written.
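The first of those checks reduces to a set difference between two services' views of the world. A sketch, with illustrative in-memory data standing in for queries against the Payments and Order Management stores:

```python
# Reconciliation sketch: payments that were authorized but have no
# confirmed order. Data is illustrative; in practice each set comes from
# a query against that service's own store or read model.
authorized_payments = {"ord-1", "ord-2", "ord-3"}  # from Payments
confirmed_orders = {"ord-1", "ord-3"}              # from Order Management

def find_orphaned_authorizations(payments, orders):
    # Drift: money is on hold but the order never confirmed. Each orphan
    # needs a void or an operator decision, not silent accumulation.
    return sorted(payments - orders)

orphans = find_orphaned_authorizations(authorized_payments, confirmed_orders)
```

The other checks in the list follow the same shape with different source sets, which is why reconciliation is cheap to extend once the first comparison exists.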
Business semantics drove compensation
If payment was only authorized, it could be voided.
If payment was captured, a refund flow was required.
If pick-pack had begun, inventory release had different semantics.
If shipment label creation succeeded but carrier handoff had not happened, the process could still cancel without customer-facing shipment.
That is not “error handling.” That is the domain.
They did not use choreography everywhere
For low-value notifications, choreography was enough. For the core order-to-cash process, they introduced an orchestration layer because operations needed one place to see where an order was stuck and why. Once customer support asks “where is this order in the process?”, ideology about decentralized purity evaporates quickly.
Operational Considerations
The architecture is only half the story. Distributed transactions fail operationally long before they fail conceptually.
Idempotency is mandatory
Kafka can deliver duplicates. Consumers can crash after processing but before committing offsets. Producers can retry. If your handlers are not idempotent, “event-driven consistency” quickly turns into “event-driven overcharging.”
Use idempotency keys, deduplication tables, aggregate version checks, or naturally idempotent commands where possible.
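A deduplication table is the most broadly applicable of those options. The sketch below uses SQLite and a UNIQUE constraint on the event ID to turn at-least-once delivery into effectively-once processing; table names and the event shape are illustrative.

```python
import sqlite3

# Idempotent consumer sketch: the dedup check and the business effect
# share one local transaction, so a redelivered event is skipped cleanly.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE charges (order_id TEXT, amount INTEGER)")

def handle_payment_event(event):
    with db:  # commit both inserts together, or neither
        try:
            # PRIMARY KEY makes this insert fail on a duplicate event ID.
            db.execute("INSERT INTO processed_events VALUES (?)",
                       (event["id"],))
        except sqlite3.IntegrityError:
            return "duplicate_skipped"
        db.execute("INSERT INTO charges VALUES (?, ?)",
                   (event["orderId"], event["amount"]))
        return "processed"

event = {"id": "evt-7", "orderId": "ord-42", "amount": 1999}
first = handle_payment_event(event)
second = handle_payment_event(event)  # broker redelivery of the same event
charge_count = db.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

The key design point is that the dedup insert and the charge insert commit atomically: a crash between them cannot leave the event marked processed without its effect.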
Ordering is local, not global
Architects often assume event order as if the universe is considerate. It is not. Kafka gives ordering within a partition, not across topics or services. Design aggregates and partitions so the required order is preserved for the business entity that matters—usually by order ID, account ID, or reservation ID.
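The mechanism behind that advice: the producer keys every event with the business entity's ID, and the partitioner maps equal keys to the same partition. The sketch below uses SHA-256 for illustration; Kafka's default partitioner actually uses murmur2 over the key bytes, but the principle is identical.

```python
import hashlib

# Partition-keying sketch (illustrative hash, not Kafka's murmur2):
# equal keys always map to the same partition, so all events for one
# order preserve their relative order.
NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every lifecycle event for ord-42 is produced with the same key, so
# they all land in one partition, in order:
lifecycle = ["OrderPlaced", "PaymentAuthorized", "InventoryReserved"]
partitions_used = {partition_for("ord-42") for _ in lifecycle}
```

The corollary is that events keyed by different entities (two different orders) carry no ordering guarantee relative to each other, and the design must not depend on one.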
Timeouts and stuck processes
A distributed transaction can stall indefinitely unless explicitly modeled otherwise. Every long-running process should have:
- expected SLA
- timeout handling
- dead-letter strategy
- escalation path
- manual intervention tooling
There should be a visible difference between “waiting normally” and “stuck.”
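That difference can be made mechanical with a per-state SLA. A sketch, with illustrative states and thresholds:

```python
from datetime import datetime, timedelta, timezone

# "Waiting normally" vs "stuck": compare time-in-state against a
# per-state SLA. States and thresholds are illustrative.
STATE_SLA = {
    "PENDING_VALIDATION": timedelta(minutes=5),
    "AWAITING_PAYMENT":   timedelta(hours=1),
    "AWAITING_SHIPMENT":  timedelta(days=2),
}

def classify(state, entered_at, now=None):
    now = now or datetime.now(timezone.utc)
    age = now - entered_at
    return "stuck" if age > STATE_SLA[state] else "waiting"

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
fresh = classify("PENDING_VALIDATION", now - timedelta(minutes=2), now)
stale = classify("PENDING_VALIDATION", now - timedelta(minutes=30), now)
```

"Stuck" then becomes an alertable, dashboardable condition per state rather than a judgment call made during an incident.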
Observability must follow the business process
Logs are not enough. Metrics are not enough. Traces are useful but still not enough. You need process observability:
- number of orders in each state
- age distribution per state
- compensation rate
- reconciliation backlog
- poison message trends
- downstream dependency failure rates
A healthy system can answer: How many business transactions are incomplete, why, and since when?
Replays are power tools
Kafka replay is valuable. It is also dangerous. Reprocessing old events without idempotency, versioning discipline, or bounded replay strategy can create a second outage while fixing the first.
Schema evolution
Domain events are contracts. Treat them with the same seriousness as public APIs. Use versioning strategies that allow consumers to evolve independently. Avoid leaking internal data models into event payloads.
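One consumer-side discipline that makes independent evolution possible is the tolerant reader: read only the fields you need, default fields added in later versions, and ignore unknown ones. The payload shapes below are illustrative assumptions.

```python
# Tolerant-reader sketch for event schema evolution. The consumer maps
# the payload into its own shape instead of binding to the producer's
# full schema; field names here are illustrative.
def read_order_placed(payload: dict) -> dict:
    return {
        "order_id": payload["orderId"],              # required in every version
        "currency": payload.get("currency", "USD"),  # added in v2, defaulted
        # any other producer-side fields are simply ignored
    }

v1 = {"orderId": "ord-1"}
v2 = {"orderId": "ord-2", "currency": "EUR", "internalTraceId": "t-9"}
a = read_order_placed(v1)
b = read_order_placed(v2)
```

This only works for additive changes; renaming or removing a required field is a breaking change and needs an explicit versioned topic or a migration window, exactly as it would for a public API.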
Tradeoffs
No serious architecture choice comes free.
What you gain
- Better scalability and service autonomy
- Better fit for cloud-native and polyglot persistence
- Higher availability under partial failure
- More natural mapping to long-running business workflows
- Decoupling from XA/2PC-capable infrastructure
- Better resilience when built with outbox, retries, and reconciliation
What you pay
- More design complexity
- Eventual consistency that product and business teams must understand
- Operational overhead in observability and reconciliation
- Harder debugging across service boundaries
- More demanding domain modeling
- Compensations that are often messy and asymmetric
This is the fundamental trade: you remove hidden infrastructure complexity by accepting explicit business process complexity.
That is usually a good trade. Usually.
Failure Modes
Distributed transactions without 2PC do not fail in one neat way. They fail as a family.
Lost event after local commit
If you update state but fail to publish the event, downstream services never react. This is why the outbox pattern is so important.
Duplicate event processing
A consumer processes the same event twice, causing double reservation, duplicate emails, or repeated charges. Idempotency is the antidote.
Compensation failure
The initial action succeeds, the downstream step fails, and then the compensating action also fails. This is common and under-discussed. Your architecture must support retries, alerting, and potentially manual intervention.
Semantic mismatch
A team defines OrderConfirmed as “payment authorized,” another interprets it as “ready to ship.” This is a classic bounded context failure. Ubiquitous language is not a luxury.
Choreography spaghetti
Too many services react to too many events, and nobody can explain the process anymore. The architecture becomes a haunted house of subscriptions.
Reconciliation drift
Small inconsistencies accumulate faster than reconciliation can clear them. This usually means the design underestimates the volume or class of failures.
Orchestrator as a central bottleneck
If orchestration is overused, badly implemented, or made too smart, it becomes a distributed monolith in nicer clothing.
Human process gap
Some failures need human decisions—approve refund, re-route inventory, contact customer. If the architecture has no operational workflow for this, incidents linger in limbo.
When Not To Use
This approach is powerful. It is not universal.
Do not use it when a modular monolith will do
If your teams are small, the domain is tightly coupled, and one transactional database remains practical, splitting into microservices just to earn distributed transaction problems is architectural self-harm.
Do not use eventual consistency where hard synchronous consistency is truly required
Some domains—certain financial ledger postings, specific securities operations, limited high-integrity inventory contexts—may still require stronger transactional guarantees than sagas and compensations can comfortably provide.
Even then, be precise. The answer may be to keep a narrower consistency boundary inside one service, not to drag 2PC across the whole estate.
Do not compensate what cannot be meaningfully reversed
If the external world has already changed irreversibly—money settled, package delivered, legal filing submitted—then “rollback” is fantasy. You need explicit corrective workflows, not simplistic compensation.
Do not choose choreography for complex regulated workflows with poor visibility
If auditability, control, and operator insight are central, an orchestrated process manager is usually the safer choice.
Do not use Kafka as a magic wand
Kafka is a backbone, not absolution. It does not solve bad domain boundaries, unclear semantics, or missing operational discipline.
Related Patterns
Distributed transactions without 2PC sit inside a broader toolbox.
- Transactional Outbox: reliable event publication after local commit
- Inbox Pattern: deduplicate and track consumed messages
- Process Manager / Orchestrator: explicit coordination of long-running workflow
- Event Sourcing: useful in some domains, but not required
- CQRS: often paired with event-driven flows for read model separation
- TCC (Try-Confirm-Cancel): useful where resources support explicit provisional reservation
- Reservation Pattern: inventory, capacity, credit hold
- Reconciliation Batch / Streaming Reconciliation: detection and repair of drift
- Strangler Fig Pattern: progressive migration from monolith to microservices
- Anti-Corruption Layer: isolate legacy semantics during migration
A useful mental model is this: saga is the visible choreography or orchestration of state transitions, while outbox, idempotency, reconciliation, and process observability are the plumbing that keeps the house standing.
Summary
The old dream of a single distributed commit across microservices is seductive because it seems to preserve certainty. But certainty bought through 2PC often comes at the price of autonomy, availability, and operational sanity. In modern enterprise systems—especially those built around Kafka, independently deployable microservices, and domain-owned data—the better path is usually not stronger coordination. It is better modeling.
Model the business transaction as a sequence of meaningful domain states. Keep ACID local to each bounded context. Publish events reliably with a transactional outbox. Use orchestration where visibility matters and choreography where the flow is simple and local. Design compensations as real business actions, not pretend rollbacks. And assume from day one that reconciliation will be needed, because real systems drift.
That last part is worth remembering.
A distributed transaction architecture is not defined by how elegantly it succeeds. It is defined by how honestly it fails.
If you design for the awkward middle—partial completion, duplicate messages, stuck workflows, semantic disagreements, legacy coexistence—you can build microservices that behave like an enterprise rather than a demo.
And if you cannot tolerate those realities, then the answer may not be a cleverer saga. It may be a smaller consistency boundary, a modular monolith, or simply the courage to say no to needless distribution.
That, too, is architecture.
Frequently Asked Questions
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.