Microservices are supposed to buy us freedom. Teams move faster. Deployments get smaller. Boundaries become clearer. In the sales pitch, a service is a tidy thing: it owns its data, exposes a contract, and minds its own business.
Then the real enterprise arrives.
Billing needs customer status from CRM. CRM needs payment standing from Billing. Order Management cannot confirm shipment until Inventory reserves stock, but Inventory cannot reserve stock until Order Management finalizes the order. Risk wants the latest account exposure before approval, but exposure depends on pending transactions still flowing through the workflow engine. The architecture that began as a constellation quietly turns into a knot.
And that knot has a name: the service dependency cycle.
This is one of the most common, least honestly discussed failure modes in microservices architecture. Teams spend months decomposing a monolith, only to recreate the same coupling through network calls, asynchronous topics, shared schemas, and “temporary” orchestration logic. They replace compile-time dependency with runtime dependency. The shape changes. The pain does not.
A circular dependency graph in a microservices estate is not merely a diagramming problem. It is a semantic problem. It tells you the business capabilities have been split in ways the domain does not support, or the workflow has been designed with mutual knowledge where there should be clear ownership, or the organization has let convenience outrun discipline. Usually, it is all three.
The uncomfortable truth is simple: cycles are often architecture’s way of saying your bounded contexts are lying.
This article looks at service dependency cycles from an enterprise architecture point of view: what they are, why they happen, how to detect them, how to break them, and when the cure is worse than the disease. We will lean on domain-driven design, event-driven architecture, and progressive strangler migration. We will also talk about reconciliation, because any architect who discusses distributed state without reconciliation is writing fiction.
Context
In a healthy microservices architecture, dependencies form a graph that mostly points in one direction. There may be layers, or capability-aligned domains, or event flows moving outward from systems of record to downstream consumers. But the graph should permit independent change. If Service A needs Service B, and Service B needs Service A in order to complete its own work, independence is already compromised.
Some teams only think of circular dependency in synchronous terms:
- Service A calls Service B
- Service B calls Service A
That is the obvious version, and often the least subtle.
In enterprise systems, cycles show up in several forms:
- Synchronous API cycles: REST or gRPC services directly calling one another.
- Asynchronous process cycles: Service A emits an event consumed by B, which emits an event consumed by A, where both depend on the other to reach a terminal business state.
- Data dependency cycles: Two services maintain local truth but must continuously read or replicate each other’s state to make decisions.
- Workflow cycles: A process engine or saga repeatedly loops between services because ownership of decisions is unclear.
- Schema and contract cycles: Teams version APIs or events together because changes cannot be made independently.
The presence of Kafka does not rescue you from this. Kafka can decouple transport. It does not decouple semantics. A circular dependency graph built on topics is still circular. It just fails more politely.
This matters because the promise of microservices rests on one thing more than any other: autonomous evolution. If a service cannot reason, deploy, or recover without another service in the loop, autonomy is theater.
Problem
A dependency cycle means no participant can progress cleanly without the others. That creates a set of familiar symptoms:
- cascading latency
- deadlocks in business workflow
- deployment coordination across teams
- brittle retries and duplicate processing
- distributed transactions by stealth
- change paralysis
The deeper problem is not technical. It is conceptual.
When two services mutually depend on one another to determine truth, one of them does not actually own its domain concept. Or both own overlapping slices of the same concept. This is why cycles so often emerge around nouns the business treats as singular:
- Customer
- Order
- Account
- Payment
- Inventory
- Policy
- Claim
The business says “customer” as if it were one thing. The enterprise landscape turns it into six services and fourteen interpretations. Then each service starts asking the others what a customer “really” is. That is not a service problem. That is a domain semantics problem.
A microservices architecture becomes unstable when it splits a concept before it splits the language.
Typical circular dependency graph
This kind of graph looks manageable on a slide. In production, it means one degraded service can stall an entire value stream.
Forces
Architects do not create cycles because they are foolish. They create them because the forces are real.
1. Natural business processes are cross-functional
Most meaningful business processes span domains. Order-to-cash touches sales, pricing, inventory, fraud, billing, shipping, tax, and customer service. If services are aligned to business capabilities, there will be interactions. The question is not whether they interact. The question is whether the interaction preserves clear ownership.
2. Teams optimize locally
A team under pressure will call another service rather than redesign a workflow. It feels pragmatic. “We just need one endpoint.” Enterprises are built out of these sentences.
3. Bounded contexts are drawn too mechanically
Many decomposition efforts carve services from the old application structure, or from data entities, or from org charts. That creates services around tables instead of business decisions. A service that owns records but not behavior inevitably reaches outward to ask others how to behave.
4. Read-time composition is seductive
Fetching the latest information on demand sounds safe. No stale data, no duplication, no events to reconcile. Until ten services join the request path and your user journey depends on all of them being healthy.
5. Event-driven architecture can hide semantic coupling
Teams often move to Kafka and assume they are decoupled because nobody is making HTTP calls. But if Service A cannot decide without an event from Service B, and Service B cannot emit that event until it hears from A, they have simply moved the cycle into time.
6. The enterprise demands consistency where the domain can tolerate delay
Not every business rule needs synchronous certainty. But stakeholders often ask for it by default. Architects who do not challenge this create systems that pursue immediate consistency at the cost of resilience.
7. Legacy migration introduces transitional duplication
During a strangler migration, new services often depend on the monolith, while the monolith is also retrofitted to depend on those new services. This is one of the fastest ways to create a cycle that nobody intended.
Solution
The solution is not “ban all dependencies.” That is cartoon architecture. Real systems interact. The goal is to eliminate mutual dependency in decision-making.
The best way to think about it is this:
> A service may inform another service. It should not complete the other service’s identity.
To break cycles, you usually need some combination of the following.
Establish sharper bounded contexts
Start with domain-driven design, not API design. Ask:
- What business decision is made here?
- Who owns the language for that decision?
- Which service is the system of record for that fact?
- Which facts are reference data versus authoritative state?
- Which decisions require freshness, and which can tolerate propagation delay?
A bounded context should own behavior and meaning, not just persistence. If two services both decide whether an order is valid, your problem is already visible.
Prefer upstream ownership and downstream consumption
A clean dependency structure often has an upstream service publishing facts and downstream services deriving behavior from them. The upstream service should not query all consumers in order to know what it means.
For example:
- Customer service publishes CustomerSuspended
- Order service consumes that event and prevents new orders
- Customer service does not ask Order service whether the customer is suspended
That sounds obvious, yet many architectures do the reverse.
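A minimal sketch of that direction of dependency, assuming an in-memory event hand-off (the CustomerSuspended name comes from the example above; OrderService and its fields are illustrative, not from any specific codebase):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerSuspended:
    """Domain fact published by the Customer service, the owner of customer standing."""
    customer_id: str


class OrderService:
    """Downstream consumer: derives local behavior from published facts.

    It never calls back into the Customer service to ask about standing.
    """

    def __init__(self):
        self._suspended = set()

    def on_customer_suspended(self, event: CustomerSuspended) -> None:
        # Record the fact locally when the event arrives.
        self._suspended.add(event.customer_id)

    def place_order(self, customer_id: str) -> bool:
        # Local decision based on a replicated fact -- no synchronous lookup upstream.
        return customer_id not in self._suspended


orders = OrderService()
orders.on_customer_suspended(CustomerSuspended("c-42"))
```

The arrow points one way: Customer publishes, Order reacts. Nothing in Customer’s code knows that Order exists.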
Replace request-time dependency with replicated domain facts
If a service needs another service’s state only to enforce a local rule, it may be better to replicate the needed facts through events and keep a local projection. This is one of the main uses of Kafka in enterprise microservices: not just messaging, but controlled distribution of business facts.
You trade freshness for autonomy. Often that is a good trade.
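As a sketch of what such a projection looks like: in production the stream would be a Kafka topic, but here it is a plain list of records, and the event fields are illustrative. The key property is that applying the stream is an upsert, so a full ordered replay always rebuilds the same projection.

```python
def apply(projection: dict, record: dict) -> dict:
    """Upsert-style application of one published fact to a local projection.

    Because application is an upsert, replaying the full ordered stream
    always yields the same projection -- which is what makes
    rebuild-from-topic safe.
    """
    entry = projection.setdefault(record["customer_id"], {})
    entry.update(record["facts"])
    return projection


# Stand-in for a Kafka topic of published customer facts.
topic = [
    {"customer_id": "c-1", "facts": {"standing": "GOOD", "credit_band": "A"}},
    {"customer_id": "c-1", "facts": {"standing": "SUSPENDED"}},
    {"customer_id": "c-2", "facts": {"standing": "GOOD", "credit_band": "B"}},
]

projection = {}
for record in topic:
    apply(projection, record)
```

Note how deliberately small the projection is: two facts per customer, nothing more. That restraint is what keeps replication from turning into data hoarding.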
Introduce process orchestration only where ownership is cross-domain
If a workflow truly spans multiple bounded contexts and no single domain should own it, use an orchestrator or saga coordinator. But be disciplined: orchestration should coordinate outcomes, not absorb domain logic that rightly belongs inside services.
A saga can prevent direct cycles, but it can also become a dumping ground for indecision.
Separate commands from queries semantically
Many cycles happen because one service asks another both to do something and to explain current truth. CQRS-style separation helps: commands go to the owner; queries use projections designed for consumers. The model used for operational decisioning should not necessarily be the one used for broad read composition.
Accept eventual consistency and design reconciliation
This is where grown-up architecture starts. Once you remove synchronous cycles, you accept that state may diverge temporarily. So you need reconciliation:
- periodic comparison of source-of-truth and local projections
- replay from Kafka topics
- idempotent consumers
- compensating actions for missed events
- business-visible exception queues
A distributed system without reconciliation is just optimism in code.
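The comparison step can be sketched simply. This is an illustrative shape, not a production job: it diffs a source-of-truth snapshot against a local projection and emits discrepancies that would feed an exception queue.

```python
def reconcile(source: dict, projection: dict) -> list:
    """Compare a source-of-truth snapshot with a local projection.

    Returns discrepancy records suitable for routing to an exception
    queue or operator dashboard. Issue labels are illustrative.
    """
    discrepancies = []
    for key, truth in source.items():
        local = projection.get(key)
        if local is None:
            discrepancies.append({"key": key, "issue": "missing_in_projection"})
        elif local != truth:
            discrepancies.append({"key": key, "issue": "mismatch",
                                  "expected": truth, "actual": local})
    # Entries the projection holds that the source no longer knows about.
    for key in projection.keys() - source.keys():
        discrepancies.append({"key": key, "issue": "orphan_in_projection"})
    return discrepancies
```

Whether a discrepancy triggers automatic replay or a human workflow is a business decision; the point is that the comparison exists and runs on a schedule.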
Architecture
A practical target architecture aims for directional flow, explicit ownership, and resilient state propagation.
Before: cyclic service collaboration
This architecture has all the usual smells:
- synchronous mutual lookups
- behavior split across services
- no clear owner for validation rules
- request paths that grow with every “small” enhancement
After: directional ownership with events and local projections
There are still dependencies here. That is fine. But the dependencies are more directional:
- Order initiates a business process.
- Inventory owns stock reservation.
- Payment owns authorization.
- Customer publishes customer standing as a domain fact.
- Services maintain local projections of needed external facts.
The subtle but important point: Inventory does not call Order to decide whether stock exists. Payment does not call Customer in the middle of every authorization if customer status can be projected locally and refreshed asynchronously.
Domain semantics matter
Suppose the business says an order is “confirmed” only when payment is authorized and inventory is reserved. Where should that truth live?
In most cases, Order Service should own order lifecycle state. Payment owns payment status. Inventory owns reservation status. Order listens for those outcomes and transitions the aggregate accordingly.
This preserves semantics:
- Payment does not decide what “confirmed order” means.
- Inventory does not decide what “ready to ship” means.
- Order composes external outcomes into its own lifecycle.
This is domain-driven design doing real work. Not sticky-note theater.
Migration Strategy
Breaking dependency cycles in a live enterprise estate is rarely a greenfield refactoring. More often, you are untangling a running machine while finance still expects month-end close to succeed.
The safest approach is progressive strangler migration.
Step 1: Map the actual dependency graph
Do not trust architecture diagrams. They are often aspirational fiction. Use:
- distributed tracing
- service mesh telemetry
- Kafka consumer/producer maps
- API gateway logs
- code dependency scanning
- deployment coupling analysis
You want to find not just technical calls, but business dependencies: which services must be healthy for a business transaction to complete?
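Once the edges are extracted from tracing or topic metadata, finding the cycles is a standard graph exercise. A minimal depth-first sketch, with illustrative service names:

```python
def find_cycles(edges: dict) -> list:
    """Depth-first search for cycles in a service dependency graph.

    `edges` maps a service to the services it depends on -- built, for
    example, from distributed tracing or Kafka producer/consumer maps.
    Returns each cycle as a list of services ending where it began.
    """
    cycles, path, visited = [], [], set()

    def visit(node):
        if node in path:
            # Back edge onto the current path: record the cycle.
            cycles.append(path[path.index(node):] + [node])
            return
        if node in visited:
            return
        visited.add(node)
        path.append(node)
        for dep in edges.get(node, []):
            visit(dep)
        path.pop()

    for node in edges:
        visit(node)
    return cycles
```

Running this over the real, observed graph rather than the drawn one is the whole point of Step 1.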
Step 2: Identify semantic ownership
For every cyclical edge, ask:
- What business fact is being requested?
- Who should authoritatively own that fact?
- Why is the requester asking at runtime?
- Is this a command, a query, or a disguised validation?
- Can the requester instead consume a published fact?
This is the DDD move. It often reveals that “validation” endpoints are really boundary mistakes.
Step 3: Create published domain events for stable facts
Add events for facts that are broadly useful and relatively stable in meaning:
- customer status changed
- credit limit assigned
- account frozen
- product discontinued
- order placed
- payment authorized
- invoice overdue
Use Kafka when you need durable replay, broad fan-out, and decoupled time. But do not publish internal noise. An event stream full of implementation detail is just a distributed database with extra steps.
Step 4: Build local projections in dependent services
If Order Service needs customer standing and credit band, project those facts locally. Keep the projection intentionally small. This is not data hoarding. It is autonomy by selective replication.
Step 5: Shift decision logic to the proper owner
Once the projection exists, move local rules into the consuming service. Remove the synchronous query. Keep fallback monitoring during the transition, but do not let fallback become permanent architecture.
Step 6: Introduce reconciliation
During migration, there will be drift. Events may be missed, consumers may lag, schemas may evolve badly. So add reconciliation jobs:
- compare source snapshots with local projections
- replay Kafka topics from a checkpoint
- detect orphan process instances
- identify mismatched business statuses
- route unresolved discrepancies to human operations
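The replay half of reconciliation can be sketched as well. In production this would be a Kafka consumer seeking to a stored offset; here the log is a list, the checkpoint is an index into it, and the record shapes are illustrative:

```python
def replay(log: list, projection: dict, from_offset: int) -> int:
    """Re-apply events from a checkpoint to repair a drifted projection.

    Application is an idempotent upsert, so re-processing events the
    projection has already seen is harmless. Returns the new checkpoint.
    """
    for record in log[from_offset:]:
        projection[record["key"]] = record["value"]
    return len(log)


# Stand-in for a durable topic of claim lifecycle events.
log = [
    {"key": "claim-1", "value": "SUBMITTED"},
    {"key": "claim-1", "value": "APPROVED"},
    {"key": "claim-2", "value": "SUBMITTED"},
]

drifted = {"claim-1": "SUBMITTED"}  # projection missed the last two events
checkpoint = replay(log, drifted, from_offset=0)
```

The conservative choice here is replaying from offset zero: idempotent application makes over-replaying safe, whereas guessing the exact point of drift risks missing events.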
Step 7: Strangle old pathways gradually
Retire synchronous endpoints one by one. Add governance controls so teams cannot quietly reintroduce them. This is where architecture review should be firm. A single convenience API can reopen the cycle.
Migration flow
This is the enterprise reality: temporary dual-running, event backfill, read model formation, and careful retirement.
Enterprise Example
Consider a global insurer modernizing claims processing.
The legacy platform was a large policy administration suite. The modernization program extracted separate services for:
- Policy
- Claims
- Billing
- Customer
- Fraud
- Payment
At first glance, the split looked sensible. In practice, Claims could not adjudicate a claim without checking policy status, customer identity verification, unpaid billing exposure, and fraud score. Billing, meanwhile, needed claim status to decide whether to suspend collections in certain scenarios. Payment needed claims approval status. Fraud wanted claim and billing signals before scoring. Soon enough, every major transaction involved a mesh of calls.
The dependency cycle was not just technical. It reflected a muddled understanding of the domain. The phrase “policy in good standing” meant different things to different teams:
- Policy team meant active coverage and endorsements valid.
- Billing team meant no delinquency beyond threshold.
- Customer team meant identity verified and no regulatory hold.
- Claims team meant all of the above, plus no open fraud case.
That single business phrase generated a circular dependency graph because nobody owned its meaning.
The fix was not to build a bigger API.
The insurer reworked the landscape around domain semantics:
- Policy Service owned coverage status.
- Billing Service owned delinquency status.
- Customer Service owned identity and regulatory hold status.
- Fraud Service owned investigation state and risk decisions.
- Claims Service owned claim lifecycle and adjudication policy.
Then they introduced a published fact model over Kafka:
- CoverageActivated, CoverageSuspended
- AccountDelinquent, AccountCurrent
- CustomerVerified, RegulatoryHoldPlaced
- FraudCaseOpened, FraudCaseCleared
- ClaimSubmitted, ClaimApproved, ClaimRejected
Claims consumed these events into a local eligibility projection. That projection did not try to become a master customer or billing record. It simply held the facts needed for claims adjudication. Most validations were then local. Only a small number of operations remained synchronous, mostly commands to initiate actions in other domains.
The business gained three things:
- Faster claim handling because user journeys no longer waited on multiple downstream reads.
- Clearer accountability because each status had an owner.
- Recoverability because Kafka replay and reconciliation could rebuild dependent projections.
What remained hard? Reconciliation. Always reconciliation.
When Billing corrected delinquency retroactively, some claims had already progressed under an outdated projection. The insurer handled this with compensating processes: reopen review, place payment hold, or escalate to manual assessment. That is not architectural failure. That is distributed business truth managed honestly.
Operational Considerations
Architects often stop once the boxes and arrows look cleaner. Operations is where the bill comes due.
Observability of dependency shape
You need visibility not just into service health, but into dependency structure:
- which APIs are hot paths
- where retries create feedback loops
- Kafka consumer lag by domain event
- stale projection age
- saga timeout rates
- reconciliation backlog
A cycle broken in code can reappear operationally through retry storms and fallback logic.
Idempotency and duplicate handling
Once you use asynchronous propagation, duplicates are not an edge case. They are normal weather. Event consumers must be idempotent. Commands should have correlation IDs and deduplication where it matters. A service dependency cycle combined with non-idempotent retries is one of the fastest routes to financial defects.
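The dedup mechanics are small but non-optional. A sketch of an idempotent consumer, assuming each event carries a unique id (the class and field names are illustrative):

```python
class PaymentConsumer:
    """Idempotent event consumer.

    Under at-least-once delivery, duplicates are normal weather; they are
    detected by event id and applied at most once. In production the seen-id
    set would live in durable storage alongside the state it protects.
    """

    def __init__(self):
        self.processed_ids = set()
        self.balance = 0

    def handle(self, event: dict) -> bool:
        if event["event_id"] in self.processed_ids:
            return False  # duplicate: skip, do not re-apply the effect
        self.processed_ids.add(event["event_id"])
        self.balance += event["amount"]
        return True


consumer = PaymentConsumer()
consumer.handle({"event_id": "e1", "amount": 100})
consumer.handle({"event_id": "e1", "amount": 100})  # broker redelivery
consumer.handle({"event_id": "e2", "amount": 50})
```

Without the id check, the redelivered event would be applied twice, which for a payment balance is exactly the financial defect described above.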
Schema evolution
Published domain facts need disciplined evolution. If every event change triggers coordinated deployment, you have simply rebuilt synchronous coupling in slow motion. Use tolerant readers, versioning strategy, and governance over event semantics.
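A tolerant reader is mostly a discipline, but it helps to see how little code it takes. A sketch, with illustrative field names: take only what this consumer needs, default what is optional, ignore everything else.

```python
def read_customer_event(payload: dict) -> dict:
    """Tolerant reader for a published customer fact.

    Extracts only the fields this consumer needs, supplies defaults for
    optional ones, and silently ignores anything unknown -- so the
    producer can add fields without coordinated deployment.
    """
    return {
        "customer_id": payload["customer_id"],       # required by this consumer
        "status": payload.get("status", "UNKNOWN"),  # optional, defaulted
        # Everything else in the payload is deliberately ignored.
    }


v1 = {"customer_id": "c-1", "status": "ACTIVE"}
v2 = {"customer_id": "c-1", "status": "ACTIVE", "segment": "GOLD"}  # newer schema
```

Both payload versions parse to the same consumer-side view, which is precisely what lets the producer evolve independently.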
Reconciliation as a first-class capability
Reconciliation is not an admin script hidden in a wiki. It deserves architecture:
- source-of-truth snapshots
- comparison jobs
- replay mechanisms
- discrepancy queues
- operator dashboards
- audit trails
If you depend on Kafka for propagation, you also depend on your ability to replay and repair.
Handling stale data
A local projection is useful only if the business can tolerate its staleness. So define freshness policies:
- max event lag for credit status
- max projection age for inventory availability
- fallback rules when stale beyond threshold
- business exception path when certainty is required
Do not leave this implicit. “Eventually consistent” is not a requirement; it is a confession unless tied to business tolerances.
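Making the policy explicit can be as simple as a guarded read. A sketch, assuming a 15-minute tolerance for credit status (the threshold and decision labels are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Freshness policy made explicit in code rather than left implicit.
MAX_CREDIT_STATUS_AGE = timedelta(minutes=15)


def credit_decision(projected_value: str, projected_at: datetime,
                    now: datetime) -> str:
    """Use the local projection while it is within tolerance; otherwise
    take the business exception path instead of deciding on stale data."""
    if now - projected_at <= MAX_CREDIT_STATUS_AGE:
        return projected_value           # fresh enough: decide locally
    return "REFER_TO_MANUAL_REVIEW"      # stale beyond threshold


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = credit_decision("APPROVE", now - timedelta(minutes=5), now)
stale = credit_decision("APPROVE", now - timedelta(hours=2), now)
```

The fallback branch is the important line: it turns “eventually consistent” from a confession into a stated business tolerance with a defined escape hatch.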
Tradeoffs
Breaking cycles is good architecture, but it is not free.
What you gain
- stronger service autonomy
- fewer cascading failures
- clearer bounded context ownership
- reduced request-path latency
- more independent deployment
- better resilience during partial outage
What you pay
- more replicated data
- eventual consistency
- reconciliation complexity
- harder debugging across time
- more event governance
- process compensation logic
The common trap is to compare an elegant event-driven target state with a naive synchronous baseline and declare victory. In reality, you are trading one kind of complexity for another. You move complexity from runtime call chains into state propagation and operational repair.
That is often the right trade. But call it honestly.
Failure Modes
Even good architects break cycles badly. Here are the common failure modes.
Event spaghetti
Teams remove APIs and publish everything to Kafka. Soon no one knows which events matter, who owns semantics, or what triggers what. You have not eliminated coupling. You have hidden it in topic subscriptions.
Projection bloat
A service starts with a small local read model and slowly accumulates half the enterprise schema. This is a warning sign. The service is becoming dependent on external data beyond its actual bounded context.
Orchestrator as new monolith
In an effort to prevent peer-to-peer cycles, teams centralize all workflow in a giant orchestrator. It becomes the new brain of the estate, packed with business logic from every domain. Coupling is reduced between services but reintroduced in one place.
False ownership
A service claims to own a concept but still requires runtime approval from others to make every important decision. Ownership without decision rights is just branding.
Reconciliation ignored until incident day
Everything works in happy-path testing. Then a Kafka consumer falls behind, or a topic is misconfigured, or a schema change drops an important field, and nobody can explain which records are wrong. Enterprises pay dearly for this kind of optimism.
Legacy back-calls during strangler migration
The monolith emits events to new services, but the new services still call the monolith for key decisions. Meanwhile the monolith starts querying the new services for partial data. Congratulations: you now have a hybrid cycle with twice the failure modes.
When Not To Use
There are situations where aggressively eliminating cycles through asynchronous replication is the wrong move.
1. Small, tightly cohesive domain with a stable team
If one team owns a tightly coupled set of behaviors and deploys them together, a modular monolith may be the better answer. Breaking a naturally cohesive domain into services just creates artificial dependency management.
2. Hard real-time consistency requirements
If the business absolutely requires a single, immediate, authoritative decision across data that cannot be stale even briefly, replication may be unacceptable. Be careful here. This is rarer than stakeholders think, but it does exist.
3. Low scale, high simplicity environments
If your transaction volume is modest and operational maturity is limited, introducing Kafka, projections, sagas, and reconciliation may create more risk than a simpler synchronous design.
4. Immature domain language
If the business does not yet agree on semantics, do not freeze those semantics into a fleet of services and event contracts. Invest first in domain discovery. Otherwise you will industrialize confusion.
5. Teams unable to operate distributed systems
Event-driven decoupling without observability, schema governance, replay capability, and operational discipline is self-harm. Sometimes the honest answer is to simplify, not distribute further.
Related Patterns
Several patterns are useful when dealing with service dependency cycles.
Bounded Context
The foundational DDD pattern. If boundaries are wrong, dependency cycles are symptoms, not root causes.
Saga
Useful for coordinating long-running, cross-domain workflows without distributed transactions. Use carefully. It can remove direct cycles but also centralize too much logic.
Outbox Pattern
Critical when publishing domain events reliably from transactional changes. It reduces the risk of local state committing without corresponding event emission.
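The pattern in miniature, as an illustrative sketch: the “database” is in-memory, the “transaction” is a single method, and a separate relay drains the outbox to the broker. Real implementations ride on an actual database transaction and something like a CDC connector.

```python
class Store:
    """Outbox pattern sketch: state change and outgoing event are written
    together, then published asynchronously by a relay."""

    def __init__(self):
        self.orders = {}
        self.outbox = []
        self.published = []  # stand-in for the broker, e.g. a Kafka topic

    def place_order_tx(self, order_id: str) -> None:
        # Both writes belong to one transaction: the event cannot be lost
        # if the state committed, nor emitted if the commit failed.
        self.orders[order_id] = "PLACED"
        self.outbox.append({"type": "OrderPlaced", "order_id": order_id})

    def relay(self) -> None:
        # A separate process drains the outbox to the broker in order.
        while self.outbox:
            self.published.append(self.outbox.pop(0))


store = Store()
store.place_order_tx("o-1")
store.relay()
```

The relay may publish duplicates after a crash, which is why the outbox pattern pairs naturally with the idempotent consumers discussed under Operational Considerations.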
CQRS
Helpful when separating command ownership from query needs. Particularly useful for local projections and consumer-specific read models.
Anti-Corruption Layer
Important during migration, especially when the monolith’s semantics do not map cleanly to new service boundaries. It prevents legacy meaning from contaminating emerging bounded contexts.
Strangler Fig Pattern
The practical migration approach for gradually replacing legacy capabilities without a big-bang rewrite. Especially valuable when cycles involve the old core system.
Event Sourcing
Sometimes relevant, but not a default answer. It can help with replay and auditability, but it also raises the cost of the model. Use where domain history is central, not as a reflex.
Summary
Service dependency cycles in microservices architecture are rarely just technical accidents. They are signs of semantic confusion, blurred ownership, and migration shortcuts that hardened into architecture.
The fix is not purity. It is clarity.
Use domain-driven design to decide who owns meaning. Let services publish facts they truly own. Replace unnecessary request-time dependency with local projections where the business can tolerate propagation delay. Use Kafka for durable event distribution when it serves the domain, not as decoration. Introduce reconciliation because distributed truth drifts. Migrate progressively with a strangler approach, and be ruthless about retiring temporary back-calls before they become permanent.
Most of all, remember this: a circular dependency graph is not merely a graph problem. It is the architecture drawing a circle around your indecision.
Break the cycle in the domain first. The technology will follow.
Frequently Asked Questions
What is a service mesh?
A service mesh is an infrastructure layer managing service-to-service communication. It provides mutual TLS, load balancing, circuit breaking, retries, and observability without each service implementing these capabilities. Istio and Linkerd are common implementations.
How do you document microservices architecture for governance?
Use ArchiMate Application Cooperation diagrams for the service landscape, UML Component diagrams for internal structure, UML Sequence diagrams for key flows, and UML Deployment diagrams for Kubernetes topology. All views can coexist in Sparx EA with full traceability.
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.