Distributed systems have a habit of lying to us. Not maliciously. Just persistently.
They promise a clean abstraction: deploy the same service in three regions, put a global load balancer in front, replicate the data, and let the platform do the rest. It sounds tidy on a slide. Then the first major incident lands. A customer in Frankfurt gets routed to Virginia because health checks looked green. Their session state lives in Dublin. Their payment authorization happened in Singapore through an asynchronous workflow. Kafka retries replay an order event after a leader election. Support sees three versions of “truth,” all technically valid, none operationally useful.
This is where topology-aware routing stops being an optimization and becomes architecture.
A serious multi-region system is not just a fleet spread across geography. It is a living map of latency boundaries, data gravity, sovereignty constraints, failure domains, and business semantics. Routing decisions are not merely network choices. They are domain decisions with infrastructure consequences. If your architecture does not understand topology, your incidents will.
The central idea is simple: route requests according to the topology that matters to the business and the system. Not just nearest region. Not just available region. The right region for the operation, the user, the data, the consistency requirement, and the failure state. In practice, that means routing based on domain ownership, data residency, write authority, replication lag, service dependencies, and operational posture.
This article argues for a pragmatic, domain-driven approach to topology-aware routing in multi-region systems. We will look at the problem, the forces pushing against us, a solution shape, architecture decisions, migration using a strangler pattern, operational concerns, tradeoffs, failure modes, and when this pattern is a bad fit. I’ll also walk through an enterprise example where this design matters in the most ordinary and brutal way: moving money.
Context
Most enterprises do not start with a pristine multi-region design. They inherit one.
A company grows from one market to five. Compliance enters the room. Customer latency becomes visible. A regional outage becomes board material. Acquisitions bring duplicate systems. Meanwhile, teams adopt microservices, Kafka, and managed cloud databases, often independently. Before long, “multi-region” means several things at once:
- active-passive disaster recovery for some systems
- active-active reads for others
- regional data residency for regulated domains
- globally distributed event streaming
- edge routing for web traffic
- private network connectivity across regions and clouds
These concerns become entangled. The network team thinks in paths. Platform engineers think in clusters. Application teams think in services. Domain teams think in customer journeys. Operations thinks in blast radius. Legal thinks in borders. The architecture only works when these views meet.
That is why topology-aware routing should be treated as a cross-cutting enterprise capability, not a clever API gateway feature.
The phrase “topology-aware” is often reduced to latency-based routing. That is too thin. Good architecture recognizes at least four kinds of topology:
- Geographic topology
Where users and systems physically are.
- Infrastructure topology
Regions, availability zones, clusters, networks, service mesh boundaries.
- Data topology
Which region is authoritative for which data, where replicas live, how stale they may be.
- Domain topology
Which bounded context owns which decisions, workflows, and invariants.
The first three are common. The fourth is the one that saves you from building an expensive mess.
Problem
A multi-region system must decide where each request should go. That sounds trivial until we ask, “Which request?” and “Go where, exactly?”
A login request, a pricing query, an order placement, a payment capture, an account transfer, and an analytics export all have different semantics. Some can tolerate stale data. Some cannot. Some must remain in-country. Some must land where the customer’s aggregate is mastered. Some depend on local Kafka consumers having applied the latest events. Some can fail over safely. Some absolutely cannot.
Without topology-aware routing, teams usually default to one of three flawed models:
- Route everything to nearest healthy region
- Pin users to a home region for everything
- Let each service decide independently
Each works for a while. Each breaks differently.
Nearest-region routing ignores authority. You get low latency on the front door and high confusion behind it. Home-region routing reduces ambiguity but often creates needless cross-region chatter for shared services and read-heavy workloads. Service-by-service routing sounds flexible, but in practice it decentralizes critical consistency and compliance logic into dozens of teams and codebases.
The result is architectural drift. A customer journey crosses regions in accidental ways. Data sovereignty rules become hardcoded exceptions. Cache invalidation gets regional variants. Retry storms amplify replication lag. During failover, operators discover that “healthy” only meant the ingress endpoint responded, not that the transaction path was complete.
This is not a routing bug. It is a missing model.
Forces
Topology-aware routing exists because several forces pull in different directions at the same time.
Latency versus correctness
Users feel latency immediately. The business feels incorrectness later and more expensively. Reading a product catalog from the nearest region is sensible. Writing a securities trade to the wrong region because it was 20 milliseconds faster is negligence dressed as performance engineering.
Availability versus consistency
Multi-region systems are often sold as a path to higher availability. True enough. But every availability gain is purchased with a consistency story. If a request can fail over anywhere, then the data and domain invariants behind that request must survive that move. Many do not.
Sovereignty versus operability
Regulated enterprises must often keep personal or financial data in specific jurisdictions. But pure residency rules can lead to fragmented operational models, duplicate tooling, and complex support paths. Routing is where legal boundaries become software behavior.
Central governance versus team autonomy
A central platform can implement global routing policy, but domain teams own the semantics of operations. The trick is to centralize policy enforcement without centralizing domain decision-making. That is classic domain-driven design territory: bounded contexts define meaning; the platform enforces mechanics.
Cost versus blast radius
Replicating everything everywhere is expensive. Not replicating enough limits failover. The architecture should be deliberate about what is globally portable, what is region-bound, and what is reconstructable from events.
Eventual consistency versus user expectations
Kafka and event-driven microservices make multi-region propagation practical, but they also make truth time-relative. If the customer updates an address in one region and immediately checks out in another, what should happen? “It depends” is the correct answer, but architecture has to codify the dependency.
Solution
The solution is to make routing decisions based on domain semantics plus topology metadata, not just network health.
In plain terms:
- classify operations by business semantics
- map those semantics to bounded contexts and data authority
- enrich the platform with topology metadata
- route requests to the region that is appropriate for that operation
- provide explicit fallback behavior when the ideal region is unavailable
- use events and reconciliation to restore global coherence
This is not one routing rule. It is a routing model.
At the core, every externally visible operation should be assigned a routing class. For example:
- Local Read: serve from nearest region using replica or cache
- Authoritative Read: serve from region owning the aggregate or domain authority
- Local Write with Async Replication: accept the write in the local region of authority and publish events
- Home-Region Write: always route to customer or account home region
- Jurisdiction-Bound Operation: route only within approved geography
- Workflow-Scoped Routing: route to the region coordinating a saga or process manager
- Degraded Safe Mode: allow only idempotent or read-only operations during partition
That classification becomes the bridge between domain-driven design and infrastructure.
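As a sketch, the taxonomy above can be made executable as a small closed enumeration that each bounded context maps its operations onto. The names here are illustrative, not a standard vocabulary.

```python
from enum import Enum, auto

class RoutingClass(Enum):
    """Illustrative routing classes mirroring the taxonomy above."""
    LOCAL_READ = auto()
    AUTHORITATIVE_READ = auto()
    LOCAL_WRITE_ASYNC_REPLICATION = auto()
    HOME_REGION_WRITE = auto()
    JURISDICTION_BOUND = auto()
    WORKFLOW_SCOPED = auto()
    DEGRADED_SAFE_MODE = auto()
```

Making the classes a closed enumeration rather than free-form strings keeps the taxonomy small and forces every new operation into an explicit conversation about its semantics.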
A bounded context should define which operations require authoritative decisions, what invariants matter, and whether a command may be accepted outside the owning region. Infrastructure then supplies region health, replication lag, affinity maps, sovereignty policies, and service dependency status. The routing layer combines them.
This is why I prefer a policy-based routing service or control plane over scattering conditional logic through API gateways and service code. Keep the runtime path fast, but make the decision model explicit, testable, and governed.
High-level flow
The point of the flow is not the boxes. It is the separation of concerns. The router should know enough to choose, not so much that it becomes the business application.
Architecture
A workable architecture for topology-aware routing in multi-region systems usually has six parts.
1. Global entry layer
This is your DNS, CDN, global load balancer, or API entry point. Its job is not to solve all routing. Its job is to terminate traffic intelligently and hand requests to a routing-aware edge or gateway. Keep this layer simple and fast.
Use it for broad concerns:
- nearest edge selection
- TLS termination
- DDoS protection
- coarse regional failover
- static content delivery
Do not bury domain semantics here if you can avoid it. The global entry point should know less than people want it to.
2. Routing policy engine
This is the heart of the pattern. It evaluates:
- request type and operation
- user or tenant residency
- customer/account home region
- domain authority map
- replication lag and dependency readiness
- compliance and sovereignty constraints
- current degradation policy
This can be implemented in different ways:
- gateway plugin backed by a policy service
- sidecar or service mesh extension
- dedicated routing service for internal calls
- centralized control plane distributing policy to gateways
My bias: use a centralized policy model with distributed enforcement. Enterprises need governance and auditability. They also need low-latency execution.
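A minimal sketch of what the policy engine's decision function might look like, combining operation semantics with topology state. The field names, operation classes, and lag threshold are all assumptions for illustration, not a reference implementation.

```python
from dataclasses import dataclass

@dataclass
class TopologyState:
    healthy_regions: set        # regions passing composite health
    replica_lag_seconds: dict   # region -> observed replication lag

@dataclass
class RoutingDecision:
    region: str
    reason: str

def route(operation_class: str, home_region: str, nearest_region: str,
          state: TopologyState, max_lag: float = 5.0) -> RoutingDecision:
    """Pick a target region for one request. Illustrative rules only."""
    if operation_class == "home_region_write":
        # Writes never leave the authority region, even if another is closer.
        if home_region in state.healthy_regions:
            return RoutingDecision(home_region, "authoritative write")
        return RoutingDecision(home_region, "degraded: queue or reject")
    if operation_class == "local_read":
        # Reads may be served nearby if the replica is fresh enough.
        lag = state.replica_lag_seconds.get(nearest_region, float("inf"))
        if nearest_region in state.healthy_regions and lag <= max_lag:
            return RoutingDecision(nearest_region, "fresh local replica")
        return RoutingDecision(home_region, "fallback to authority")
    return RoutingDecision(home_region, "default to authority")
```

Note that the degraded branch still names the home region: the engine reports where the operation belongs and why it cannot run, rather than silently failing over a write.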
3. Domain authority registry
This is often missing. It should not be.
You need a source of truth for which region is authoritative for a given domain entity or aggregate root. Examples:
- customer mastered in eu-west
- account mastered in us-east
- order authority in region where order was created
- contract authority fixed by jurisdiction
This is not a generic config table. It is domain architecture made executable.
In DDD terms, the aggregate boundary and bounded context determine where invariants are enforced. Routing must honor that. If the Customer context owns customer preferences by residency, but the Payments context owns ledger movement by account domicile, then a single end-to-end journey may require multiple routing decisions. Better to model that intentionally than pretend there is one region for everything.
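To make the registry concrete, here is a deliberately tiny in-memory sketch. A real registry would be a replicated, versioned store treated as critical state; the entity IDs and regions below are hypothetical.

```python
# Hypothetical authority registry: (bounded context, entity id) -> region.
AUTHORITY = {
    ("customer", "cust-123"): "eu-west",
    ("account", "acct-987"): "us-east",
}

def authority_region(context: str, entity_id: str) -> str:
    """Return the region where invariants for this aggregate are enforced."""
    try:
        return AUTHORITY[(context, entity_id)]
    except KeyError:
        # Fail closed: never guess where an aggregate is mastered.
        raise LookupError(f"no authority mapping for {context}/{entity_id}")
```

The failure mode matters more than the happy path: an unknown aggregate should be an explicit error, not a silent fallback to whichever region happened to receive the request.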
4. Regional execution stacks
Each region contains a meaningful slice of the system:
- stateless microservices
- caches
- databases or replicas
- Kafka cluster or regional event infrastructure
- observability stack
- integration adapters
Not every service must exist in every region. That is one of the biggest sources of waste in naive multi-region designs. Deploy according to domain need, not symmetry worship.
5. Event propagation and reconciliation
Multi-region systems that matter will use asynchronous messaging. Kafka is the obvious candidate because it gives durable logs, replay, partitioning, and broad ecosystem support. But Kafka does not remove the need for reconciliation. It simply gives you a decent spine.
When writes occur in one region and projections or downstream actions materialize elsewhere, there will be delay, duplication, reordering, and occasional divergence. A proper topology-aware design includes:
- idempotent consumers
- event versioning
- per-region replay capability
- anti-entropy or reconciliation jobs
- business-level conflict rules
- compensating workflows where needed
You do not have a multi-region architecture until you have a reconciliation story.
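An idempotent consumer is the simplest of these building blocks. As a sketch, deduplication keyed on a stable event ID might look like this; a production version would persist seen IDs with a bounded retention window rather than hold them in memory.

```python
class IdempotentConsumer:
    """Sketch: drop events already applied, keyed by a stable event id."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]  # assumed stable across replays
        if event_id in self.seen:
            return False              # duplicate from replay or retry
        self.seen.add(event_id)
        self.applied.append(event)    # stand-in for the real side effect
        return True
```

The contract is that replaying a region's topic from any offset leaves projections unchanged, which is exactly what per-region replay capability depends on.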
6. Degradation and failover policy
A region being “down” is the least interesting case. More dangerous is being half alive.
Examples:
- API responds, but Kafka replication is stalled
- database replica is available, but lag exceeds business threshold
- payment service is healthy, but fraud service in-region is degraded
- customer profile write succeeds locally, but outbound events are blocked
Routing policy must understand service dependency posture, not just endpoint health.
Reference architecture
Domain Semantics: the Part People Skip
Routing is often treated as a technical concern because it touches the network first. That is backwards.
The first question is not, “Which region is up?” The first question is, “What does this operation mean?”
Consider three operations in retail banking:
- GET /accounts/{id}/balance
- POST /transfers
- PATCH /customer/contact-details
All three may involve the same customer and account. But they have different domain semantics.
A balance read may come from a local read model if the freshness window is acceptable and disclosed. A transfer command must execute in the region where the ledger aggregate is authoritative. Contact details may be mastered in the customer domicile region and then propagated to product and channel systems asynchronously.
This is DDD doing useful work. Bounded contexts define what “truth” means for each operation. Aggregates define where invariants must hold. Published language helps prevent one team’s “account owner region” from becoming another team’s “preferred service region.”
Topology-aware routing should therefore be designed per bounded context, not as one universal algorithm.
That sounds heavier than it is. In practice, you define a small taxonomy of operation classes and let each bounded context map its endpoints or message types to that taxonomy.
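For the banking endpoints above, the per-context mapping can be as plain as a declarative table owned by the bounded context. The class names are illustrative and match the taxonomy from the solution section.

```python
# Per-bounded-context mapping from endpoint to routing class.
ROUTING_TABLE = {
    ("GET", "/accounts/{id}/balance"): "local_read",              # fresh-enough replica
    ("POST", "/transfers"): "home_region_write",                  # ledger authority
    ("PATCH", "/customer/contact-details"): "jurisdiction_bound", # residency rule
}

def classify(method: str, path_template: str) -> str:
    # Default conservatively: unknown operations go to the authority region.
    return ROUTING_TABLE.get((method, path_template), "authoritative_read")
```

A table like this is also reviewable by legal and operations, which free-floating conditionals inside gateway code never are.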
Migration Strategy
Most enterprises cannot pause and redesign routing from scratch. They need a progressive migration. This is where the strangler pattern earns its keep.
The migration should not start by moving all traffic. It should start by making routing decisions visible.
Stage 1: Observe
Instrument the current system:
- where requests enter
- which region actually serves them
- cross-region service calls
- data store authority and replica lag
- Kafka topic replication and consumer delay
- user or tenant geography
- failed retries by region
Build a topology map from reality, not from architecture diagrams last edited before the last reorg.
Stage 2: Classify operations
Define routing classes with domain teams. This is the critical workshop. Get product, legal, operations, and architecture in the room. For each important operation ask:
- where is authority?
- can it be served from a replica?
- what freshness is acceptable?
- what is the failover posture?
- what compliance boundaries apply?
- what is the customer expectation during regional impairment?
You will find ambiguity. Good. Better in a workshop than in production.
Stage 3: Introduce a routing facade
Place a policy-capable gateway or routing facade in front of selected services. Start with read paths or low-risk APIs. Keep the existing service contracts. Do not rewrite the world.
Stage 4: Externalize authority and affinity metadata
Move region selection logic out of applications and into a domain authority registry plus routing policy. Keep applications topology-aware enough to emit useful metadata, but not topology-decisive everywhere.
Stage 5: Strangle writes carefully
Migrate write operations bounded context by bounded context. Use dual-run where appropriate:
- old route remains active
- new policy route shadows or mirrors decisioning
- compare outcomes
- cut over incrementally by tenant, geography, or operation type
For event-driven systems, produce canonical events at the point of authoritative write and reconcile downstream projections.
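One way to run the shadow stage, assuming the legacy router and the new policy router expose the same decision interface (an assumption for this sketch):

```python
def shadow_route(request, legacy_router, policy_router, mismatch_log):
    """Serve traffic from the legacy route; evaluate the new policy in
    shadow and record disagreements for review before cutover."""
    legacy_region = legacy_router(request)
    try:
        policy_region = policy_router(request)
    except Exception as exc:
        # A crashing policy must never affect live traffic during dual-run.
        mismatch_log.append((request, legacy_region, f"policy error: {exc}"))
        return legacy_region
    if policy_region != legacy_region:
        mismatch_log.append((request, legacy_region, policy_region))
    return legacy_region  # traffic still follows the old route
```

The mismatch log, not the routing itself, is the deliverable of this stage: each entry is either a bug in the new policy or a place where the old system's behavior was never what the domain intended.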
Stage 6: Reconcile and retire
Expect mismatches. During migration, the old and new worlds will disagree. Build reconciliation pipelines to compare authoritative stores, projections, and Kafka-derived read models. You are not done when traffic shifts. You are done when data behavior is trustworthy.
Migration flow
A strangler migration succeeds because it narrows risk. It also reveals where your domain model was never explicit. That discovery is often the real project.
Enterprise Example
Consider a global payments company processing card transactions, merchant settlements, refunds, and compliance reporting across Europe, North America, and Asia-Pacific.
At first glance, the requirement sounds straightforward: serve customers from the nearest region for low latency and fail over during outages.
That design fails almost immediately.
Why? Because “payments” is not one thing. Authorization, fraud scoring, ledger posting, settlement, and customer support views all have different semantics.
Bounded contexts
- Authorization
Needs very low latency, may use local risk signals, but must honor card network and issuer rules.
- Fraud
Needs broad data visibility, may use globally aggregated features, often tolerates asynchronous enrichment.
- Ledger
Requires strict invariants. Double-entry bookkeeping does not care about your latency target.
- Merchant Settlement
Follows contractual and jurisdictional rules, often batch-influenced.
- Customer Support View
Can tolerate eventual consistency if clearly presented.
Routing decisions
- Card authorization requests enter at the nearest regional edge, but route to the merchant’s operating region unless network or issuer constraints dictate otherwise.
- Ledger postings always route to the account authority region.
- Fraud scoring may call a regional model with replicated features, but critical block decisions use authoritative policy versions.
- Refunds route to the original transaction authority region to avoid duplicate financial state.
- Support dashboards serve from local read models annotated with freshness and reconciliation status.
Kafka is used to publish transaction events regionally, replicate selected topics, and build support and reporting views in each geography. But ledger authority is not globally active-active. That would be architectural bravado, not prudence.
During a partial EU outage, the company does not fail all payment writes to the US. Instead:
- support views fail over to replicated read models
- low-risk customer profile reads continue locally
- ledger-affecting commands for EU authorities enter degraded mode
- queued operations are replayed after recovery
- reconciliation verifies event completeness and posting sequence
This sounds conservative because it is. Finance systems punish optimism.
Operational Considerations
Topology-aware routing creates a better architecture, but it also creates more moving parts. You need operational discipline.
Observability must be topology-native
Every trace, log, and metric should carry topology attributes:
- ingress region
- execution region
- authority region
- tenant or customer region
- data freshness classification
- replication lag observed at decision time
- routing policy version
Without this, incident analysis becomes folklore.
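As a sketch, the topology attributes can be built once per request and merged into every log record and span. The field names are assumptions, not a standard schema.

```python
import json

def topology_attributes(ingress, execution, authority, tenant_region,
                        freshness, lag_ms, policy_version):
    """Topology fields every trace, log, and metric should carry."""
    return {
        "ingress_region": ingress,
        "execution_region": execution,
        "authority_region": authority,
        "tenant_region": tenant_region,
        "freshness_class": freshness,           # e.g. "authoritative" / "replica"
        "replica_lag_ms_at_decision": lag_ms,   # lag observed when routing chose
        "routing_policy_version": policy_version,
    }

record = {"msg": "transfer accepted", **topology_attributes(
    "eu-central", "eu-west", "eu-west", "eu-west", "authoritative", 0, "v42")}
print(json.dumps(record))
```

Recording the lag observed at decision time, not at query time, is the detail that turns a post-incident argument into a lookup.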
Health is multidimensional
Do not route on binary health checks alone. Use composite health:
- endpoint health
- database write availability
- replica lag thresholds
- Kafka producer and consumer health
- critical dependency status
- policy distribution freshness
A region may be healthy for reads and unhealthy for authoritative writes. Your routing model should express that.
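Expressing that distinction in code means returning per-capability verdicts rather than one boolean. The inputs and lag threshold below are placeholders for whatever your platform actually measures.

```python
def region_health(endpoint_ok: bool, db_writable: bool, replica_lag_s: float,
                  kafka_ok: bool, deps_ok: bool, max_read_lag_s: float = 5.0):
    """Return separate read/write verdicts instead of one 'healthy' bit."""
    reads_ok = endpoint_ok and deps_ok and replica_lag_s <= max_read_lag_s
    writes_ok = endpoint_ok and db_writable and kafka_ok and deps_ok
    return {"serves_reads": reads_ok, "serves_authoritative_writes": writes_ok}
```

A routing policy consuming this shape can keep a region in rotation for read classes while draining authoritative writes away from it, which a binary health check cannot express.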
Policy changes need governance
Routing policy is production behavior. Treat it like code:
- version it
- test it
- review it
- canary it
- audit it
You are changing not just traffic patterns but business correctness.
Idempotency is mandatory
Any write that might be retried across regions must be idempotent. This is non-negotiable. Request keys, command IDs, and deduplication windows are not “nice to have” in multi-region systems.
Reconciliation is an operating model
Do not relegate reconciliation to a batch afterthought. Make it visible:
- divergence dashboards
- replay tooling
- exception queues
- business-owned resolution workflows
Some mismatches are technical. Others are business disputes wearing technical clothes.
Tradeoffs
There is no free lunch here. Topology-aware routing pays off because it chooses its costs deliberately.
Benefits
- lower latency where locality is safe
- stronger correctness for authority-bound operations
- clearer data sovereignty enforcement
- reduced accidental cross-region traffic
- more predictable failover behavior
- domain-aligned routing decisions
Costs
- more metadata to maintain
- more complex control plane
- additional governance overhead
- harder testing across regions and failure states
- more sophisticated observability requirements
- potential coupling between domain model and platform policy
The biggest tradeoff is philosophical: you give up the fantasy that any request can run anywhere. In return, you gain a system that behaves like the business actually works.
That is a good bargain.
Failure Modes
This pattern has sharp edges. Better to name them plainly.
Stale authority maps
If the router believes an entity belongs to region A when it has been migrated to region B, requests may oscillate, fail, or write to the wrong place. Treat authority metadata as critical state.
Split-brain writes
If failover rules permit writes in multiple regions without a safe conflict model, you will get divergence. In customer preference systems this is annoying. In ledgers it is catastrophic.
Hidden dependency asymmetry
A service may be deployed in all regions, but one of its dependencies is not. Routing traffic there creates a “healthy shell, broken core” failure. Dependency-aware health is essential.
Kafka replication assumptions
Teams often assume cross-region topic replication equals business consistency. It does not. Replicated events may arrive late, out of order, or without corresponding side effects applied. Event logs are necessary, not sufficient.
Policy drift
Different gateways or meshes running different policy versions can produce inconsistent routing for the same request. Centralized governance with distributed rollout matters.
User experience incoherence
If reads are local and writes are remote, users may observe old state after a successful command. Sometimes acceptable. Often not. You need explicit freshness semantics in the interface and experience design.
When Not To Use
Topology-aware routing is powerful. It is also easy to overapply.
Do not use this pattern if:
- you have a small system with one primary region and passive DR
- your data model is simple enough that whole-system failover is acceptable
- domain semantics do not vary meaningfully by operation
- regulatory constraints are minimal
- your team cannot yet operate distributed event-driven systems reliably
- you lack the organizational maturity to maintain routing policy and reconciliation
In many cases, a simpler model is better:
- one write region, many read replicas
- region pinning by tenant
- coarse DNS failover
- CDN for global reads
- deferred investment until scale or regulation truly demands more
Architecture should solve today’s important problems, not tomorrow’s hypothetical keynote.
Related Patterns
Topology-aware routing does not stand alone. It works with several adjacent patterns.
Cell-based architecture
Cells reduce blast radius by grouping users and services into isolated units. Topology-aware routing complements this by deciding which cell and region should serve a request.
Strangler fig pattern
Ideal for introducing routing policy gradually around legacy services. Especially useful when old systems have hardcoded regional assumptions.
CQRS
Helpful when local reads can be served from regionally replicated projections while authoritative writes stay pinned to a home region.
Saga orchestration
Useful for multi-step workflows crossing bounded contexts. But be careful: saga coordinators themselves may need topology-aware placement.
Outbox and transactional messaging
Critical when authoritative writes must emit Kafka events reliably for downstream replication and reconciliation.
Data mesh, carefully applied
For analytical domains, federated ownership and regional data products may fit well. For operational truth in transactional domains, do not mistake federated access for shared authority.
Summary
Topology-aware routing is what happens when a multi-region architecture grows up.
It recognizes that requests are not interchangeable, regions are not symmetrical, and business operations carry semantics that infrastructure must respect. The right routing decision depends on more than network latency. It depends on domain authority, data residency, replication state, workflow ownership, and failure posture.
The winning design is usually not “send everything to the closest healthy region.” It is a policy-driven model where bounded contexts define operation semantics, a routing layer evaluates topology and authority metadata, and asynchronous propagation plus reconciliation keep the wider estate coherent.
That model gives enterprises something better than simple failover. It gives them predictable behavior under stress. And that is what architecture is really for.
If you are migrating toward this pattern, start small. Observe reality. Classify operations. Externalize policy. Strangle selectively. Reconcile relentlessly. Keep writes conservative and reads opportunistic. Most of all, let the domain tell the network what matters.
Because in multi-region systems, geography is never just geography. It is policy. It is data. It is power. And if your routing does not know that, production certainly will.