Versioning is where architecture stops being a clean drawing and starts acting like a city map. On the whiteboard, it looks harmless: v1 here, v2 there, a gateway in front, maybe a routing rule or two. In production, it becomes traffic management during roadworks in the middle of downtown. The streets are still open. People still need to get to work. Emergency vehicles still need right of way. And half the signs point to roads that no longer exist.
That is the real shape of API versioning in distributed systems. It isn’t just about putting /v2 in a URL. It is about controlling semantic change across a living network of consumers, services, events, caches, contracts, retries, and human habits. The minute you have multiple bounded contexts, asynchronous messaging, mobile clients that update late, and reporting systems nobody dares touch, “versioning” stops being an interface concern and becomes a topology concern.
And topology matters. Because the hard question is not whether you have v1 and v2. You almost certainly do. The hard question is: where do they coexist, who translates, who owns the semantic gap, and how do you shut the old road without crashing the city?
This article takes that question seriously.
We will look at API versioning as an architectural topology in distributed systems, especially where a v1/v2 routing graph crosses microservices, Kafka-driven integration, and domain boundaries. We will cover progressive strangler migration, reconciliation, failure modes, and tradeoffs. We will also be blunt about when not to use these patterns. Versioning is often necessary. It is not always noble.
Context
In simple systems, API versioning is often presented as a choice between URL versioning, header versioning, or content negotiation. That is useful in the same way a subway map is useful when you’re standing in a field. It helps, but only after the roads exist.
Distributed systems force a more practical view. An API is rarely a single implementation. It is usually an entry point into a graph:
- edge gateway or API management layer
- authentication and authorization filters
- orchestration or BFF services
- domain microservices
- event streams
- operational data stores
- analytics and batch consumers
- downstream third-party integrations
Now insert change into that graph.
Suppose Customer Profile v1 exposes a single “customer” resource with flattened contact details. In v2, the business wants explicit distinctions between legal identity, communication preferences, consent, and market-specific address rules. That is not merely a field rename. That is a domain refinement. The old API spoke in one language; the new one speaks in another.
This is where domain-driven design earns its keep. Good versioning architecture begins by asking whether v2 is:
- a technical evolution of the same domain concept,
- a new representation of the same capability, or
- evidence that the model itself has changed and should belong to a different bounded context.
Too many enterprises version APIs because they are trying to hide domain confusion behind HTTP. That never ends well. If your v2 requires translators, compensations, data repair, and a governance committee, there is a decent chance you do not have “an API versioning problem.” You have a domain model problem with an API-shaped symptom.
Problem
The practical problem is this: how do you introduce a materially different API contract without breaking existing consumers, while preserving operational stability and allowing the domain model to evolve?
That breaks down into several ugly sub-problems:
- old consumers may be slow to migrate, or may never migrate
- new domain semantics may not map cleanly to old ones
- multiple services may need coordinated change
- event-driven systems may carry both old and new schemas simultaneously
- writes through v1 and v2 may create divergent state
- reporting and reconciliation may disagree on truth
- gateway routing rules may accidentally split behavior in ways the business never intended
In a monolith, a version can often be a compatibility layer inside one codebase. In a distributed system, versions become paths through a graph. A request entering as v1 may hit a v1 façade, then a translator, then a v2 domain service, then emit both legacy and canonical Kafka events, then write to two stores, then feed a nightly reconciliation job. Another request entering as v2 may bypass half that path. Those are not just versions. They are different topologies.
And once topologies differ, operational characteristics differ too:
- latency differs
- consistency windows differ
- observability differs
- authorization paths differ
- failure surfaces differ
A versioning decision is therefore also a runtime architecture decision.
Forces
There are several forces pulling in opposite directions.
Backward compatibility versus domain integrity
Enterprises love backward compatibility because consumers are expensive to coordinate. But backward compatibility often means preserving domain mistakes. If v1 flattened concepts that should have been distinct, preserving it forever pollutes the core model.
The discipline is to protect consumers at the boundary, not contaminate the center. That is a very DDD way of thinking: preserve the ubiquitous language within the bounded context, and translate at the edges where necessary.
Consumer autonomy versus platform control
Teams consuming APIs want freedom to upgrade on their own schedule. Platform teams want deprecation and retirement to be enforceable. A routing graph with canary rules, feature flags, and migration telemetry gives platform control without forcing a big-bang cutover.
But this control has a cost: more routing, more adapters, more places for ambiguity to hide.
Synchronous request stability versus asynchronous event evolution
HTTP contracts can be versioned explicitly. Kafka contracts are trickier. Topics can be versioned, schemas can evolve, or upcasters can transform old events into canonical forms. None of these choices are free.
A mature enterprise often needs both:
- stable consumer-facing APIs
- evolving internal event contracts
That split is healthy, but only if you’re clear about where canonical truth lives.
Time-to-market versus technical drag
A quick v2 endpoint can be shipped with a forked codepath and a few if version == 2 branches. It works—until every release becomes archaeology. The faster you ship the wrong versioning topology, the slower every future change becomes.
The old line applies: there is nothing more permanent than a temporary compatibility layer.
Centralized mediation versus distributed translation
Should the API gateway route and transform? Or should each service own version translation? Central mediation simplifies consumer entry points but can become a logic swamp. Distributed translation keeps domain teams accountable but duplicates effort.
As ever, the answer depends on semantic complexity. Syntax can often be mediated centrally. Semantics usually cannot.
Solution
The most robust approach in distributed systems is to treat API versioning as a routing topology with explicit compatibility boundaries, not a naming convention.
The pattern works like this:
- Keep a canonical domain model inside the bounded context.
- Expose versioned contracts at the edges only.
- Route v1 and v2 through explicit façades or adapters.
- Use progressive strangler migration to move behavior from legacy implementations to canonical services.
- Emit canonical events internally; bridge legacy events only where needed.
- Use reconciliation to detect semantic drift during coexistence.
- Retire old routes aggressively once business risk falls.
The crucial design move is this: do not let v1 and v2 become equal citizens in the core. One of them must be the compatibility surface. Usually, that should be v1.
If v2 reflects the better domain language, then the architecture should bias toward v2 as the canonical path. v1 becomes a legacy façade with translation and compatibility logic. This keeps the inside of the system coherent while the outside changes gradually.
In a progressive migration model, the basic v1/v2 routing graph sends v1 traffic through a compatibility façade that translates into canonical commands, while v2 traffic reaches the canonical domain service directly.
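As a minimal sketch, that split can be expressed as a dispatch table in which only the façade knows about legacy shapes. The service and path names here are illustrative assumptions, not from any real system:

```python
# Minimal sketch of the v1/v2 routing split (names illustrative): v1
# enters through a compatibility facade that translates into canonical
# commands; v2 reaches the canonical domain service directly.

def canonical_domain_service(command: dict) -> dict:
    # One canonical implementation -- no version branches inside the core.
    return {"result": "ok", "action": command["action"], "model": "canonical"}

def v1_facade(request: dict) -> dict:
    # Translate the legacy request shape into a canonical command.
    command = {"action": request["op"], "payload": request}
    return canonical_domain_service(command)

def v2_endpoint(request: dict) -> dict:
    # v2 already speaks the canonical language; pass through.
    return canonical_domain_service({"action": request["action"], "payload": request})

ROUTES = {
    "/v1/customers": v1_facade,    # legacy road, still open
    "/v2/customers": v2_endpoint,  # canonical road
}

def route(path: str, request: dict) -> dict:
    return ROUTES[path](request)
```

The point of the sketch is structural: the version split is a visible routing decision, not a branch buried inside the domain service.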
This is opinionated, and deliberately so. The domain service should not be littered with version branches. The version split should be visible and governable. If a team cannot point to the translation boundary, they probably do not really know where the semantic difference lives.
Architecture
There are several versioning topologies worth distinguishing.
1. Parallel stack topology
In the parallel stack model, v1 and v2 each have separate end-to-end implementations.
This is tempting because it isolates change. It is also expensive because every capability, policy, and fix may have to be implemented twice. Parallel stacks are acceptable for a short-lived transition or when v2 truly belongs to a different bounded context. They are poison when used as a long-term convenience.
Use parallel stacks when:
- semantics are radically different
- data stores differ fundamentally
- there is low overlap in business behavior
- migration is time-boxed and funded
Do not use them because teams don’t want to talk.
2. Edge façade topology
In this model, both versions enter through distinct façades, but share a canonical domain implementation underneath. This is the sweet spot for most enterprise migrations. Version-specific representation and compatibility logic live at the edge, while invariants stay in the core.
This topology is ideal when:
- v2 is a domain refinement
- old and new APIs still operate on the same business capability
- the business cannot tolerate big-bang migration
- observability and deprecation need central control
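The edge-façade idea can be sketched in a few lines. The flattened v1 payload and the field names below are illustrative assumptions, not a real contract, but they mirror the kind of refinement described earlier: one flat resource split into identity, communication, and consent:

```python
# Hedged sketch of an edge facade: the flattened v1 "customer" payload
# is translated at the boundary into the canonical model's distinct
# concepts. All field names are illustrative.

def translate_v1_customer(v1: dict) -> dict:
    opt_in = v1.get("marketing_opt_in")  # legacy field may be absent
    return {
        "identity": {"legal_name": v1["name"]},
        "communication": {"email": v1.get("email"), "phone": v1.get("phone")},
        # An omitted legacy field must mean "unknown", never "denied".
        "consent": {"marketing": "unknown" if opt_in is None
                    else ("granted" if opt_in else "denied")},
    }
```

Note where the translation lives: at the edge, owned by the façade, leaving the canonical model free of v1 vocabulary.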
3. Gateway transformation topology
Here, the API gateway rewrites v1 to v2 or vice versa. This works for shallow syntactic changes: renamed fields, moved headers, basic request shims.
It fails for meaningful semantic change. Gateways are good at routing and policy. They are bad places to encode domain decisions like consent inheritance rules, pricing semantics, or customer identity resolution. The minute business meaning enters the gateway, you have built a distributed anti-pattern with excellent dashboards.
4. Event-canonical topology
In Kafka-heavy estates, the real center of gravity may be the event model rather than the request API. In that case, versioned APIs should translate into a canonical command/event flow, and downstream consumers should be protected through schema evolution, compatibility checks, and selective topic bridges.
In that architecture, both versioned APIs converge on a canonical command and event flow; modern consumers read the canonical topics, while legacy consumers are fed through bridge topics.
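A hedged sketch of that fan-out, with hypothetical topic and field names and an in-memory stand-in for a broker:

```python
# Sketch of an event-canonical flow: both API versions converge on one
# canonical event; a bridge republishes a legacy-shaped copy for
# consumers that still need the old schema. Topic and field names are
# illustrative, and `publish` stands in for a real Kafka producer.

published: dict = {}

def publish(topic: str, event: dict) -> None:
    published.setdefault(topic, []).append(event)

def to_canonical_event(customer_id: str, source_version: str) -> dict:
    return {"type": "CustomerRegistered", "customer_id": customer_id,
            "ingress_version": source_version}

def bridge_to_legacy(event: dict) -> dict:
    # Legacy consumers expect the old flattened shape.
    return {"event": "CUSTOMER_CREATED", "id": event["customer_id"]}

def on_registration(customer_id: str, source_version: str) -> None:
    canonical = to_canonical_event(customer_id, source_version)
    publish("customer.events", canonical)                # canonical topic
    publish("customer.events.legacy", bridge_to_legacy(canonical))  # time-boxed bridge

on_registration("c-42", "v1")  # one registration fans out to both topics
```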
The canonical event model should reflect domain truth, not legacy reporting shortcuts. If old consumers require a bridge topic, give them one, instrument it, and put an end date on it.
Domain semantics and bounded contexts
Versioning decisions become much clearer when framed with DDD.
Suppose an insurance enterprise had a v1 Policy API that treated “quote” and “policy” as lifecycle states of the same resource. In v2, underwriting, compliance, and fulfillment each needed explicit boundaries, because quote risk assessment and issued policy obligations are not the same thing. A team that simply versions the endpoint and adds fields will drag ambiguity forever. A team that recognizes a boundary shift can separate quoting from policy administration and use versioning as a migration tool rather than a disguise.
That is the essence of domain-driven versioning:
- preserve the language of the bounded context
- expose translations at context boundaries
- accept that some “versions” are actually context refactorings
If v2 changes the meaning of an aggregate, an invariant, or the identity boundary, treat it as a domain change first and an API version second.
Migration Strategy
The right migration strategy is rarely “launch v2 and ask everyone nicely to move.”
Enterprises need a progressive strangler approach.
Step 1: establish a canonical target
Decide what the core model is. This is not a technical step dressed as architecture. It is the architectural step. Without a clear target model, migration becomes endless dual maintenance.
Step 2: isolate legacy behavior behind a v1 façade
Keep v1 stable for consumers, but redirect implementation responsibility toward the canonical model. That may require request translation, response shaping, legacy ID mapping, and temporary data enrichment.
Step 3: dual-run where needed
For critical flows, especially writes, run the legacy and canonical processing paths in parallel for a period. Compare outputs. Publish discrepancies. This is where reconciliation enters.
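Dual-running a write might look roughly like this; both path functions and their output shapes are illustrative stand-ins, and the comparison deliberately targets business outcome rather than raw payload equality:

```python
# Dual-run sketch: the same write flows through legacy and canonical
# paths; business outcomes are compared and discrepancies recorded for
# reconciliation while the legacy result stays authoritative.

def legacy_path(request: dict) -> dict:
    return {"status": "created", "customer_no": request["id"].upper()}

def canonical_path(request: dict) -> dict:
    return {"state": "created", "business_key": request["id"].upper(),
            "kyc": "pending"}

discrepancies: list = []

def dual_run(request: dict) -> dict:
    old = legacy_path(request)
    new = canonical_path(request)
    # Compare the business outcome, not byte-for-byte payloads.
    if (old["status"], old["customer_no"]) != (new["state"], new["business_key"]):
        discrepancies.append({"request": request, "legacy": old, "canonical": new})
    return old  # legacy remains authoritative until cutover
```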
Step 4: reconcile aggressively
Reconciliation is the adult supervision of migration. It answers:
- did v1 and v2 produce equivalent business outcomes?
- did both emit the expected events?
- did downstream ledgers, reports, and customer-visible states align?
- where did semantic loss occur?
A practical reconciliation design often includes:
- correlation IDs across both flows
- canonical business keys
- comparison jobs for materialized state
- event completeness checks in Kafka
- exception queues with triage ownership
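A comparison job over materialized state can be sketched as a keyed join on the canonical business key. The store shapes here are assumptions for illustration; the essential idea is comparing semantics per key, not counting rows:

```python
# Reconciliation sketch: join legacy and canonical stores on a shared
# business key and compare business semantics, not just row counts.

def reconcile(legacy_rows: list, canonical_rows: list) -> list:
    legacy = {r["customer_no"]: r for r in legacy_rows}
    canonical = {r["business_key"]: r for r in canonical_rows}
    exceptions = []
    for key in sorted(legacy.keys() | canonical.keys()):
        old, new = legacy.get(key), canonical.get(key)
        if old is None or new is None:
            exceptions.append((key, "missing-in-one-store"))
        elif old["status"] != new["status"]:
            exceptions.append((key, "status-mismatch"))
    return exceptions
```

Each exception then lands in a triage queue with a named owner, as listed above.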
Step 5: route by cohort
Do not cut all traffic at once. Route by:
- internal consumers first
- low-risk partner channels next
- selected tenant or region cohorts
- then general traffic
This gives you a routing graph that evolves over time, not in a single violent switch.
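Cohort routing can be as simple as an allow-list plus a deterministic hash bucket, so a given tenant never flip-flops between versions. Cohort names and the percentage threshold below are illustrative:

```python
# Cohort routing sketch: trusted consumer classes go to v2 first, then a
# deterministic percentage of remaining traffic, bucketed by tenant.
import hashlib

V2_COHORTS = {"internal", "partner-low-risk"}  # consumer classes already on v2
GENERAL_TRAFFIC_PCT = 20                       # % of remaining traffic routed to v2

def bucket(tenant_id: str) -> int:
    # Deterministic 0-99 bucket so routing is stable per tenant.
    return int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % 100

def choose_version(consumer_class: str, tenant_id: str) -> str:
    if consumer_class in V2_COHORTS:
        return "v2"
    return "v2" if bucket(tenant_id) < GENERAL_TRAFFIC_PCT else "v1"
```

Ramping the migration is then a matter of widening the cohort set and raising the percentage, both observable in routing telemetry.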
Step 6: retire old writes before old reads
This is a subtle but important rule. Legacy reads can often survive longer because they are easier to shape from canonical state. Legacy writes are where divergence enters. The sooner you stop allowing old write semantics, the easier the system becomes to reason about.
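One hedged way to enforce the rule: the v1 façade keeps serving reads, shaped from canonical state, but refuses mutations with an explicit pointer to v2. Status codes and paths here are illustrative:

```python
# Sketch of "retire old writes before old reads": v1 reads survive,
# v1 mutations are rejected with 410 Gone.

READ_METHODS = {"GET", "HEAD"}

def v1_handler(method: str, path: str) -> tuple:
    if method not in READ_METHODS:
        return 410, {"error": "v1 writes are retired", "migrate_to": "/v2" + path}
    # Reads live longer: shape canonical state into the v1 representation.
    return 200, {"source": "canonical-state", "representation": "v1"}
```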
Taken together, these steps produce a migration graph that changes shape over time: façades appear, cohorts shift, bridges drain, and old routes close.
Migration reasoning
Why strangler over big-bang? Because in distributed systems, unknown dependencies are not edge cases. They are the default condition. There will always be an old mobile client, a spreadsheet integration, a nightly ETL job, or a regional partner with a contract everyone forgot to tell you about.
The strangler pattern acknowledges reality. It creates a governed path to replacement while preserving service continuity. It is slower upfront than a fantasy cutover, and far cheaper than recovering from one.
Enterprise Example
Consider a global retail bank modernizing its customer onboarding platform.
The starting point
The bank had:
- a public Customer API v1
- a CRM service
- a KYC/AML service
- a consent service
- Kafka topics feeding fraud, marketing, and reporting
- several mobile and branch applications
- a mainframe-backed customer master
The v1 API represented onboarding as one transaction: create customer, add address, capture consent, perform KYC. It was convenient for channel teams. It was also a domain mess. Consent is not identity. KYC is not profile capture. Address verification has market-specific rules. Yet v1 flattened them into one command.
The bank wanted v2 to:
- separate customer identity from marketing consent
- support asynchronous KYC outcomes
- model legal and preferred names distinctly
- emit canonical domain events for downstream services
- improve resilience for partial failures
The architectural move
Instead of rebuilding all channels at once, the bank created:
- a v2 onboarding façade exposing clearer task-oriented APIs
- a canonical customer domain service
- a consent bounded context with its own aggregate and event stream
- a v1 compatibility façade translating the old request into canonical commands
- Kafka bridges for legacy reporting topics
This was the right move because v2 was not just “v1 with more fields.” It reflected a better domain decomposition.
What happened in practice
For six months, branch applications stayed on v1. Mobile moved to v2 early. Both ultimately invoked the canonical services underneath.
The hard part was not the HTTP contract. The hard part was reconciliation:
- v1 treated onboarding success as a synchronous outcome
- v2 treated KYC as asynchronous
- legacy reporting expected same-day “customer created” counts
- compliance needed exact consent lineage
- duplicate customer detection behaved differently under the new identity rules
The bank solved this by introducing a migration ledger keyed by correlation ID. Every onboarding attempt recorded:
- ingress API version
- mapped customer identifier
- command execution states
- emitted Kafka events
- downstream completion markers
- reconciliation status
This ledger exposed semantic gaps quickly. One early defect was memorable: customers created through v1 façade translation emitted canonical customer events correctly, but the legacy marketing opt-in bridge inferred consent from an omitted field and marked some records as “unknown” rather than “false.” Reporting looked fine. Compliance did not. Without explicit reconciliation, that bug would have become an audit finding.
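The class of check that would have caught that defect can be sketched as a tri-state mapping rule asserting that an omitted legacy opt-in field maps to “unknown”, never to “false”. Field names are illustrative:

```python
# Reconciliation rule sketch for the consent defect: absence of the
# legacy opt-in field must surface as "unknown", never as a denial.

def expected_consent(v1_request: dict) -> str:
    opt_in = v1_request.get("marketing_opt_in")  # may be absent in v1
    if opt_in is None:
        return "unknown"
    return "granted" if opt_in else "denied"

def consent_matches(v1_request: dict, bridged_value: str) -> bool:
    # Compare what the bridge published against what translation implies.
    return bridged_value == expected_consent(v1_request)
```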
That is why migration architecture is not ceremony. It is operational truth-telling.
Outcome
After phased migration:
- all writes moved to canonical services
- v1 remained only as a read compatibility façade for a shrinking set of branch workflows
- Kafka legacy bridge topics were retired in stages
- consent became independently auditable
- onboarding latency improved for v2 channels because KYC no longer blocked the whole transaction
The important point is this: the bank did not “upgrade an API.” It reworked the domain shape and used versioning topology to survive the transition.
Operational Considerations
Version coexistence has to be visible. If you cannot answer “what percentage of revenue-impacting calls still go through v1?” you are not managing a migration; you are hosting one.
Key operational concerns include:
Observability
Track:
- request volume by version, consumer, and route
- translation failures
- semantic fallback usage
- latency by topology path
- dual-write discrepancies
- Kafka bridge lag
- reconciliation exceptions
- version retirement burn-down
A good dashboard distinguishes between:
- syntax errors
- translation errors
- domain rule violations
- downstream delivery issues
Those are different failure classes and demand different owners.
Contract governance
Use schema registries for Kafka, consumer-driven contract testing for synchronous APIs, and explicit deprecation policy with dates. Deprecation without enforcement is just wishful thinking in a Confluence page.
Security and policy drift
Versioned paths often diverge in authorization behavior by accident. A v2 route may enforce field-level permissions while a v1 compatibility façade may not. This is a common enterprise failure mode. Security policy must be tested across versions, not assumed to be inherited.
Data lineage
When v1 and v2 map differently to canonical fields, lineage matters. Auditors and support teams need to know where a value came from, how it was translated, and whether it was inferred or explicitly supplied.
Capacity and cost
Compatibility layers are not free. During coexistence, you may:
- process writes twice
- store extra mapping tables
- run reconciliation jobs
- maintain bridge consumers and publishers
- keep duplicate caches warm
Migration budgets should include these costs. They are not implementation trivia; they are the economic shape of coexistence.
Tradeoffs
There is no perfect versioning topology. There are only costs chosen consciously or unconsciously.
Edge compatibility preserves core purity
This is the strongest option for long-term maintainability. But it front-loads design effort into translation boundaries and canonical model clarity.
Parallel stacks reduce immediate coupling
They are easier politically when teams are split or time is short. They are worse strategically because they double operational burden and often delay retirement indefinitely.
Gateway transformation is quick
It works for syntax, headers, and simple rewrites. It is dangerous for domain semantics and often leads to hidden business logic in infrastructure.
Dual publishing eases migration
Publishing both canonical and legacy events can reduce downstream breakage. It also creates consistency risk and increases event governance overhead.
Reconciliation increases confidence
It also adds machinery, storage, and process. Some teams resist it because it “slows delivery.” Those teams generally learn about reconciliation later, in production, under less friendly conditions.
A good architect does not pretend these costs vanish. The point is to spend complexity where it decays, not where it compounds.
Failure Modes
Versioning fails in familiar ways.
Semantic drift hidden as compatibility
The API appears backward compatible, but business meaning has changed. Consumers keep working technically while producing wrong outcomes.
Forever-v1 syndrome
Nobody sets retirement dates, so v1 persists for years. Every new feature now requires compatibility logic, and the old version becomes the de facto center.
Split-brain writes
v1 and v2 both mutate state through different paths, producing divergent records, duplicate events, or conflicting side effects.
Legacy bridges becoming permanent integration surfaces
A temporary Kafka bridge topic becomes depended on by three more systems. Congratulations, you have created a new platform you did not intend to support.
Translation logic in the wrong place
Gateways, shared libraries, or ESB flows accumulate semantic mapping logic nobody owns. When the domain changes, all of it breaks in different ways.
Incomplete observability
Traffic shifts to v2, but operators cannot see which downstream failures are specific to translated v1 paths. Outages become blame archaeology.
Reconciliation theater
Teams create reconciliation jobs that count records but do not compare business semantics. Numbers match; truth does not.
When Not To Use
Not every change deserves a new version topology.
Do not introduce full v1/v2 routing graphs when:
- the change is additive and consumers can safely ignore new fields
- the domain semantics are unchanged
- your consumer base is tightly controlled and can migrate in lockstep
- the API is internal and owned by one team
- the operational cost of coexistence outweighs the business value
In these cases, prefer compatible evolution:
- additive fields
- optional attributes
- tolerant readers
- schema evolution in Kafka
- explicit deprecation of old fields without endpoint splits
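A tolerant reader is a few lines of discipline: read only what you need, default the optional, ignore the unknown. The payload fields below are illustrative:

```python
# Tolerant-reader sketch: the consumer survives additive evolution
# without a new endpoint, because unknown fields are simply ignored.

def read_customer(payload: dict) -> dict:
    return {
        "id": payload["id"],              # required field
        "name": payload.get("name", ""),  # optional, with a default
        # any new fields the producer adds are deliberately not read
    }
```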
Also, do not use API versioning to postpone domain decisions. If your team cannot explain what changed in the ubiquitous language, versioning will not save you. It will only preserve ambiguity with better routing.
Related Patterns
Several adjacent patterns matter here.
Strangler Fig Pattern
The backbone of progressive migration. Replace behavior incrementally while preserving outward continuity.
Anti-Corruption Layer
Essential when v1 semantics are poor but must still be supported. The v1 façade is often an anti-corruption layer protecting the canonical domain.
Backend for Frontend
Useful when channel-specific needs differ, but not a substitute for semantic versioning. A BFF can tailor representation without polluting core services.
Consumer-Driven Contracts
Helpful in understanding which consumers actually depend on which behaviors. Very useful before deprecation.
Event Upcasting
A good fit when Kafka consumers need old schemas translated into a canonical in-memory representation. Better than topic explosion in some cases, but only if managed carefully.
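A sketch of chained upcasters, with hypothetical schema versions and fields; each step lifts one version to the next until the event reaches the current canonical shape:

```python
# Event-upcasting sketch: old schema versions are lifted step by step at
# read time, so consumers handle exactly one representation.

def upcast_v1_to_v2(e: dict) -> dict:
    # v1 carried a flattened name; v2 splits legal vs preferred.
    out = {k: v for k, v in e.items() if k != "name"}
    out.update(schema=2, legal_name=e["name"], preferred_name=None)
    return out

def upcast_v2_to_v3(e: dict) -> dict:
    # v3 added explicit consent; absence means "unknown", not denial.
    out = dict(e, schema=3)
    out.setdefault("consent", "unknown")
    return out

UPCASTERS = {1: upcast_v1_to_v2, 2: upcast_v2_to_v3}

def upcast(event: dict) -> dict:
    while event.get("schema", 1) in UPCASTERS:
        event = UPCASTERS[event["schema"]](event)
    return event
```

The chain keeps each migration step small and testable, at the cost of one more piece of machinery to own.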
Outbox Pattern
Important when canonical state changes and event publication must stay reliable during migration, especially with dual publishing and reconciliation.
Summary
API versioning in distributed systems is not a matter of endpoint cosmetics. It is a question of topology: where requests go, where semantics change, where truth lives, and how old roads are closed safely.
The best architectures are clear about a few things:
- the canonical domain model lives inside bounded contexts
- version compatibility belongs at the edges
- semantic translation must have an owner
- progressive strangler migration beats heroic cutovers
- Kafka event evolution needs the same discipline as HTTP contracts
- reconciliation is not optional when semantics are changing
- old write paths should die before old read paths
- every bridge needs an exit plan
If there is one line worth remembering, it is this: version the boundary, not the heart of the domain.
That is the difference between a system that evolves and a system that merely accumulates history.