Rolling Data Migrations in Microservices

Most data migrations fail for a boring reason: the team treats them like plumbing.

They are not plumbing. They are surgery performed while the patient is running a marathon.

That is the real shape of rolling data migration in microservices. You are not merely copying rows from one database to another. You are changing the living memory of a business while orders are still being placed, claims are still being adjudicated, payments are still being authorized, and customers are still refreshing their screens. In an enterprise estate, the hard part is rarely the mechanics of moving bytes. The hard part is preserving meaning. Data has gravity, but domain semantics have consequences.

This is why so many “simple migrations” become year-long programs with war rooms, spreadsheets, emergency scripts, and exhausted architects. Teams underestimate the amount of implicit business behavior embedded in persistence models, integration contracts, reporting pipelines, and user expectations. They migrate the data shape and forget the business truth that gave it shape.

A rolling data migration is the only sane answer when downtime is unacceptable, data volumes are large, and microservices must keep operating throughout the move. It lets you migrate gradually, service by service, aggregate by aggregate, capability by capability. But it only works if you think in domain boundaries, compatibility windows, reconciliation loops, and failure containment. This is not a database exercise. It is a socio-technical redesign.

The central idea is simple enough: instead of one giant cutover, you progressively shift reads, writes, and ownership from an old model to a new one, while both worlds coexist for a time. In practice, that coexistence is where the architecture earns its keep. Dual writes, backfills, CDC pipelines, Kafka topics, idempotent consumers, reconciliation jobs, strangler facades, semantic translation layers, and observability all become part of the migration machinery.

And that machinery must be designed like a first-class product. Because during migration, the migration is the system.

Context

In a healthy microservice landscape, each service owns its data and exposes behavior through APIs or events. That is the ideal. Enterprise reality is messier. Many estates arrive at microservices after years of shared databases, ERP customizations, integration buses, nightly ETL, and reporting extracts that no one dares touch. Data ownership is blurred. Schemas leak into neighboring teams. “Reference data” becomes a loophole for shared coupling. Historical records are interpreted differently by different applications.

Now add change.

A bank wants to decompose a customer master built around channels into a proper customer and party domain. An insurer wants to split policy servicing from claims. A retailer wants to carve inventory allocation away from an aging order management platform. A manufacturer wants to replace a monolithic product model with bounded contexts for catalog, pricing, and fulfillment. In each case, the existing data model is not merely old; it encodes assumptions that no longer fit the business.

That is where rolling migration enters. It is the architectural technique of moving data ownership and usage incrementally, without pausing the business. It often appears during a progressive strangler migration, where a new microservice capability is introduced alongside an old system and traffic is gradually redirected.

This matters especially in event-driven architecture. Kafka, change data capture, and service-owned stores make gradual transition feasible. They do not make it easy. Event streams are excellent conveyors of change, but merciless at exposing ambiguity. If the old system uses one notion of “customer,” the new service uses another, and analytics uses a third, no amount of event publishing will save you from semantic confusion.

The migration succeeds only when the domain model becomes clearer as the technical architecture becomes more distributed.

Problem

The core problem is not “how do we copy data?” It is “how do we change system-of-record ownership, data shape, and business behavior over time without breaking operations?”

That problem has several nasty sub-problems:

  • The source and target models rarely align one-to-one.
  • The source continues to change while migration is in progress.
  • Downstream consumers often depend on undocumented quirks.
  • Historical data is inconsistent, incomplete, or semantically stale.
  • Reports, audit trails, and compliance obligations require continuity.
  • Operational cutovers cannot tolerate long outages.
  • Different services move at different speeds.

This is why big-bang migration is seductive and usually wrong. It promises a clean switchover date, one final backfill, and a triumphant decommissioning. Enterprises love the theater of a cutover weekend. Architects should be suspicious of theater. Big-bang plans compress uncertainty into a single moment. If your assumptions are wrong, all failures arrive together.

Rolling migration does the opposite. It spreads uncertainty over time, where it can be measured, corrected, and contained.

But that introduces its own complexity. During the transition, old and new systems both matter. Data may be written in one place and read in another. Some aggregates may already be authoritative in the target service while others remain in the source. You now have to reason explicitly about write paths, read paths, lag, replay, duplicate events, reconciliation, and eventual consistency windows.

That is not a side issue. That is the architecture.

Forces

Several competing forces shape the design.

Continuity versus correctness

The business wants zero downtime. Compliance wants auditability. Operations wants low risk. Engineering wants a cleaner model. These goals can conflict. If you preserve continuity at all costs, you may tolerate long periods of semantic drift. If you force correctness too early, you risk business interruption.

Domain purity versus migration pragmatism

Domain-driven design tells us to model bounded contexts and preserve ubiquitous language. Good. Keep that. But migrations often require anti-corruption layers, temporary canonical events, translation services, and compatibility schemas that no one would design on a greenfield. Purists dislike these compromises. Sensible architects accept them—temporarily.

Throughput versus traceability

A high-volume migration may involve billions of records or event histories. Bulk backfill pipelines optimize throughput. Yet enterprise migrations also require traceability: which source record produced which target aggregate under which transformation rule at what time? If you cannot answer that, your audit conversation will be unpleasant.

Simplicity versus safe rollback

The more direct your migration path, the simpler the system. The more reversible your migration path, the more machinery you need. Shadow reads, dual writes, replayable topics, compensations, and reconciliation stores are not elegant. They are insurance.

Local service autonomy versus cross-estate coordination

Microservices encourage independent evolution. Migrations punish uncoordinated change. Topic contracts, data retention, sequencing assumptions, and identity mapping all need cross-team discipline during a migration window.

Solution

The practical solution is a rolling migration built around progressive ownership transfer, event-driven synchronization, reconciliation, and strangler-style traffic shifting.

At a high level:

  1. Define the target bounded context and ownership clearly.
  2. Introduce a new service and data store without immediate cutover.
  3. Mirror change from the old world into the new world, often using CDC or domain events over Kafka.
  4. Backfill historical data in controlled batches.
  5. Reconcile continuously to identify drift and semantic mismatches.
  6. Shift reads first, often through a facade or routing layer.
  7. Shift writes later, once confidence is high.
  8. Retire old ownership gradually, capability by capability.

This is the progressive strangler pattern applied to data, not just request routing.
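The eight steps above can be made concrete as an explicit per-capability phase from which routing decisions are derived. A minimal sketch in Python, with illustrative phase names (none of these come from a specific framework):

```python
from enum import Enum

class Phase(Enum):
    """Phases of a rolling migration for one domain capability."""
    LEGACY_ONLY = 1    # old system owns reads and writes
    MIRRORING = 2      # CDC/events replicate changes into the target
    BACKFILLED = 3     # history loaded, reconciliation running
    SHADOW_READS = 4   # target answers reads silently, for comparison only
    TARGET_READS = 5   # reads served by the target service
    TARGET_WRITES = 6  # target owns writes; legacy kept in sync
    RETIRED = 7        # legacy ownership removed

def routing_for(phase: Phase) -> dict:
    """Which system serves user-facing reads and writes in a given phase.
    Reads shift before writes: there is an intermediate phase where the
    target serves reads while the legacy system still accepts writes."""
    reads = "target" if phase.value >= Phase.TARGET_READS.value else "legacy"
    writes = "target" if phase.value >= Phase.TARGET_WRITES.value else "legacy"
    return {"reads": reads, "writes": writes}
```

Keeping the phase explicit per capability, rather than implicit in scattered feature flags, is what makes the later cutover questions answerable.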

The most important design choice is to migrate by domain capability, not by table. Tables are implementation details. Businesses operate on concepts: customer profile, policy renewal, order allocation, invoice settlement. If you migrate table by table, you create a technical plan disconnected from behavior. If you migrate by domain capability, you preserve semantic intent and can reason about business completeness.

A good migration architecture has four distinct lanes:

  • Source of change lane: where updates originate during each phase
  • Replication lane: how changes are propagated
  • Validation lane: how correctness is measured
  • Traffic lane: how reads and writes are routed

Keep these separate. Teams often bundle replication and validation together, or tie traffic cutover directly to bulk load completion. That creates brittle dependencies. Replication says data moved. Validation says data makes sense. Traffic routing says users may rely on it. Those are different milestones.

Architecture

A common architecture uses an old system of record, a new service with its own database, Kafka for propagation, and a migration control plane for backfill and reconciliation.

This pattern works because it allows the target service to be populated from both historical and live change streams. The backfill gets you to “mostly there.” The streaming lane keeps you current. Reconciliation tells you whether “mostly there” is good enough to trust.

Notice what is missing: direct shared-database access from other services into the new store. During migration, that temptation becomes strong. Resist it. The target service must own its persistence and publish clear contracts. Otherwise you simply recreate the coupling you are trying to escape.

Domain-driven design and semantic translation

This is where architecture stops being mechanical and starts being serious.

Suppose a legacy CRM has a Customer table that actually mixes consumer identity, business account hierarchy, channel preferences, and legal party information. The target architecture, guided by domain-driven design, separates these concerns into bounded contexts: Party, Customer Profile, Consent, and Account Relationship.

A naive migration copies fields. A proper migration interprets meaning.

That means:

  • source entities may split into multiple target aggregates
  • target IDs may not match source IDs
  • business rules may be re-evaluated during transformation
  • some source data may be discarded as non-domain noise
  • some target fields may need derivation from multiple legacy structures

This is why an anti-corruption layer matters. It protects the target model from being polluted by the source model. During migration, that layer often sits in the transformation pipeline or the consuming service. It translates old concepts into new ubiquitous language and makes semantic loss explicit.

If you skip this step, you get the worst of both worlds: modern infrastructure wrapped around a legacy conceptual mess.
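As a sketch of that translation, assume a legacy row with hypothetical columns such as CUST_NAME, CHNL_CD, and MKT_FLAG; the anti-corruption layer splits it into separate target aggregates and makes semantic loss explicit by recording what it deliberately drops:

```python
def translate_legacy_customer(row: dict) -> dict:
    """Anti-corruption layer sketch: split one overloaded legacy
    Customer row into Party, Profile, and Consent aggregates.
    All column names here are illustrative assumptions."""
    party = {
        "legal_name": row["CUST_NAME"].strip().title(),
        "tax_id": row.get("TAX_ID"),  # may be absent for consumers
    }
    profile = {
        # legacy channel codes become ubiquitous-language values
        "preferred_channel": {"E": "email", "P": "phone"}.get(
            row.get("CHNL_CD"), "unknown"),
    }
    consent = {"marketing_opt_in": row.get("MKT_FLAG") == "Y"}
    # non-domain noise is discarded deliberately, and visibly
    discarded = sorted(k for k in row if k in {"SCREEN_HINT", "SORT_KEY"})
    return {"party": party, "profile": profile,
            "consent": consent, "discarded": discarded}
```

The `discarded` list is the important part: semantic loss that is recorded can be defended in an audit; semantic loss that is silent cannot.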

Read and write routing across phases

Rolling migration usually progresses through phases: legacy-only, mirrored replication into the target, shadow reads, target-served reads, target-owned writes, and finally legacy retirement.
There is a reason experienced architects shift reads before writes. Reads are easier to validate through shadowing and comparison. Writes establish ownership. Once you move writes, rollback becomes more complicated, especially if downstream services start depending on target-native events.

The facade can be an API gateway, BFF, dedicated routing service, or even an orchestration layer internal to the platform. What matters is that routing decisions are explicit and observable. If cutover is hidden in client code or scattered feature flags, you will not know which consumers are actually using which source of truth.
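One way to make read routing explicit and observable is deterministic percentage-based bucketing in the facade. A sketch, where the hashing scheme is an assumption rather than a prescription:

```python
import hashlib

def route_read(entity_id: str, target_read_pct: int) -> str:
    """Deterministic percentage-based read routing: the same entity
    always hashes to the same bucket, so routing is sticky per entity
    rather than random per request."""
    digest = hashlib.sha256(entity_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "target" if bucket < target_read_pct else "legacy"
```

Because the bucket is derived from the entity ID, raising the percentage only ever moves entities from legacy toward the target, never back and forth, which keeps shadow comparisons and caching behavior stable during the rollout.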

Migration Strategy

A disciplined rolling migration strategy usually has six steps.

1. Identify migration units by domain

Do not say “we are migrating the customer schema.” Say “we are migrating address management for retail customers” or “we are migrating inventory reservation for ecommerce orders.” A migration unit should map to a coherent behavior, a bounded context edge, and a measurable business outcome.

This keeps the blast radius small and lets you retire legacy ownership incrementally.

2. Build the target model and compatibility edges

Model the target service properly. Give it a service-owned database and clear aggregates. Then design temporary compatibility structures:

  • ID mapping tables
  • translation rules
  • fallback read logic
  • topic versioning rules
  • source-to-target provenance metadata

These artifacts are migration scaffolding. They are not elegant. They are necessary.
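An ID mapping table with provenance might look like this in miniature (the in-memory class stands in for what would normally be a persistent mapping store):

```python
class IdMap:
    """Bidirectional source-to-target ID mapping with provenance,
    one of the compatibility edges described above."""

    def __init__(self):
        self._by_source = {}  # source_id -> (target_id, rule_version)
        self._by_target = {}  # target_id -> source_id

    def bind(self, source_id, target_id, rule_version):
        """Record that source_id maps to target_id under a given
        transformation rule version."""
        self._by_source[source_id] = (target_id, rule_version)
        self._by_target[target_id] = source_id

    def target_for(self, source_id):
        entry = self._by_source.get(source_id)
        return entry[0] if entry else None

    def source_for(self, target_id):
        return self._by_target.get(target_id)

    def provenance(self, source_id):
        """Which rule version produced the mapping, for audit answers."""
        entry = self._by_source.get(source_id)
        return entry[1] if entry else None
```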

3. Backfill history

Bulk load historical data in batches. Preserve source version information and timestamps where possible. Mark every migrated record with lineage metadata: source key, extraction batch, transformation version, migration timestamp.

This is where many enterprises cut corners. They should not. Historical backfill is the part you will be interrogating during defects, audits, and executive reviews. “We think it came across correctly” is not a serious answer.
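A minimal lineage-stamping helper, assuming lineage travels with the record itself (in practice it may equally live in a separate provenance store):

```python
from datetime import datetime, timezone

TRANSFORM_VERSION = "v3"  # bump whenever transformation rules change

def stamp_lineage(record: dict, source_key: str, batch_id: str) -> dict:
    """Attach lineage metadata to a migrated record so every row can
    answer: where did you come from, in which batch, under which
    transformation rules, and when?"""
    record["_lineage"] = {
        "source_key": source_key,
        "batch_id": batch_id,
        "transform_version": TRANSFORM_VERSION,
        "migrated_at": datetime.now(timezone.utc).isoformat(),
    }
    return record
```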

4. Stream live change

Use CDC from the source database where domain events do not exist, or consume business events if they do. Kafka is often the practical backbone because it supports decoupled consumers, replay, partitioning, and retention windows long enough to recover from consumer failure.

Still, CDC is not magic. It tells you what changed in storage, not what that change meant in business terms. If the legacy system can emit domain events with stable semantics, prefer them. If it cannot, CDC is a good bridge, but expect to enrich and normalize aggressively.
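A sketch of that enrichment step: lifting a raw CDC row change into a business-meaningful event. Table and column names here are hypothetical, and real transitions are usually far richer than one status flag:

```python
def normalize_cdc(change: dict):
    """Interpret a storage-level CDC change in domain terms. A raw
    UPDATE on POLICY.STATUS only becomes 'PolicyCancelled' when the
    before/after transition actually means cancellation."""
    before = change.get("before") or {}
    after = change.get("after") or {}
    if change["table"] != "POLICY":
        return None  # not a domain-relevant table
    if before.get("STATUS") != "C" and after.get("STATUS") == "C":
        return {"type": "PolicyCancelled",
                "policy_id": after["POLICY_NO"],
                "effective": after.get("CANC_DT")}
    return None  # a storage change with no domain meaning we track
```

The point is the asymmetry: many storage mutations map to no domain event at all, and one domain event may require inspecting both the before and after images.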

5. Reconcile continuously

Reconciliation is not a one-time validation task at the end. It is a permanent companion during migration.

There are two kinds:

  • Structural reconciliation: counts, keys, referential integrity, nullability, state transitions
  • Semantic reconciliation: balances, eligibility states, lifecycle status, consent state, policy coverage meaning

Structural reconciliation is easy to automate. Semantic reconciliation is where the real risk lives.

You need both.

A dedicated reconciliation service or pipeline should compare source and target views, classify drift, track trends, and expose confidence indicators. Some differences are expected due to eventual consistency or intentional transformation. Others indicate logic bugs, ordering issues, or hidden source assumptions.
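A toy structural reconciliation pass, with a tolerated-fields set for differences that are expected (intentional transformation, known lag) rather than defects:

```python
def reconcile(source: dict, target: dict, tolerated=frozenset()) -> dict:
    """Compare source and target views keyed by business identifier
    and classify drift. 'tolerated' names fields whose differences
    are expected rather than defects."""
    report = {"missing_in_target": [], "unexpected_in_target": [],
              "field_drift": []}
    for key, src in source.items():
        tgt = target.get(key)
        if tgt is None:
            report["missing_in_target"].append(key)
            continue
        for field, value in src.items():
            if field not in tolerated and tgt.get(field) != value:
                report["field_drift"].append((key, field, value, tgt.get(field)))
    report["unexpected_in_target"] = [k for k in target if k not in source]
    return report
```

A real pipeline would run this continuously over partitioned snapshots and trend the counts; the classification step is the part that keeps "drift" from being a single meaningless number.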

6. Shift traffic progressively

Use progressive rollout:

  • shadow reads
  • canary reads
  • percentage-based read routing
  • tenant or region-based write cutover
  • capability-level ownership transfer

Do not flip all consumers at once. In enterprises, there is always one forgotten downstream dependency running a quarterly process that nobody mentioned in discovery.
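Shadow reads, the first rollout stage above, can be sketched as a wrapper that always serves the legacy answer while recording divergence, so the target is exercised without ever affecting callers:

```python
def shadow_read(legacy_fn, target_fn, record_mismatch, key):
    """Serve the legacy answer, also ask the target, and record any
    divergence. Target failures must never reach the caller."""
    legacy_result = legacy_fn(key)
    try:
        target_result = target_fn(key)
        if target_result != legacy_result:
            record_mismatch(key, legacy_result, target_result)
    except Exception as exc:  # a broken target is data, not an outage
        record_mismatch(key, legacy_result, f"error: {exc}")
    return legacy_result
```

In production the target call would be asynchronous and sampled; the invariant to preserve is that the caller's result and latency depend only on the legacy path.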

Enterprise Example

Consider a global insurer modernizing policy administration.

The legacy platform stores policy, insured party, billing setup, endorsements, and product-specific fields in a highly normalized relational model shaped by a commercial package installed fifteen years earlier. Claims, billing, customer communications, and reporting all read from this model directly or through extracts. The company wants to create a separate Policy Service and Customer Service in a microservices architecture, with Kafka used for event propagation across underwriting, billing, and claims.

A junior team might begin by migrating policy tables into a new database and exposing CRUD APIs. That would be a mistake. The legacy “policy” concept is overloaded. It blends contract identity, insured risk, premium schedule, agent hierarchy, and lifecycle states that differ by product line. Property insurance and life insurance use the same storage structures with different interpretations. Copying the schema just recreates the confusion.

A better approach starts with domain decomposition:

  • Customer Service owns party identity, contact details, and consent
  • Policy Service owns policy lifecycle and contract identity
  • Billing Service owns payment schedule and receivables
  • Claims Service consumes policy coverage facts but does not own policy state

The migration begins with one narrow capability: policy inquiry for personal auto policies in one country. Historical policies are backfilled into the new Policy Service. CDC from the legacy platform streams policy updates to Kafka. A transformation service interprets package-specific database changes into target aggregates. Reconciliation checks that policy status, effective dates, endorsements, and premium summary align between old and new views.

Initially, customer service representatives still write changes through the legacy system. The new Policy Service is read-only and fed asynchronously. The portal then shifts inquiry reads for a pilot region to the new service. Shadow comparison shows 98.7% alignment. The remaining 1.3% reveals hidden assumptions around reinstatement endorsements and backdated cancellations. The rules are fixed. Confidence improves.

Only later does the insurer move a write capability: address update endorsements for personal auto. The facade routes those writes to the new service for pilot brokers. The new service publishes policy-changed events to Kafka, and the legacy platform receives compatible updates for downstream systems that still depend on it. Eventually, policy inquiry and selected endorsement flows become target-owned, while more complex commercial lines remain on the old system.

That is what a real enterprise migration looks like. Uneven. Product-line specific. Full of semantic traps. Successful because the team migrated business capability, not just storage.

Operational Considerations

Rolling migrations are operational programs as much as design exercises.

Observability

You need migration-specific telemetry, not just normal service metrics.

Track:

  • replication lag by aggregate type
  • backfill progress and throughput
  • drift counts by severity
  • read routing percentages
  • write ownership by tenant or region
  • duplicate and out-of-order event rates
  • reconciliation exception aging

A migration without a dashboard is a rumor.

Idempotency and replay

Kafka consumers must be idempotent. Replays will happen. Duplicate delivery will happen. Backfills may overlap with live streams. If processing the same event twice corrupts the target state, your migration architecture is fragile.

Use business keys, version checks, monotonic event sequencing where available, and commutative update logic when practical.
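A version-checked idempotent apply, sketched against an in-memory store standing in for the target database:

```python
def apply_event(store: dict, event: dict) -> bool:
    """Apply an event only if its version is newer than what the
    target already holds. Replays and duplicates become harmless
    no-ops instead of state corruption."""
    key = event["aggregate_id"]
    current = store.get(key)
    if current is not None and event["version"] <= current["version"]:
        return False  # duplicate or stale: already applied
    store[key] = {"version": event["version"], "state": event["state"]}
    return True
```

The same check protects the overlap between backfill and live stream: whichever path delivers a given version second simply loses.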

Data lineage

Every migrated record should tell a story: where it came from, when it was transformed, by which rule version, and whether it has been reconciled. This is essential for support, audit, and rollback analysis.

Security and compliance

Migrations often replicate regulated data into new stores, pipelines, and topics. That changes your security surface. PII may now exist in Kafka topics, staging buckets, reconciliation stores, and temporary extracts. Encryption, masking, retention, and access controls must be revisited. Enterprises often forget that migration tooling is part of the production risk landscape.

Decommission discipline

The most expensive migration is the one that never finishes. Temporary bridges become permanent architecture fossils. Put explicit exit criteria on every compatibility component: when dual writes stop, when fallback reads are removed, when old topics are retired, when source tables become read-only, when the last consumer is cut over.
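Exit criteria work best when they are checkable rather than aspirational. A sketch, with criteria taken from the list above:

```python
def ready_to_decommission(status: dict) -> list:
    """Return the exit criteria still blocking decommissioning of a
    compatibility component. Decommission only when the list is empty."""
    criteria = {
        "dual_writes_stopped": "dual writes still active",
        "fallback_reads_removed": "fallback reads still wired in",
        "old_topics_retired": "legacy topics still produced to",
        "source_read_only": "source tables still writable",
        "last_consumer_cut_over": "a consumer still reads the legacy store",
    }
    return [msg for flag, msg in criteria.items() if not status.get(flag)]
```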

Tradeoffs

Rolling migration is safer than big-bang. It is also more expensive in the short term.

You trade a single cutover event for a prolonged coexistence period. That means duplicate infrastructure, more moving parts, and more operational burden. Your architecture becomes temporarily more complex so that your business risk becomes permanently lower.

That is a good trade in most enterprise environments. Not in all.

Dual writes can reduce cutover risk but increase consistency hazards. CDC avoids invasive source changes but captures storage events, not business intent. Reconciliation improves confidence but adds cost and often surfaces uncomfortable truths about source data quality. Kafka gives you replay and decoupling, but event ordering and retention strategy suddenly matter a lot more.

The deepest tradeoff is conceptual. To migrate incrementally, you often need temporary abstractions that are less pure than the destination architecture. Canonical events, translation layers, migration-only IDs, and facade routing are all compromises. Use them with discipline. Remove them when their job is done.

A migration plan that optimizes only for architectural purity usually fails in production. A migration plan that optimizes only for delivery speed usually leaves behind a distributed mess.

The craft is in knowing which impurity is temporary and which one will haunt you for years.

Failure Modes

There are predictable ways these efforts go wrong.

1. Table-first migration

The team maps schemas mechanically and assumes semantics will follow. They do not. The target service becomes a thin wrapper over a legacy data model, and the migration delivers technical motion without business simplification.

2. No explicit ownership model

Both old and new systems accept writes for too long. Divergence becomes normal. Nobody can state which system is authoritative for a given business fact. Support teams start “fixing” records in whichever UI is handy. The migration quietly dies.

3. Reconciliation treated as testing

Teams validate on pre-production samples and assume production will behave similarly. Then live edge cases, data corruption, and undocumented workflows create drift they cannot explain. Reconciliation must run continuously in production during the coexistence period.

4. Event semantics are weak

CDC messages are consumed as if they were domain events. Downstream services infer business meaning from low-level storage mutations. A harmless schema change in the source then breaks target interpretation.

5. Forgotten consumers

BI extracts, fraud models, operational scripts, partner feeds, and back-office tools often rely on old structures. They are discovered late, after routing has shifted. This is one reason facade-based migration and dependency inventories matter.

6. Temporary becomes permanent

Fallback reads remain for two years. Dual writes never stop. Legacy and target both persist the same concept indefinitely. Every future change now costs twice as much. This is architectural debt with interest.

When Not To Use

Rolling migration is not always the right answer.

Do not use it when:

  • the dataset is small and a planned outage is acceptable
  • the business process can tolerate a clean cutover weekend
  • the source system is already nearly isolated and has few consumers
  • the semantic model is unchanged and the move is purely infrastructural
  • the organization lacks the operational maturity to run coexistence safely

If you have a low-volume internal application, a straightforward export-transform-import with a short freeze may be saner. Not every problem needs Kafka, reconciliation fleets, and migration control planes. Architects earn their keep as much by declining complexity as by designing it.

Also, do not use rolling migration as a way to avoid making hard domain decisions. If the target bounded context is still vague, coexistence will simply amplify ambiguity. Migrate after clarifying ownership, not before.

Related Patterns

Several architecture patterns commonly support rolling data migration.

Strangler Fig Pattern

The classic progressive replacement pattern. In data migration, it means redirecting capabilities gradually while legacy functionality remains in place.

Anti-Corruption Layer

Essential when source and target bounded contexts use different language. It prevents legacy semantics from contaminating the new model.

Change Data Capture

Useful when you need to propagate source changes without invasive source refactoring. Best used as a bridge, not a long-term domain integration strategy.

Event-Carried State Transfer

Helpful for propagating state snapshots through Kafka where downstream services need denormalized local views during migration.

Saga and compensation

Relevant when writes cross partially migrated capabilities and side effects must be coordinated without distributed transactions.

Materialized views

Often used to support shadow reads, comparison, and query-specific migration slices.

Reconciliation pattern

Underused and badly named. It deserves to be a first-class pattern in enterprise modernization. If you are running two truths for a while, you need a disciplined way to compare them.

Summary

Rolling data migration in microservices is not a background technical task. It is a deliberate transfer of business truth from one bounded context and ownership model to another, performed while the enterprise keeps moving.

That makes domain thinking non-negotiable. The migration has to follow business semantics, not storage structures. It has to use progressive strangler techniques so capabilities can shift gradually. It has to embrace Kafka, CDC, and asynchronous propagation where they help, but without pretending that movement of bytes is the same as preservation of meaning. And it absolutely must include reconciliation, because coexistence without comparison is just optimistic drift.

The shape of the answer is consistent across industries:

  • define target ownership by bounded context
  • backfill history with lineage
  • stream live change
  • reconcile continuously
  • shift reads before writes
  • cut over by capability
  • retire temporary bridges ruthlessly

The line I come back to is this: during migration, the migration is the system.

Treat it with the same design rigor, observability, and operational discipline you would give any production platform. If you do, rolling migration becomes a controlled modernization path instead of an expensive act of hope. If you do not, you will simply distribute your legacy problems across more services, more topics, and more dashboards.

And nobody needs a more modern mess.
