There is a particular kind of enterprise failure that doesn’t arrive with a bang. It arrives with a steering committee.
Someone says, “We’ll modernize the platform.” A slide appears with boxes labeled customer service, order service, billing service. Arrows fly everywhere. Kafka appears in the middle like holy water. The monolith is declared “legacy,” the future is declared “microservices,” and a migration program is born with all the optimism of a city planning a ring road.
Then reality turns up.
The monolith is not one thing. It is twenty years of pricing exceptions, half-forgotten workflows, tacit operational knowledge, and business semantics hidden in batch jobs with names like finalize_v3b. It is ugly, yes. But ugly systems are often the ones keeping the company alive. Replacing them outright is not architecture. It is gambling dressed as strategy.
That is why incremental refactoring architecture matters. Not as a fashionable compromise, but as the only serious way to move a live enterprise from a tightly coupled system to a service-based landscape without setting fire to revenue recognition, customer support, and operations in the process.
This article is about that approach: how to migrate from a monolith to microservices through progressive strangler refactoring, why domain semantics matter more than technology choices, where Kafka helps and where it causes trouble, and how to reconcile two worlds while they coexist. This is not a fairy tale about decomposition. It is a working method for architects who must modernize systems that already matter.
Context
Most large organizations do not begin with a clean domain model and a neat set of bounded contexts. They begin with one successful application, then another, then integration glue, then reporting marts, then emergency code, then a merger, then three years of “temporary” APIs. Over time, the system becomes a sedimentary rock of business change.
That accumulated software often has real strengths. It centralizes core logic. It offers transactional consistency. It has known operational behavior. It supports awkward but valuable business capabilities that no one wants to rediscover from scratch.
But it also becomes a drag anchor.
Release cycles slow down. Teams collide in the same codebase. A change in customer onboarding unexpectedly breaks invoicing. Scaling one hot workflow means scaling everything. Regulatory changes require touching too many unrelated parts. New channels and products become expensive because the architecture no longer matches the business.
This is where microservices migration usually enters the conversation. Unfortunately, the conversation is often too technical. Teams discuss Docker, service meshes, Kafka, and CI/CD pipelines before they can answer the foundational question: what business capability are we separating, and where does its domain boundary truly lie?
Incremental refactoring architecture starts there. It assumes the existing system contains useful business truth. It treats migration as a sequence of carefully chosen extractions guided by domain-driven design, not by a generic decomposition template.
The key shift is this: you are not “moving to microservices.” You are rearranging the system so that software boundaries align more closely with business boundaries, team ownership, and operational realities.
That sounds obvious. In enterprises, it rarely is.
Problem
The central problem is not that the monolith is old. The problem is that it concentrates too many responsibilities behind a single deployment and a single data gravity well.
That creates several painful dynamics:
- Tight coupling of change: unrelated business changes compete in the same release train.
- Poor domain clarity: concepts like customer, account, order, invoice, policy, shipment, or entitlement mean different things in different parts of the system.
- Operational bottlenecks: one high-volume workflow drives infrastructure choices for the whole platform.
- Data entanglement: tables become shared contracts, which is another way of saying no one can change anything safely.
- Migration fear: because the system is critical, leadership wants certainty; because certainty is impossible, teams either freeze or attempt a reckless rewrite.
The rewrite temptation is especially dangerous. A full replacement promises conceptual cleanliness. It also demands complete rediscovery of business behavior, often under time pressure and with partial understanding. This is how organizations lose edge-case handling, hidden controls, and operational nuance.
An enterprise system is not just code. It is a treaty between departments, auditors, customer service agents, and nightly jobs. If you replace it in one move, you must renegotiate the treaty in one move too.
That almost never ends well.
Forces
Good architecture is shaped by forces, not slogans. In incremental migration, several forces pull against one another.
Business continuity versus architectural improvement
The business wants modernization but cannot tolerate disruption. Revenue, compliance, customer experience, and reporting must continue while the architecture changes underneath. This makes migration a problem of coexistence, not replacement.
Domain purity versus legacy reality
DDD teaches us to model bounded contexts around business capabilities. Correct. But the legacy system was not built that way, and the business may not even use language consistently. Architects must discover domain boundaries while respecting existing process seams, organizational ownership, and data realities.
Autonomy versus consistency
Microservices promise independent change. But some business operations require strong consistency, especially in financial, inventory, or regulated processes. The deeper truth is that autonomy always has a cost. You pay for it in reconciliation, eventual consistency, duplicate data, and operational complexity.
Speed versus control
Incremental extraction should accelerate delivery for high-change domains. Yet each extracted service introduces network failure, schema evolution, deployment coordination, observability needs, and production runbooks. Moving too fast creates a distributed monolith. Moving too slowly preserves the old one.
Event-driven decoupling versus semantic confusion
Kafka and event streaming are useful in migration because they let systems coexist without direct lockstep calls. But events are not magic. If teams have not agreed on what an “OrderPlaced” or “CustomerUpdated” event means, then Kafka merely distributes ambiguity at scale.
This is why domain semantics matter so much. Integration technology can carry messages. It cannot create meaning.
Solution
The practical solution is a progressive strangler migration driven by domain boundaries, backed by anti-corruption layers, event-based integration where appropriate, and explicit reconciliation capabilities during coexistence.
The old strangler fig pattern still holds because it reflects how enterprises actually change: one capability at a time, with traffic gradually shifting from the old implementation to the new. But the pattern needs a more modern and more disciplined interpretation.
At a high level, the architecture evolves in stages:
- Identify bounded contexts and seams in the monolith.
- Extract one business capability that has clear ownership and high change pressure.
- Place an anti-corruption layer between the old and new worlds so semantics are translated, not leaked.
- Publish or capture business events to synchronize state where full transactional replacement is not yet possible.
- Run old and new in parallel with reconciliation controls.
- Shift write ownership for the capability to the new service.
- Retire the corresponding monolith functionality only after proving operational correctness.
This is not merely technical extraction. It is a transfer of business authority.
A service should not exist because a team wants smaller deployables. It should exist because a business capability deserves a clear model, a responsible team, and an explicit contract with the rest of the enterprise.
The role of domain-driven design
DDD is especially valuable here because migration exposes semantic confusion. In many monoliths, “customer” means legal entity in one module, billing party in another, and signed-in user in a third. If you extract services without resolving that ambiguity, you get three APIs, five Kafka topics, and the same confusion in a more expensive shape.
Bounded contexts help you decide where language is consistent enough to support autonomy. Context mapping helps you understand where translation is unavoidable. Event storming, process mapping, and domain workshops are not ivory-tower exercises here; they are migration tools.
The right service boundary is often found where:
- business rules are cohesive,
- change is frequent,
- ownership can be clear,
- data can be independently governed,
- and integration can tolerate asynchronous coordination.
The wrong boundary is often found where:
- the process is heavily transactional across domains,
- semantics are still unresolved,
- teams are not ready to own operations,
- or the service would merely front a shared database.
Architecture
An incremental refactoring architecture usually has five major elements:
- Legacy monolith that continues to run core processes.
- Strangler facade or routing layer that controls which requests go to old or new implementations.
- New domain services with clear bounded contexts and owned data.
- Integration backbone, often Kafka for event propagation and decoupled synchronization.
- Reconciliation and observability mechanisms to detect and correct divergence.
Any diagram of this landscape hides the hard part, which is semantics. The anti-corruption layer matters because the monolith’s model is often not fit to become the enterprise standard. If you let legacy schemas and legacy object shapes leak into new services, you haven’t extracted a service. You’ve exported old coupling.
Strangler facade
The facade routes traffic and enforces migration policy. Early on, most requests still hit the monolith. Over time, specific endpoints, commands, or workflows are routed to new services. Sometimes routing is based on functionality; sometimes on tenant, geography, product line, or account cohort.
This controlled redirection is what makes incremental migration survivable. It provides rollback options. It lets teams canary production behavior. It supports measured traffic shifts.
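The routing policy above can be sketched in a few lines. This is a hypothetical illustration, not a real gateway API: the names `route_request`, the endpoint set, and the tenant cohort are all assumptions standing in for whatever your facade technology provides.

```python
# Hypothetical strangler facade routing policy: route by endpoint and
# tenant cohort, with an operations-controlled rollback switch.
LEGACY, NEW = "legacy-monolith", "quote-service"

# Endpoints whose functionality has already been extracted.
MIGRATED_ENDPOINTS = {"/quotes"}

# Tenants participating in the canary cohort for migrated endpoints.
CANARY_TENANTS = {"tenant-042", "tenant-117"}

def route_request(endpoint: str, tenant_id: str, rollback: bool = False) -> str:
    """Decide which implementation serves a request.

    The rollback flag is the escape hatch: everything returns to the
    monolith without a redeploy.
    """
    if rollback:
        return LEGACY
    if endpoint in MIGRATED_ENDPOINTS and tenant_id in CANARY_TENANTS:
        return NEW
    return LEGACY
```

The point of keeping the policy this explicit is that migration decisions become data, reviewable and reversible, rather than code scattered across callers.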
Anti-corruption layer
This is one of the least glamorous and most important components in the whole architecture. The anti-corruption layer translates between domain models so that new services can remain coherent. It may transform commands, reshape responses, enrich data, or map event structures.
Without it, the new world inherits the old world’s language debt.
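A minimal sketch of that translation, assuming invented legacy column names (`POL_NO`, `CUST_NO`, `POL_STAT`) and an invented status mapping. The shape matters more than the specifics: legacy codes are mapped at the boundary and never leak into the new model.

```python
# Anti-corruption layer sketch: legacy rows in, domain objects out.
# Legacy field names and status codes here are illustrative assumptions.
from dataclasses import dataclass

# Legacy status codes stop at this mapping and go no further.
_STATUS_MAP = {"A": "active", "C": "cancelled", "P": "pending"}

@dataclass(frozen=True)
class Policy:
    policy_id: str
    customer_id: str
    status: str

def translate_legacy_policy(row: dict) -> Policy:
    """Translate a legacy row into the new bounded context's model.

    Unknown legacy states fail loudly instead of leaking through.
    """
    status = _STATUS_MAP.get(row["POL_STAT"])
    if status is None:
        raise ValueError(f"unmapped legacy status: {row['POL_STAT']!r}")
    return Policy(
        policy_id=str(row["POL_NO"]),
        customer_id=str(row["CUST_NO"]),
        status=status,
    )
```

Failing on unmapped states is deliberate: silent pass-through is exactly how legacy semantics colonize the new world.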
Kafka and event propagation
Kafka is often useful because migration creates a period where multiple systems need timely visibility of business changes. A newly extracted service may need to react to changes originating in the monolith; the monolith may also need data produced by new services.
Event streaming supports this coexistence, but only when event design is disciplined:
- distinguish business events from technical CDC noise,
- version schemas carefully,
- preserve ordering only where it is genuinely required,
- and design consumers for replay and idempotency.
In migration, Kafka is often most valuable as a backbone for decoupled synchronization, not as proof that the enterprise is now “event-driven.”
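One way to enforce that discipline is a versioned event envelope, sketched below. The field names are assumptions, not a standard; the idea is that a business event carries an identity, a type, a schema version, and a partition key, which is precisely what raw CDC output lacks.

```python
# Sketch of a versioned business event envelope, distinguishing a
# deliberate business fact from raw table-change noise.
import json
import uuid
from datetime import datetime, timezone

def make_event(event_type: str, version: int, key: str, payload: dict) -> dict:
    """Wrap a business fact in an envelope consumers can evolve against."""
    return {
        "event_id": str(uuid.uuid4()),      # enables idempotent consumption
        "event_type": event_type,           # e.g. "QuoteAccepted"
        "schema_version": version,          # consumers check before parsing
        "key": key,                         # partition key: per-entity ordering only
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }

event = make_event("QuoteAccepted", 2, "quote-8831", {"premium": "412.50"})
serialized = json.dumps(event)  # what would actually be published
```

Note what the envelope does not promise: global ordering. Keying by entity gives per-entity order, which is usually the only ordering a business process genuinely requires.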
Reconciliation as first-class architecture
This point deserves bluntness: during incremental migration, state divergence is not an exception. It is a design assumption.
If orders can be created in one system and invoicing updated in another, if customer preferences move first while account data stays behind, if asynchronous updates can be delayed or replayed, then reconciliation is not an operational afterthought. It is part of the architecture.
You need:
- canonical identifiers across systems,
- traceable event lineage,
- compensating workflows,
- discrepancy reports,
- replay capabilities,
- and business-owned exception handling.
A migration without reconciliation is optimism pretending to be design.
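The core of a reconciliation pass is simple enough to sketch. Assuming both systems can be queried by a canonical identifier, a discrepancy report falls into three buckets: records missing on the new side, records missing on the old side, and records whose state disagrees.

```python
# Reconciliation sketch: compare two systems' views of the same
# capability, keyed by canonical identifier. The order states used in
# the example are illustrative.
def reconcile(legacy: dict, modern: dict) -> dict:
    """Compare state keyed by canonical identifier; return discrepancies."""
    missing_in_new = sorted(set(legacy) - set(modern))
    missing_in_old = sorted(set(modern) - set(legacy))
    mismatched = sorted(
        k for k in set(legacy) & set(modern) if legacy[k] != modern[k]
    )
    return {
        "missing_in_new": missing_in_new,
        "missing_in_old": missing_in_old,
        "mismatched": mismatched,
    }

report = reconcile(
    {"o1": "invoiced", "o2": "open"},
    {"o1": "invoiced", "o2": "shipped", "o3": "open"},
)
```

In production the inputs would be extracts or snapshots rather than in-memory dictionaries, and each bucket would feed a business-owned exception queue rather than a return value, but the three-bucket shape stays the same.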
Migration Strategy
The strategy should be progressive, measurable, and explicitly tied to business capability extraction. A good sequence looks less like a big-bang roadmap and more like a campaign plan.
1. Discover seams
Start with domain analysis, production behavior, release pain, and organizational ownership. Look for capabilities with:
- high business change,
- low to medium dependency depth,
- painful release coordination,
- and understandable boundaries.
Customer notifications, pricing configuration, product catalog, case management, and onboarding are common candidates. Core ledger posting or inventory allocation may not be.
2. Choose the first extraction carefully
The first service is a political and technical signal. Choose one that is meaningful but survivable. Too trivial, and the organization learns nothing. Too central, and the migration burns credibility.
A good first extraction usually has:
- visible business value,
- manageable integration points,
- modest transaction complexity,
- and a team capable of owning development and operations.
3. Build the target model, not a remote monolith endpoint
The new service should have its own domain model, storage, API contracts, and event semantics. If it simply wraps legacy tables or mirrors monolith object graphs, you are creating a future problem.
4. Introduce read-side coexistence
Often the safest path is to let the new service build its own read model first. It can consume monolith-originated events, read through an ACL, or ingest snapshots. This allows teams to validate domain understanding before taking write ownership.
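A read model built this way is just a fold over monolith-originated events. The sketch below assumes invented event shapes for a customer context; the point is that the new service learns to interpret the domain before it is trusted to change it.

```python
# Read-side coexistence sketch: fold monolith-originated events into the
# new service's own read model. Event shapes are assumptions.
def apply_event(read_model: dict, event: dict) -> dict:
    """Fold one event into the read model, keyed by customer id."""
    kind = event["event_type"]
    cid = event["customer_id"]
    if kind == "CustomerCreated":
        read_model[cid] = {"name": event["name"], "active": True}
    elif kind == "CustomerDeactivated" and cid in read_model:
        read_model[cid]["active"] = False
    return read_model

model = {}
for e in [
    {"event_type": "CustomerCreated", "customer_id": "c1", "name": "Acme"},
    {"event_type": "CustomerDeactivated", "customer_id": "c1"},
]:
    model = apply_event(model, e)
```

Because the model is derived, it can be thrown away and rebuilt from a replay whenever the team's domain understanding improves, which is exactly the cheap feedback loop this phase is meant to provide.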
5. Parallel validation
Before switching writes, compare outputs between old and new implementations. This can be done with shadow traffic, dual computation, sampled workflow replay, or side-by-side business report comparison.
This phase often reveals hidden rules no one documented.
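Dual computation can be as simple as the sketch below: serve the legacy answer, compute the new one in the shadow, and record every divergence. The pricing functions are placeholders for real monolith and service logic; the loyalty-discount wrinkle is invented to show how a divergence surfaces.

```python
# Shadow-computation sketch: legacy result is served, the new
# implementation runs alongside, and disagreements are recorded.
divergences = []

def legacy_premium(quote: dict) -> float:
    return round(quote["base"] * 1.20, 2)   # stands in for monolith logic

def new_premium(quote: dict) -> float:
    # The new service honors a discount the monolith silently ignores --
    # exactly the kind of rule shadow traffic is meant to surface.
    return round(quote["base"] * 1.20 - quote.get("loyalty_discount", 0.0), 2)

def price_quote(quote: dict) -> float:
    served = legacy_premium(quote)          # the legacy answer still wins
    shadow = new_premium(quote)
    if shadow != served:
        divergences.append({"quote": quote["id"], "old": served, "new": shadow})
    return served
```

Every entry in `divergences` is either a bug in the new service or an undocumented rule in the old one; during parallel validation, both findings are valuable.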
6. Shift write ownership
Once confidence is established, direct write operations for that capability to the new service. The monolith becomes a consumer of resulting state changes rather than the system of record for that capability.
This is the real milestone. A service becomes real when it owns a decision and the state behind it.
7. Retire legacy code
Only after write ownership is stable, reconciliation is clean, and downstream consumers are adapted should you remove monolith behavior. Enterprises are too fond of celebrating the deploy and too casual about deleting the old path. Until the old path is gone, complexity remains.
Migration slices: horizontal and vertical
There are two common migration cuts:
- Horizontal technical slices, such as moving all reads to a new layer.
- Vertical business slices, such as extracting returns processing end to end.
Vertical slices are usually better because they align with domain ownership and produce clearer outcomes. Horizontal slices can help bootstrap platforms, but they often postpone the real semantic work.
Enterprise Example
Consider a global insurance company migrating a policy administration platform built over fifteen years. The monolith handled quotes, policy issuance, endorsements, billing interactions, documents, and claims handoffs. Everything was integrated through one relational database and a thicket of nightly jobs.
Leadership wanted microservices. Fair enough. But the first instinct was to carve the monolith into technical services around CRUD domains: customer, policy, product, payment, document. That would have produced lots of APIs and very little autonomy, because the actual business workflow for policy issuance crossed all of them in tightly coupled ways.
A better approach started with domain semantics.
Workshops revealed that what the business called “policy” actually mixed several bounded contexts:
- Quote: highly iterative, sales-oriented, rapidly changing.
- Policy Administration: contractual and regulated.
- Billing Agreement: financially governed with different lifecycle rules.
- Document Production: template-driven and operationally bursty.
That changed the migration plan completely.
The company extracted Quote first, not because it was the most central domain, but because it had the highest change rate, separate product-team ownership, and looser consistency needs than policy issuance. A new quote service was built with its own data store and APIs. Product rating rules were externalized. Kafka events published quote lifecycle milestones for downstream consumers.
The monolith remained the system of record for policy issuance. An anti-corruption layer translated quote outcomes into the legacy policy model. During coexistence, the quote service published events like QuotePriced, QuoteAccepted, and QuoteExpired. The monolith consumed these and continued downstream issuance processing.
This worked because the business decision “what quote did we offer?” could be separated from the contractual decision “what policy did we issue?” Similar words, different semantics.
Later, the company introduced a new policy issuance service for one product line only. Traffic routing in the facade sent travel insurance through the new path while commercial property remained on the monolith. Reconciliation compared premium calculations, document outputs, and billing setup between the old and new flows. Where mismatches occurred, operations teams had dashboards showing correlation IDs across Kafka topics, policy records, and billing messages.
The migration was not pretty. Event versions drifted. A batch process in the monolith still updated quote statuses in some edge cases, causing divergence. One downstream claims system assumed policy numbers were assigned synchronously, which broke when issuance became asynchronous for the new path.
But the program succeeded because it treated these as architecture concerns, not project noise. The teams had explicit ownership, semantic boundaries, and reconciliation controls. After three years, quote and several product-specific issuance flows were off the monolith, while high-risk regulated accounting remained centralized pending a different strategy.
That is incremental refactoring in the real world: uneven progress, selective extraction, disciplined semantics, and enough humility to leave some things alone until the enterprise is ready.
Operational Considerations
Distributed systems are easy to draw and expensive to run. Migration architectures are worse because they must run both the old and new worlds simultaneously.
Observability
You need end-to-end traceability across synchronous requests, asynchronous events, and legacy jobs. Correlation IDs are non-negotiable. So are structured logs, service metrics, lag metrics for Kafka consumers, discrepancy counters, and business-level dashboards.
Do not settle for “the topic is healthy.” You need to know whether accepted orders in one system become invoices in another within expected time bounds.
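The mechanics of correlation are unglamorous: generate an id at the entry point and carry it through every log line, event, and downstream call. A minimal sketch, with field names chosen for illustration:

```python
# Correlation-id propagation sketch: one id joins log lines across the
# monolith, Kafka consumers, and new services.
import json
import uuid

def new_correlation_id() -> str:
    return str(uuid.uuid4())

def log(event: str, correlation_id: str, **fields) -> str:
    """Emit a structured log line carrying the correlation id."""
    record = {"event": event, "correlation_id": correlation_id, **fields}
    return json.dumps(record, sort_keys=True)

cid = new_correlation_id()
line1 = log("order_accepted", cid, system="monolith", order_id="o-1")
line2 = log("invoice_created", cid, system="billing-service", order_id="o-1")
# Both lines share one correlation_id, so the business flow can be
# joined end to end in the log store.
```

With that id present everywhere, the question “did this accepted order become an invoice within the expected window?” becomes a query instead of an investigation.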
Idempotency and replay
During migration, duplicates happen. Retries happen. Replays happen. Consumers must be idempotent. Business operations must tolerate repeated delivery where feasible. If replaying a topic can issue duplicate refunds, your architecture is not mature enough for event-based coexistence.
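The standard defense is a processed-event ledger, sketched below with an in-memory set. In a real service the ledger would live in the service’s own database and be updated in the same transaction as the state change; the refund example is illustrative.

```python
# Idempotent consumer sketch: a processed-event ledger makes redelivery
# and full topic replay safe.
class RefundConsumer:
    def __init__(self) -> None:
        self.processed: set[str] = set()   # event ids already applied
        self.refunds_issued = 0

    def handle(self, event: dict) -> bool:
        """Apply an event at most once; return True if it was applied."""
        if event["event_id"] in self.processed:
            return False                   # duplicate or replay: ignore
        self.refunds_issued += 1           # the real side effect goes here
        self.processed.add(event["event_id"])
        return True
```

Replaying the topic against a consumer like this changes nothing, which is the property that makes replay a recovery tool rather than a financial incident.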
Data governance
As services take ownership of data, definitions and stewardship matter more, not less. Enterprises often assume that distributed ownership reduces governance needs. It increases them. Shared identifiers, retention rules, privacy controls, and audit obligations must be coordinated.
Reconciliation operations
Someone must own discrepancy management. Often this lands in a twilight zone between engineering, operations, and business support. That is a mistake. Reconciliation needs named owners, triage workflows, SLA targets, and tools for correction.
Release strategy
Because domains are entangled during migration, deployment independence is relative. Use feature toggles, traffic routing, consumer-driven contract tests, schema compatibility rules, and rollback playbooks. New services should be deployable independently, but migration milestones still need cross-team coordination.
Tradeoffs
Incremental refactoring is the least reckless path, not the cheapest or simplest one.
What you gain
- reduced migration risk,
- earlier business value,
- clearer domain ownership,
- selective modernization,
- and the ability to learn from production before committing fully.
What you pay
- prolonged coexistence,
- duplicate logic during transition,
- reconciliation complexity,
- distributed operational burden,
- and organizational fatigue if the program drags on.
The most painful tradeoff is often temporary duplication. For a period, the monolith and new services may both compute related outcomes, store overlapping data, or expose similar APIs. Purists dislike this. Enterprises need it. Duplication is often the price of validation and safe transfer of authority.
Another tradeoff is consistency. Extracting services usually means some interactions become asynchronous. That can improve resilience and decoupling, but it also introduces timing windows and correction workflows. If the business cannot tolerate those windows, the boundary may be wrong or the process may need a different design.
Failure Modes
Most migration programs do not fail because microservices are inherently bad. They fail because organizations underestimate semantic and operational complexity.
Here is the usual wreckage.
1. Distributed monolith
Teams extract services but retain shared databases, synchronized deployments, and chatty runtime calls. The result has all the network failure of microservices with none of the autonomy.
2. Wrong service boundaries
Boundaries are chosen by org chart, UI screens, or database tables rather than business capability. Services end up slicing straight through cohesive domain logic.
3. Event theater
Kafka is introduced, but events are either raw table-change streams or ambiguous state dumps. Consumers become tightly coupled to producer internals. The enterprise gains topics, not decoupling.
4. No reconciliation strategy
Leaders assume eventual consistency will “work itself out.” It won’t. Divergence accumulates until finance, support, or compliance discovers it first.
5. First extraction is too hard
A team tries to extract the most business-critical and transaction-heavy domain first. Momentum dies in integration complexity.
6. Legacy semantics leak everywhere
The new services use the old language, identifiers, lifecycle states, and schema assumptions. Technical packaging changes, domain confusion remains.
7. Migration never finishes
The coexistence phase becomes the permanent architecture. This happens when teams do not define retirement criteria for legacy slices and do not fund decommissioning work explicitly.
Keep a simple before-and-after picture in mind: the monolith on one side, the target service landscape on the other, and the coexistence state in between. That middle state is more complex than either endpoint. That is normal. Migration architecture is a bridge. The mistake is to build the bridge and then live on it forever.
When Not To Use
Incremental refactoring is powerful, but it is not universal.
Do not use it when the system is small enough to replace safely in one bounded effort. An app serving a 50-person enterprise with modest complexity does not need a seven-phase strangler program and Kafka in the middle.
Do not use it when the dominant problem is not architecture but product confusion. If the business cannot agree on core concepts, decomposing the software will harden confusion into contracts.
Do not use it when the organization cannot support operational ownership. Microservices are not just smaller codebases; they are independent run responsibilities.
Do not use it for domains that require tight, high-volume, low-latency transactions across multiple extracted boundaries unless you have a very strong reason and a very mature team. Sometimes a modular monolith is the better answer.
And do not use it when leadership wants the optics of modernization without funding the long coexistence period. Incremental migration without patience becomes permanent half-migration.
Sometimes the right move is to first reshape the monolith internally: modularize it, clarify domain boundaries, isolate data access, and create internal APIs. A monolith with good boundaries is often a better starting point than prematurely distributed services.
Related Patterns
Several patterns commonly travel with incremental refactoring architecture:
- Strangler Fig Pattern: replace behavior gradually by routing around old implementation.
- Anti-Corruption Layer: protect new models from legacy semantics.
- Bounded Contexts: identify where language and rules are cohesive.
- Context Mapping: define relationships and translations between domains.
- Transactional Outbox: publish reliable events from services without dual-write hazards.
- Saga / Process Manager: coordinate long-running workflows across services.
- CQRS: useful where read and write concerns evolve differently during migration.
- Change Data Capture: sometimes useful for bootstrapping read models, but dangerous if treated as a business event strategy.
- Modular Monolith: often the right stepping stone before full distribution.
The best architects do not choose these patterns by fashion. They choose them by force fit. Every pattern is a tool, and every tool can become a weapon in the wrong hands.
Summary
Incremental refactoring architecture is the grown-up way to migrate from a monolith to microservices in an enterprise. It accepts a simple truth: valuable systems cannot be replaced as if they were diagrams. They must be untangled while still running.
The heart of the method is not Kubernetes, Kafka, or any other platform choice. It is domain thinking. Know where the business boundaries are, where language changes meaning, where consistency matters, and where autonomy is worth its operational cost.
Use progressive strangler migration to extract one capability at a time. Put anti-corruption layers between old and new models. Use Kafka and events where decoupled synchronization genuinely helps. Design reconciliation as a first-class capability because coexistence creates divergence. Shift write ownership deliberately. Delete legacy paths when the new authority is proven.
Be honest about the tradeoffs. The middle state will be messier than the start. Some domains should stay put longer. Some should never become separate services. And some organizations should stop at a modular monolith because that is the architecture their context can actually support.
In enterprise architecture, courage is overrated. Disciplined patience wins more often.
The best migration is not the one with the cleanest target slide. It is the one that gets the business across the river without pretending the river isn’t there.
Frequently Asked Questions
What is a service mesh?
A service mesh is an infrastructure layer managing service-to-service communication. It provides mutual TLS, load balancing, circuit breaking, retries, and observability without each service implementing these capabilities. Istio and Linkerd are common implementations.
How do you document microservices architecture for governance?
Use ArchiMate Application Cooperation diagrams for the service landscape, UML Component diagrams for internal structure, UML Sequence diagrams for key flows, and UML Deployment diagrams for Kubernetes topology. All views can coexist in Sparx EA with full traceability.
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.