There is a particular kind of fear that shows up in enterprise modernization programs. It is not the fear of building something new. Teams do that every quarter. It is the fear of replacing something old that still makes money every day.
That old order management platform may be ugly. It may require ritual sacrifice to deploy. It may encode business rules in places no one admits to understanding. But it clears trades, ships products, prices policies, settles invoices, and closes the month. It is not merely software. It is institutional memory wearing a bad user interface.
This is why parallel run matters.
A parallel run architecture is what serious organizations reach for when “big bang” is too dangerous and simple coexistence is too weak. You run the legacy and the new world side by side, feed them the same business events, compare behavior, reconcile outcomes, and only then shift authority. It is not elegant in the way a greenfield system is elegant. It is more like building a second engine while the plane is flying.
And that is exactly why it deserves respect.
In microservices migration, parallel run is not just a deployment tactic. It is an architecture decision that shapes domain boundaries, event design, operational controls, data ownership, and the politics of trust. If you get it right, you buy learning without betting the firm. If you get it wrong, you create a costly duplicate reality that no one can explain at audit time.
This article looks at parallel run architecture through the lens of enterprise microservices migration, with domain-driven design, Kafka-style event flows, progressive strangler migration, reconciliation, and the hard tradeoffs that architects usually discover too late.
Context
Most large enterprises are not migrating from a tidy monolith to a clean set of services. They are migrating from accumulated business behavior. The code is only the visible part.
A bank may have one “payments platform,” but in domain terms it often contains several bounded contexts smashed together: payment initiation, sanction screening, settlement, fees, treasury visibility, dispute handling, ledger posting, and reporting. A retailer’s order platform may mix pricing, availability, fulfillment orchestration, customer communication, and invoicing. A manufacturer’s ERP customization may quietly own planning semantics that no one has named.
That matters because parallel run does not compare systems at the level of technology. It compares them at the level of business meaning.
If the old platform thinks an “order accepted” status means “credit check passed but inventory still provisional,” while the new microservice interprets it as “ready for allocation,” you do not have a technical mismatch. You have a domain semantics problem. Running both systems in parallel will expose it brutally.
This is why the best parallel run architectures start with domain language. Before deciding on Kafka topics, dual writes, shadow traffic, or reconciliation jobs, you need to know what business facts are being asserted, which bounded context owns them, and what equivalence means across old and new models.
Too many migration programs skip this and jump straight to plumbing. Then they discover, six months in, that both systems are “correct” according to different business assumptions. That is not migration. That is organized confusion.
Problem
A legacy platform is business-critical, but it has become a bottleneck. Delivery is slow. Change failure rate is high. Scaling is expensive. Teams are tangled. Every enhancement requires cross-module coordination and tribal knowledge.
The target state is typically a set of microservices aligned to business capabilities, often backed by event-driven integration. Kafka enters the picture because enterprises need durable event streams, replay, decoupling, auditability, and integration between old and new estates. It is a good fit for the migration path precisely because migration is less about APIs than about business events moving through time.
But the central problem remains stubborn:
How do you replace a system that cannot be wrong, with a system that is not yet trusted?
A simple cutover sounds attractive on slide decks. In reality it is often reckless. The business impact of subtle defects is too high. Pricing edge cases, tax calculation drift, duplicate settlements, broken entitlements, missing notifications, and timing differences in asynchronous flows can all hurt customers before technical monitoring notices anything.
Parallel run answers this by delaying final authority. The new platform processes the same business inputs as the old one. Outputs are compared. Differences are analyzed. Confidence grows through evidence, not optimism.
Still, this introduces a second problem: running two systems at once creates duplication, cost, and semantic drift if you do not control it. Architects need a way to make parallel run intentional rather than accidental.
Forces
Several forces pull in different directions here.
Safety versus speed
The organization wants rapid migration but cannot tolerate business disruption. Parallel run reduces cutover risk, but it slows the path to simplification. You are carrying two worlds, which means two operational surfaces, two data interpretations, and often two support paths.
Domain fidelity versus technical isolation
Microservices migration encourages decomposition around bounded contexts. But the legacy system usually bundles behavior in ways that do not map neatly. If you decompose too early, you may lose behavioral fidelity. If you keep too much of the old model, the new services inherit monolithic semantics.
Event-driven elegance versus legacy reality
Kafka and event-driven architecture are excellent for replayable migration pipelines and asynchronous propagation. But many legacy systems are not naturally event-native. They rely on batch updates, hidden side effects, synchronous validations, and mutable records. Extracting trustworthy domain events can be harder than building the consumers.
Validation depth versus operational complexity
You can validate at several levels: API response comparison, event comparison, persisted state comparison, financial reconciliation, customer-visible output comparison. The more levels you validate, the more confidence you gain. You also increase complexity, latency, storage, observability burden, and support overhead.
Independent ownership versus business consistency
Teams want service autonomy. The enterprise wants one version of financial truth. During parallel run, autonomy must often yield to coordinated governance. Event schemas, comparison rules, reconciliation tolerances, idempotency, and cutover criteria become enterprise concerns.
Cost versus evidence
Parallel run is expensive. Additional infrastructure, duplicated compute, reconciliation processes, data retention, support staffing, and migration tooling all cost real money. But evidence is what buys trust. In regulated sectors, evidence is often more valuable than speed.
Solution
At its heart, parallel run architecture means this:
- Capture the same business intent once.
- Route that intent to both legacy and new processing paths.
- Observe outputs and state transitions independently.
- Reconcile differences at the business level.
- Shift authority gradually from legacy to new services.
This sounds simple. It isn’t. The quality of the architecture depends on where you branch, what you compare, and who is authoritative at each stage.
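The core loop can be sketched in a few lines. This is an illustrative Python sketch, not a real implementation: the handler bodies, field names, and the pricing multiplier are all stand-ins for whatever the legacy adapter and new service actually do.

```python
# Stand-ins for the two processing paths. The 1.2 multiplier is an
# arbitrary illustrative business rule, not from any real system.

def legacy_path(intent):
    # Legacy adapter: applies the old platform's rule to the intent.
    return {"order_id": intent["order_id"], "total": round(intent["amount"] * 1.2, 2)}

def new_path(intent):
    # New microservice: same business rule, new code.
    return {"order_id": intent["order_id"], "total": round(intent["amount"] * 1.2, 2)}

def parallel_run(intent, authoritative="legacy"):
    """Route one business intent to both paths; only one output drives the business."""
    old = legacy_path(intent)
    new = new_path(intent)
    divergence = None if old == new else {"legacy": old, "new": new}
    result = old if authoritative == "legacy" else new
    return result, divergence

result, divergence = parallel_run({"order_id": "ORD-1", "amount": 100.0})
```

The important structural point is that authority is an explicit parameter: both paths always run, but only one output is allowed to matter.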
A good parallel run architecture usually has these characteristics:
- A canonical business event stream that represents intent in domain language.
- Explicit bounded contexts so comparison happens where semantics are coherent.
- Separate authority from participation: the new service may process transactions before it becomes the source of record.
- A reconciliation capability that can compare outcomes, classify divergence, and support remediation.
- A progressive strangler migration path so responsibility shifts capability by capability, not all at once.
- Observability and auditability built in from day one.
The most important idea is this: do not run systems in parallel just to compare bytes. Compare meaning.
If the old platform emits “invoice total = 103.42” and the new one emits “invoice total = 103.41,” that difference is just noise until you understand whether rounding policy, tax jurisdiction timing, discount precedence, or FX rate source caused it. Reconciliation must be domain-aware. Otherwise teams drown in false mismatches and lose confidence in the whole exercise.
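A domain-aware classifier for a difference like that might look like the following sketch. The tolerance values and field names are assumptions; in practice each tolerance is a signed-off business decision, not a technical default.

```python
from decimal import Decimal

# Hypothetical per-field tolerances. A 0.01 tolerance on invoice totals
# might encode a documented rounding-policy gap between the two systems.
TOLERANCES = {"invoice_total": Decimal("0.01")}

def classify_delta(field, legacy_value, new_value):
    """Classify a numeric difference by business meaning, not raw bytes."""
    delta = abs(Decimal(str(legacy_value)) - Decimal(str(new_value)))
    if delta == 0:
        return "match"
    if delta <= TOLERANCES.get(field, Decimal("0")):
        return "acceptable_variance"   # known, documented cause
    return "business_mismatch"         # needs investigation

classify_delta("invoice_total", "103.42", "103.41")  # 'acceptable_variance'
```

Without a rule like this, the 103.42 versus 103.41 case generates an alert every time, and teams learn to ignore the alerts.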
Architecture
The architecture generally revolves around three streams: command or intent, processing outcomes, and reconciliation.
A common pattern is to use Kafka as the event backbone. An ingress channel captures business intent as events—say, OrderPlaced, PaymentInitiated, or ClaimSubmitted. Each event is then consumed by both the legacy adaptation layer and the new microservices. Each path produces outcomes and state changes. A reconciliation service compares them, flags acceptable deltas, and escalates true mismatches.
That three-stream shape hides the hard part, which is semantic alignment. The ingress layer should not merely repackage HTTP requests into Kafka messages. It should emit domain events with enough structure to preserve intent across both worlds. If the event is underspecified, each side will enrich or interpret it differently, and reconciliation becomes guesswork.
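As a sketch of what “enough structure” might mean, here is a hypothetical canonical event. Every field name is an assumption for illustration; the point is that intent is explicit rather than inferred by each consumer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid

# An illustrative canonical domain event. It asserts a business fact
# (an order was placed, with these lines, in this currency) so neither
# the legacy adapter nor the new service has to guess at meaning.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str
    currency: str
    lines: tuple            # ((sku, quantity, unit_price), ...)
    placed_at: str          # ISO-8601 UTC timestamp of the business fact
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    schema_version: str = "1.0"

event = OrderPlaced(
    order_id="ORD-1", customer_id="CUST-9", currency="EUR",
    lines=(("SKU-1", 2, "19.99"),),
    placed_at=datetime.now(timezone.utc).isoformat(),
)
```

The `event_id` and `schema_version` fields matter later: the first enables idempotency and correlation, the second makes governed evolution possible.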
Domain-driven design thinking
This is where domain-driven design earns its keep.
Parallel run only works well when the migration is aligned to bounded contexts. If your new architecture defines a PricingService, InventoryService, and OrderService, but the legacy platform computes availability during pricing and applies promotional logic during fulfillment release, you cannot just map modules one-to-one. You need to identify the true business capabilities and the invariants they protect.
A bounded context gives you a semantic frame for comparison. For example:
- In the Pricing context, equivalence may mean identical gross amount, discount reason, tax basis, and promotion set.
- In the Order Management context, equivalence may mean the same lifecycle decision and reservation outcome, even if internal steps differ.
- In the Payments context, equivalence may mean the same authorization result and ledger impact, not necessarily the same intermediate event sequence.
This is critical. Parallel run should compare outcomes at the right boundary, not enforce identical implementation behavior. Legacy and new systems often take different routes to the same business result. Demanding identical internal states is a category error.
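Those equivalence rules can be made executable per bounded context. The predicates below are hypothetical sketches using the examples above; field names are assumptions. Note what the payments rule deliberately ignores.

```python
# Hypothetical per-context equivalence: compare outcomes at the boundary,
# not internal steps. Field names are illustrative.

def pricing_equivalent(old, new):
    # Pricing context: gross amount, tax basis, and promotion set must agree.
    return (old["gross"] == new["gross"]
            and old["tax_basis"] == new["tax_basis"]
            and set(old["promotions"]) == set(new["promotions"]))

def payments_equivalent(old, new):
    # Payments context: same authorization result and ledger impact.
    # The intermediate step count is deliberately not compared.
    return (old["auth_result"] == new["auth_result"]
            and old["ledger_impact"] == new["ledger_impact"])

payments_equivalent(
    {"auth_result": "APPROVED", "ledger_impact": "-45.00", "steps": 3},
    {"auth_result": "APPROVED", "ledger_impact": "-45.00", "steps": 7},
)  # True: different internal routes, same business outcome
```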
Shadow, validate, then own
In practice, architectures tend to move through stages:
- Shadow mode: new services process live inputs but their outputs are non-authoritative.
- Validated mode: new outputs are compared routinely and approved by reconciliation.
- Selective authority: specific capabilities or cohorts are served by new services.
- Full ownership: legacy path is bypassed or retained only as fallback.
That progression is the architecture, not just the rollout plan.
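A minimal sketch of that progression as an explicit authority model, with stage names matching the list above. The routing rule is illustrative; real cutover criteria are richer than a boolean.

```python
from enum import Enum

class Stage(Enum):
    SHADOW = "shadow"          # new path runs; outputs are non-authoritative
    VALIDATED = "validated"    # outputs routinely compared and signed off
    SELECTIVE = "selective"    # new path authoritative for some cohorts
    FULL = "full"              # legacy retained only as fallback

def authoritative_path(stage, in_migrated_cohort=False):
    """Which path's output is allowed to drive the business."""
    if stage in (Stage.SHADOW, Stage.VALIDATED):
        return "legacy"
    if stage is Stage.SELECTIVE:
        return "new" if in_migrated_cohort else "legacy"
    return "new"

authoritative_path(Stage.SELECTIVE, in_migrated_cohort=True)  # 'new'
```

Making the stage an explicit, queryable value is what prevents the "no explicit authority model" failure mode discussed later.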
Data architecture and state comparison
There are three broad ways to compare old and new:
- Event comparison: compare emitted business events.
- State comparison: compare persisted business state after processing.
- Outcome comparison: compare customer-visible or financially material results.
For enterprise use, outcome comparison is usually the anchor. Event and state comparison are useful diagnostics, but they are implementation-dependent. What matters to the business is whether the customer got the same quote, the same shipment promise, the same payment disposition, the same invoice, the same ledger posting.
Still, state comparison becomes essential for migration confidence where downstream processes are sensitive to hidden fields and timing. You often need a canonical comparison model that normalizes both representations before matching.
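A sketch of such a canonical comparison model, assuming a legacy record that stores cents and numeric status codes and a new record that stores decimal strings and named states. All field names and codes are hypothetical.

```python
from decimal import Decimal

# Illustrative mapping from legacy numeric status codes to named states.
LEGACY_STATUS = {"30": "RESERVED", "40": "ALLOCATED"}

def normalize_legacy(rec):
    """Translate a legacy record into the canonical comparison model."""
    return {"order_id": rec["ORDNO"].strip(),
            "total": Decimal(rec["AMT_CENTS"]) / 100,
            "state": LEGACY_STATUS[rec["STS"]]}

def normalize_new(rec):
    """Translate a new-service record into the same canonical model."""
    return {"order_id": rec["orderId"],
            "total": Decimal(rec["total"]),
            "state": rec["state"]}

old = normalize_legacy({"ORDNO": "ORD-1  ", "AMT_CENTS": "10342", "STS": "30"})
new = normalize_new({"orderId": "ORD-1", "total": "103.42", "state": "RESERVED"})
old == new  # True once both sides are in the canonical model
```

Comparing the raw records would have flagged every transaction as a mismatch; comparing the normalized views flags only real divergence.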
Reconciliation as a first-class component
Reconciliation is not a report. It is part of the architecture.
A proper reconciliation capability should:
- correlate transactions across old and new paths
- support matching windows for asynchronous timing differences
- normalize data into comparable domain views
- classify mismatches by business severity
- tolerate known acceptable variance
- create remediation workflows
- maintain an audit trail
If you leave reconciliation to ad hoc SQL scripts and spreadsheets, your migration will turn into archaeology.
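A toy version of the correlation bookkeeping might look like this. It holds the first-arriving side until its counterpart appears; durable storage, matching windows, severity classification, and remediation are deliberately omitted.

```python
# Toy reconciler: correlate outcomes from the two paths by a shared
# transaction id. The equivalence function is injected because it is
# domain-specific (see the per-context predicates earlier).

class Reconciler:
    def __init__(self, equivalent):
        self.equivalent = equivalent
        self.pending = {}          # txn_id -> (side, record)

    def record(self, txn_id, side, payload):
        other_side, other = self.pending.pop(txn_id, (None, None))
        if other is None:
            self.pending[txn_id] = (side, payload)   # wait for the other path
            return ("pending", None)
        pair = {side: payload, other_side: other}
        status = "match" if self.equivalent(pair["legacy"], pair["new"]) else "mismatch"
        return (status, pair)

rec = Reconciler(equivalent=lambda a, b: a == b)
rec.record("TXN-1", "legacy", {"total": "103.42"})           # ('pending', None)
status, pair = rec.record("TXN-1", "new", {"total": "103.42"})  # status: 'match'
```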
Migration Strategy
A sensible migration strategy uses progressive strangler migration, not broad replacement. The key is to strangle by domain capability, while using parallel run to validate each transfer of responsibility.
The sequence often looks like this:
1. Establish a stable ingress
Create a command or event ingress point that captures business intent independently of the legacy internal model. This might be a new API layer, channel integration layer, or event gateway. Its purpose is to decouple incoming demand from the legacy system’s shape.
Without this, every migration step remains hostage to legacy interfaces.
2. Define canonical domain events
Do the hard semantic work. What exactly is an order submitted? When is a payment authorized? What fields represent intent versus execution? Which context owns customer credit status? These definitions should be versioned and governed, but not bureaucratized to death.
Events should reflect business facts, not database deltas.
3. Introduce a legacy adapter
Instead of forcing the old platform to become event-native, build an adapter that translates canonical events into legacy-compatible commands or transactions and emits outcome events from legacy responses. This isolates legacy quirks and gives you a better cut point.
4. Build one bounded context at a time
Do not rebuild the whole monolith under a microservices label. Pick a bounded context where semantics are reasonably coherent and business value is visible. Pricing, customer communications, document generation, or inventory promise are often viable early candidates. General ledger or highly coupled settlement engines are usually not.
5. Run in shadow mode
Let the new service process the same demand. Publish its outcomes, but do not let them drive the business yet. Build confidence in behavior, performance, and operability.
6. Reconcile and learn
This is where migration gets real. Mismatches should lead to one of several outcomes:
- fix a defect in the new service
- identify a hidden legacy rule
- refine event semantics
- classify acceptable variance
- decide the domain boundary was wrong
Parallel run is a discovery machine if you let it be.
7. Shift authority progressively
Once mismatch rates are low and failure handling is solid, route a subset of traffic or a subset of capability decisions to the new service. This can be by customer cohort, geography, product line, channel, or workflow step.
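Cohort routing can be sketched as deterministic hashing, so a given customer always lands on the same path for a given capability regardless of which instance handles the request. The hashing scheme here is an illustrative choice, not a prescription.

```python
import hashlib

def routed_to_new(customer_id, capability, rollout_pct):
    """Deterministically assign a customer to the new path for one capability.

    rollout_pct is the percentage of the customer base served by the
    new service for this capability (0 = none, 100 = all).
    """
    digest = hashlib.sha256(f"{capability}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct

# The same customer always gets the same answer for a given capability,
# so their experience does not flip between paths mid-journey.
routed_to_new("CUST-9", "pricing", 10) == routed_to_new("CUST-9", "pricing", 10)  # True
```

Including the capability in the hash keeps rollouts independent: a customer can be on the new pricing path while still on legacy payments.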
8. Retire the old capability, not just the code path
A service is not truly migrated when requests stop flowing. It is migrated when old reports, support processes, operational runbooks, controls, and dependent jobs are also retired or redirected.
That is where many “migrations” quietly stall.
Enterprise Example
Consider a large insurer modernizing claims processing.
The legacy platform was a 20-year-old suite running first notice of loss (FNOL) intake, coverage checks, reserve calculations, fraud screening triggers, payment approvals, and downstream ledger postings. It had dozens of integrations and nightly batch jobs that fed finance and regulatory reporting. The business wanted faster product changes, more automation, and clearer service ownership.
A big-bang rewrite would have been irresponsible. Claims are not a place for architectural bravado.
The insurer chose a progressive strangler migration with parallel run, centered on Kafka.
Domain decomposition
Through event storming and domain analysis, they identified several bounded contexts:
- Claims Intake
- Coverage Validation
- Reserve Estimation
- Fraud Assessment
- Payment Decisioning
- Financial Posting
Crucially, they discovered the legacy “claim status” field was overloaded beyond belief. In one workflow it represented triage completion. In another it implied adjuster assignment. In finance it was used as a proxy for payment eligibility. Treating that field as a canonical business state would have poisoned the migration.
So they replaced it with explicit domain events such as ClaimRegistered, CoverageConfirmed, ReserveCalculated, FraudScoreAssigned, and PaymentApproved.
That was the turning point. Once they stopped pretending the legacy status code was meaningful across contexts, the migration made sense.
Parallel run implementation
A new ingress layer accepted claim events from channels and published them to Kafka. A legacy adapter translated those into transactions against the old suite. New microservices for Claims Intake and Fraud Assessment consumed the same events in shadow mode.
Outcomes from both paths were normalized into a comparison model. The reconciliation service matched by claim ID and processing window, then categorized differences:
- harmless timing differences
- expected model improvements
- genuine business mismatches
- upstream data quality issues
The fraud service produced more granular risk reasons than the legacy engine. That was fine. Reconciliation was configured to compare the decision class and threshold alignment, not text explanations.
Later, Payment Decisioning was introduced. This was harder because financial consequences were real. They used dual processing but kept the legacy suite as the source of record. New payment decisions were compared not just for approval outcome but for reserve impact, payment amount, and ledger entry composition.
Only after three month-end closes with acceptable reconciliation did they allow the new service to authorize low-risk claim payments for one product line.
What they learned
First, domain semantics beat technical symmetry. The migration accelerated only after they made business meaning explicit.
Second, reconciliation volume was much higher than expected. For every true defect, there were many differences caused by timing, rounding, enrichment sources, and implicit legacy defaults. The team had to invest heavily in diff classification.
Third, the old platform’s side effects were more dangerous than its core logic. Batch jobs that generated letters, updated reserves, and posted ledger entries caused more migration trouble than claim registration itself. This is common. The visible transaction is not the whole system.
In the end, the insurer retired intake and fraud from the legacy suite, reduced release coordination, and improved auditability because every major domain event was on Kafka with replay support. But they did not migrate everything. Financial posting remained partially centralized for longer because accounting control demanded a more conservative path.
That was the right call. Good architecture is not maximalist. It knows where to stop.
Operational Considerations
Parallel run architecture lives or dies in operations.
Correlation and traceability
Every transaction needs a stable correlation identifier across both paths. Not a best-effort one. A guaranteed one. Without it, reconciliation becomes probabilistic and support teams lose hours stitching stories together.
Idempotency
Duplicate event delivery is not a corner case in event-driven systems. It is normal weather. Both legacy adapters and new services must be idempotent or protected with deduplication controls. Otherwise parallel run becomes a duplicate-payment factory.
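A minimal idempotent consumer sketch, assuming each canonical event carries a unique event_id. A production version would use a durable deduplication store rather than an in-memory set, but the shape is the same.

```python
class IdempotentConsumer:
    """Process each event at most once, keyed by its event_id."""

    def __init__(self):
        self.seen = set()          # stand-in for a durable dedup store
        self.side_effects = []     # stand-in for payments, letters, postings

    def handle(self, event):
        if event["event_id"] in self.seen:
            return "duplicate_ignored"
        self.seen.add(event["event_id"])
        self.side_effects.append(event)
        return "processed"

c = IdempotentConsumer()
c.handle({"event_id": "E1", "amount": "45.00"})  # 'processed'
c.handle({"event_id": "E1", "amount": "45.00"})  # 'duplicate_ignored'
len(c.side_effects)  # 1: redelivery caused no second side effect
```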
Time windows and eventual consistency
Legacy and new systems will often complete at different times. Reconciliation must account for processing windows, retries, and delayed downstream effects. Comparing too early creates false alarms; comparing too late hides defects.
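The windowing logic can be sketched as a small decision function. The 15-minute window is an arbitrary illustrative value; a real window is sized from the slowest legitimate path, including batch and retry delays.

```python
from datetime import datetime, timedelta, timezone

# Illustrative matching window: do not compare before the window closes
# (avoids false alarms), and flag records still unmatched after it.
WINDOW = timedelta(minutes=15)

def reconciliation_action(first_seen, both_sides_present, now=None):
    now = now or datetime.now(timezone.utc)
    if both_sides_present:
        return "compare"
    if now - first_seen < WINDOW:
        return "wait"              # the slower path may still complete
    return "escalate_missing"      # one path never produced an outcome

t0 = datetime.now(timezone.utc)
reconciliation_action(t0, both_sides_present=False, now=t0 + timedelta(minutes=5))   # 'wait'
reconciliation_action(t0, both_sides_present=False, now=t0 + timedelta(minutes=20))  # 'escalate_missing'
```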
Observability
You need business observability, not just CPU graphs. Track:
- match rate by bounded context
- divergence rate by type
- reconciliation latency
- authoritative-path success rate
- replay volume
- fallback usage
- customer-visible incident correlation
Replay and recovery
Kafka helps because event replay can reconstruct processing and support backfills. But replay must be carefully governed. Replaying into side-effecting systems without safeguards can trigger duplicate notifications, duplicate postings, or re-opened workflows.
Control and audit
In regulated enterprises, auditors will ask who was authoritative when, what evidence justified each cutover step, how mismatches were handled, and whether financial reports remained consistent. Parallel run should make this easier, not harder. Keep versioned comparison rules, signed-off tolerances, and explicit cutover decisions.
Tradeoffs
Parallel run is powerful, but let’s not romanticize it.
What you gain
- lower cutover risk
- empirical validation
- visibility into hidden legacy behavior
- safer incremental strangler migration
- stronger audit trail
- ability to test new services on live demand
What you pay
- duplicate infrastructure and operations
- more complex event and data models
- higher support burden
- longer migration timelines
- reconciliation engineering overhead
- organizational fatigue
The biggest hidden tradeoff is cognitive load. Teams are not just building a new system. They are interpreting two systems at once. This can exhaust delivery teams if the program runs too long.
A parallel run should therefore be designed with an exit in mind. It is a bridge, not a lifestyle.
Failure Modes
This pattern fails in recognizable ways.
1. Comparing the wrong thing
Teams compare raw payloads or database records instead of business outcomes. The result is endless noise and little confidence.
2. Weak domain model
Canonical events are too generic, too technical, or too close to the legacy schema. New services then encode old ambiguity in distributed form.
3. Unmanaged side effects
Emails, payments, ledger postings, and document generation are accidentally triggered from both systems. Nothing destroys trust faster than duplicate customer-visible actions.
4. Reconciliation as an afterthought
If reconciliation is bolted on late, teams cannot explain mismatches or prove correctness. Migration stalls under a cloud of uncertainty.
5. No explicit authority model
If it is unclear which system is authoritative for which decision at which time, downstream consumers build their own assumptions. Then cutover becomes political instead of technical.
6. Parallel forever
The organization gets comfortable with “temporary” duplication. Costs rise, ownership blurs, and no one can decommission the old estate because too many controls still depend on it.
7. Event stream inconsistency
Kafka is useful, but not magical. Poor partitioning strategy, missing ordering assumptions, schema drift, or weak exactly-once expectations can undermine confidence quickly.
A hard truth: if the enterprise lacks operational discipline, parallel run can amplify chaos rather than contain it.
When Not To Use
Parallel run is not always the right move.
Do not use it when:
- the domain is low risk and easy to roll back
- the legacy system cannot be safely exercised in parallel
- side effects are too expensive or dangerous to duplicate, and cannot be isolated
- the migration scope is small enough for a simpler strangler or branch-by-abstraction approach
- event capture is unreliable and reconciliation would be untrustworthy
- the organization lacks the funding or stamina to operate two worlds for long enough
Also, avoid it when the target architecture is still conceptually unstable. Running in parallel does not compensate for poor service design. If your bounded contexts are speculative and your domain language is mushy, parallel run only gives you faster evidence that you are confused.
Sometimes a simpler path is better: carve out a distinct capability, route traffic to the new service, and accept a contained cutover risk. Architecture should fit the economics of the problem.
Related Patterns
Parallel run often appears with several adjacent patterns.
Strangler Fig Pattern
This is the natural companion. The old system is incrementally surrounded and displaced by new capabilities. Parallel run provides the validation mechanism inside the strangler journey.
Anti-Corruption Layer
Essential when the legacy model is semantically toxic. The ACL prevents old concepts from leaking into new bounded contexts.
Event Sourcing and Event-Carried State Transfer
Useful in some migration scenarios, especially when replay and auditability matter. But do not force event sourcing just because Kafka is present. Plenty of effective parallel run architectures use ordinary stateful services with event integration.
Saga / Process Manager
Relevant where business workflows span multiple services and legacy interactions. During migration, a saga can coordinate new steps while legacy retains authority for others.
Branch by Abstraction
Helpful inside codebases and adapters, especially when you need to swap implementations behind stable interfaces.
The key is not to collect patterns like stamps. Use them to support a coherent migration narrative.
Summary
Parallel run architecture is what enterprises use when replacement must be earned, not announced.
It works because it treats migration as a problem of business truth rather than code motion. The old and new systems process the same intent. Their outcomes are reconciled. Authority shifts only when evidence says it should. In microservices migration, especially with Kafka and event-driven integration, this gives organizations a practical path from monolithic dependency to service-oriented ownership.
But it is not cheap, and it is not clean. It demands bounded contexts, explicit domain semantics, serious reconciliation, careful side-effect control, and operational rigor. The progressive strangler migration matters because you cannot replace institutional memory in one move. You have to expose it, name it, test it, and then retire it.
That is the real lesson.
The hardest part of migration is not building the new engine. It is learning what the old engine was really doing all along.
And parallel run, done properly, is how you learn without crashing the plane.