Most legacy estates do not fail because they cannot move data. They fail because they move yesterday’s meaning with tomorrow’s speed.
That is the trap at the center of many change data capture programs. A team adds CDC, streams database mutations into Kafka, fans them out to microservices, and declares victory. The dashboards light up. Events flow. Latency drops from hours to seconds. And yet the architecture gets worse. Why? Because the pipeline is not merely transporting facts. It is embalming old assumptions—table design, status codes, lifecycle quirks, and accidental coupling—and broadcasting them farther than the old application ever could.
This is where enterprise architecture earns its keep. A CDC pipeline is not just an integration mechanism. It is a topology for change propagation. It determines how meaning escapes one bounded context and contaminates another, or, if designed well, how meaning is carefully translated at the boundary. The difference is existential.
I am opinionated about this: raw CDC is one of the fastest ways to industrialize legacy semantics. It is useful, often necessary, and frequently dangerous. If you treat database changes as business events, you are not modernizing. You are giving the old system a louder microphone.
The right question is not “Can we publish table changes?” It is “What semantics are we choosing to preserve, what semantics are we willing to reinterpret, and where do we draw the line between source truth and downstream autonomy?” That is a domain-driven design question before it is a platform question.
This article looks at CDC pipelines through that lens: not as plumbing, but as architecture. We will look at the forces, the topology choices, migration strategy, reconciliation, operational realities, and where this pattern breaks down. We will also look at an enterprise example, because theory is cheap and production scars are more persuasive.
Context
Large enterprises inherit a certain shape of system. The core operational logic lives in a monolith or a handful of tightly coupled systems. Data is persisted in relational databases designed over years of changing regulation, acquisitions, product expansions, and local workarounds. Tables become both storage and interface. Reporting jobs scrape them. Integration platforms poll them. Batch exports calcify around them.
Then comes the modernization agenda.
The business wants digital channels, real-time decisions, partner APIs, event-driven workflows, and independently deployable services. The old stack cannot deliver quickly enough. Teams reach for CDC because it seems to offer a politically acceptable bridge: no invasive changes to the legacy app, near real-time propagation, and a path to Kafka-based integration without waiting for the monolith team to produce proper domain events.
That logic is not wrong. In many enterprises, CDC is the only feasible extraction seam. If the core system is vendor-packaged, under change freeze, or too fragile to modify, reading the transaction log is often safer than touching the application. Debezium, GoldenGate, SQL Server CDC, and similar tools have become standard weapons for that reason.
But once the stream exists, it starts doing more than expected. The stream becomes input to notification services, customer profile services, fraud rules, search indexes, analytics products, and machine learning features. A simple replication feed turns into a de facto enterprise event backbone.
That moment matters. Because the topology you choose now will either support a strangler migration or sabotage it.
Problem
The central problem is simple to state and hard to solve:
CDC captures data mutations, not domain intent.
A row updated from status = 2 to status = 5 may correspond to “policy bound,” “claim approved,” “customer suspended,” or “shipment backordered,” depending on the table and the tribal knowledge around it. Sometimes the meaning is even worse: the update reflects an implementation artifact, such as denormalized totals being recomputed, audit columns changing, or a batch repair script correcting historical rows. CDC sees all of this as change. Downstream consumers often mistake it for business signal.
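To make the problem concrete, here is a minimal sketch of the interpretation work a consumer is otherwise forced to do. The status codes, column names, and the set of "technical noise" columns are all hypothetical, standing in for tribal knowledge:

```python
# Hypothetical mapping from a legacy numeric status column to domain meaning.
# The codes and names are illustrative, not taken from any real schema.
LEGACY_STATUS_MEANING = {
    2: "policy_quoted",
    5: "policy_bound",
    7: "policy_cancelled",
}

# Columns whose changes are implementation artifacts, not business signal.
TECHNICAL_COLUMNS = {"updated_at", "audit_user", "denormalized_total"}

def interpret_change(before: dict, after: dict):
    """Return a domain-level transition, or None for technical noise."""
    changed = {k for k in after if before.get(k) != after.get(k)}
    if changed <= TECHNICAL_COLUMNS:
        return None  # e.g. a batch repair script touched only audit columns
    old, new = before.get("status"), after.get("status")
    if old == new:
        return None  # some other column changed; no lifecycle transition
    return {
        "from": LEGACY_STATUS_MEANING.get(old, f"unknown({old})"),
        "to": LEGACY_STATUS_MEANING.get(new, f"unknown({new})"),
    }
```

Every downstream team that consumes raw CDC ends up writing some version of this function, each with its own bugs.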
This confusion creates three architectural pathologies.
First, semantic leakage. The legacy database schema becomes the integration contract. Consumers learn table names, join logic, magic constants, and ordering quirks. Every downstream team now depends on the old model.
Second, topology amplification. The old application may have hidden inconsistencies behind synchronous transaction boundaries and user interface flows. CDC broadcasts intermediate states, technical retries, and reordering across a distributed landscape. What was once a local implementation detail becomes an enterprise coordination problem.
Third, migration paralysis. Once dozens of services consume the legacy change stream directly, replacing the source system becomes harder, not easier. The migration has widened the blast radius of the incumbent model. The company thinks it is strangling the monolith; in reality, it is cloning it in public.
A lot of failed modernization programs are just this pattern wearing a cloud badge.
Forces
This is not a morality play where CDC is bad and domain events are good. Real architecture lives in tradeoffs. Several forces pull in different directions.
Need for speed
The business wants outcomes now. CDC gives near real-time integration without requiring changes to source applications. That matters when regulatory deadlines, merger integrations, or digital launches are on the calendar.
Legacy constraints
The source system may be unmodifiable, vendor-managed, or simply too risky to change. Reading logs is less invasive than embedding new publish logic inside old code.
Domain ambiguity
Legacy schemas are often poor expressions of the domain. They represent persistence optimization, historical compromise, and reporting convenience as much as business concepts. Yet those tables may still be the only comprehensive source of truth.
Consumer diversity
Some consumers need raw data replication. Others need business-level events. Search indexes, data lakes, and operational services have very different semantic needs. One pipeline rarely serves all well.
Reliability expectations
CDC promises durability and replay. Kafka extends that promise into retention, fan-out, and back-pressure handling. But reliability at the transport layer does not magically create correctness at the business layer.
Organizational boundaries
Domain-driven design teaches us to respect bounded contexts. Enterprises usually do the opposite under time pressure. A central platform team publishes generic change topics; downstream teams self-serve. It looks scalable, but often bypasses context mapping and language translation.
Migration ambition
If the goal is progressive strangler migration, the topology should reduce dependence on the legacy model over time. If the topology instead encourages direct coupling to source tables, each new consumer digs the trench deeper.
These forces are why CDC is attractive and dangerous in equal measure.
Solution
The sensible solution is not “avoid CDC.” It is contain CDC inside an anti-corruption layer and publish translated change semantics outward.
In practice, that means separating three concerns:
- Capture legacy mutations reliably.
- Interpret those mutations using domain-aware translation rules.
- Propagate context-appropriate events or state changes to downstream consumers.
That middle step is where many architectures cut corners. They should not.
A CDC topic with table-level payloads is a useful internal substrate. It is not, by itself, a contract for the enterprise. The anti-corruption layer reads raw changes, reconstructs meaningful aggregates where necessary, applies mapping logic, enriches with reference data, suppresses technical noise, and emits business-facing messages aligned to a bounded context.
This preserves optionality. You can still support consumers that truly need row-level replication, but you stop pretending that all consumers do. More importantly, you create a migration seam. The translated event contract can survive the eventual replacement of the legacy source, because it is defined in business language rather than table mutation language.
A healthy architecture usually has two topologies operating side by side:
- Replication topology for data engineering, search indexing, audit replay, and low-level synchronization.
- Semantic topology for operational microservices and workflow automation.
Conflating them is the root of most pain.
A reference topology
The key idea is boring and important: capture close to the source, interpretation close to the domain, and consumption close to the bounded context.
That is not ceremony. It is how you prevent a database schema from becoming your enterprise ontology.
Architecture
Let’s get concrete.
1. Raw capture layer
The raw capture layer extracts inserts, updates, and deletes from the source database transaction log. This should be as faithful and low-intrusion as possible. Keep it mechanical. Do not put business interpretation here. This layer exists to preserve ordering as much as the source allows, maintain replayability, and isolate source-specific concerns such as schema evolution and connector behavior.
Kafka is often the right backbone because it provides durable logs, replay, partitioning, consumer independence, and ecosystem maturity. But the topic design matters. Raw topics should be explicitly named as technical feeds, not masquerading as domain streams. A topic called legacy.customer_table_changes is honest. A topic called customer-events for row mutations is architectural fraud.
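That naming discipline can even be enforced mechanically, for example in CI. The prefix and suffix pattern below is an assumed convention, not a standard:

```python
import re

# Convention sketch: raw CDC topics carry an explicit "legacy." prefix and
# a "_changes" suffix so nobody mistakes them for domain event streams.
RAW_TOPIC = re.compile(r"^legacy\.[a-z0-9_]+_changes$")

def is_honest_raw_topic(name: str) -> bool:
    """True if a topic name declares itself as a technical feed."""
    return bool(RAW_TOPIC.match(name))
```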
2. Semantic translation layer
This is the heart of the design.
A translator service—or a small set of them aligned to bounded contexts—consumes raw changes and emits domain-level signals. Translation may involve:
- joining multiple table changes into an aggregate view
- detecting transitions rather than raw states
- collapsing noisy updates
- enriching with static or reference data
- interpreting old status codes into explicit lifecycle stages
- emitting idempotent business events
- publishing curated snapshots for consumers that need current state
This layer often maintains local state. That makes some engineers nervous because they want “stateless streaming.” Ignore the dogma. If domain meaning requires aggregate reconstruction or transition detection, state is part of the job.
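A minimal sketch of stateful transition detection, assuming an in-memory map stands in for a durable state store (a real translator would back this with a changelog-compacted topic or a database):

```python
class TransitionDetector:
    """Keeps the last interpreted state per aggregate so that noisy,
    repeated updates collapse into a single domain transition."""

    def __init__(self):
        self._last = {}  # aggregate id -> last observed status

    def observe(self, aggregate_id: str, status: str):
        previous = self._last.get(aggregate_id)
        self._last[aggregate_id] = status
        if previous == status:
            return None  # duplicate or technical rewrite: suppress
        return {"aggregate": aggregate_id, "from": previous, "to": status}
```

Collapsing noise and detecting transitions both require remembering what came before; that is why "stateless streaming" is the wrong goal here.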
3. Context-facing contracts
Downstream operational services should consume one of two things:
- Domain events, where the temporal transition matters
- Curated state topics, where the latest interpreted state matters more than every intermediate mutation
A billing service might care that an order became billable. A customer profile read model might care only about the latest interpreted customer status. Different contracts for different jobs.
4. Reconciliation loop
Eventually, distributed systems drift. Messages arrive late. Consumers miss windows. Mappings change. The source system contains historical oddities that do not fit today’s logic. So every serious CDC architecture needs reconciliation, not as a patch but as a first-class design element.
Reconciliation compares source truth and derived truth, then corrects discrepancies through replay, compensating events, or explicit repair workflows. Without reconciliation, teams tend to oversell “exactly once” and underinvest in correctness.
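At its core, a reconciliation pass is a comparison between two views of the same keys. The repair action names below are illustrative; emitting the actual compensating events or replays is left to the surrounding workflow:

```python
def reconcile(source: dict, derived: dict):
    """Compare source-of-truth state with derived state per key and
    classify discrepancies into repair actions."""
    actions = []
    for key, truth in source.items():
        if key not in derived:
            actions.append(("replay", key))       # never materialized downstream
        elif derived[key] != truth:
            actions.append(("compensate", key))   # derived value has drifted
    for key in derived.keys() - source.keys():
        actions.append(("repair_orphan", key))    # derived state with no source
    return actions
```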
5. Boundary ownership
Each translated stream should have a product owner in all but name. Someone must own the semantics, schema evolution, compatibility policy, and migration roadmap. Enterprise event platforms rot when nobody owns the meaning.
Domain semantics discussion
This is where domain-driven design matters.
A bounded context is not a technical partition. It is a semantic boundary. The same source row can mean different things in different contexts, and that is fine. In fact, it is healthy. A payment ledger, a customer engagement platform, and a fraud engine should not all be forced to consume the same notion of “customer status” simply because the legacy CRM stores one field with that name.
CDC tempts us to believe there is one canonical truth because there is one canonical database. Enterprises have spent decades proving otherwise.
A better pattern is to use CDC as raw evidence, then build published language per context. That language can still share common nouns, but the contracts should reflect each context’s purpose. “PolicyActivated” in underwriting may not be the same event as “CoverageAvailable” in customer communications, even if both derive from related database mutations.
This is also where anti-corruption layers earn their name. They do not just map field names. They protect the downstream model from upstream conceptual debt. If the legacy system stores ten overloaded status codes and a nullable date that together imply account closure, the translator should emit something explicit. Downstream teams should not reverse-engineer archaeology every sprint.
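A toy version of that translation, with entirely hypothetical status codes and column names, shows the shape of the work:

```python
from datetime import date

# Hypothetical legacy encoding: several overloaded codes plus a nullable
# date together imply "account closed". Codes and fields are illustrative.
CLOSED_LIKE_CODES = {8, 9, 12}

def derive_account_closed(row: dict):
    """Emit an explicit domain event so downstream teams never have to
    reverse-engineer this combination themselves."""
    if row.get("status_code") in CLOSED_LIKE_CODES and row.get("closed_on"):
        return {
            "type": "AccountClosed",
            "account_id": row["account_id"],
            "effective": row["closed_on"].isoformat(),
        }
    return None
```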
A line worth keeping in your head: integration is where your domain model goes to die, unless you defend it.
Migration Strategy
A progressive strangler migration with CDC works best when you treat the pipeline as a bridge, not a destination.
Phase 1: Observe without promising too much
Stand up raw CDC feeds and use them for non-critical consumers first: audit, analytics, search indexes, low-risk notifications. Learn the source behavior. Measure ordering anomalies, duplicate rates, schema quirks, and hidden dependencies. This phase is about humility.
Phase 2: Introduce translation layers by bounded context
Identify a business capability to carve out—customer onboarding, order fulfillment, claims intake, pricing decisions—and build a translator that emits context-appropriate contracts. New services consume translated events, not raw table mutations.
Phase 3: Establish reconciliation
Before pushing business-critical workflows onto the translated streams, implement reconciliation jobs and replay procedures. If you cannot prove that derived state can be checked and repaired, you are not ready.
Phase 4: Shift write ownership gradually
As new microservices take over capabilities, they should stop depending on the legacy source as their semantic authority. At first they may still observe legacy changes. Eventually they become the producer of domain events for their capability. The anti-corruption layer then flips from translator to coexistence adapter.
Phase 5: Retire raw dependencies
This is the part many programs skip. As capabilities migrate, retire downstream consumers of raw CDC topics. If they remain, the old semantics remain sovereign. You have to close the exits behind you.
The point of strangling is not merely to reroute traffic. It is to transfer semantic authority. That is the migration shape worth aiming for.
Dual-run and reconciliation
For a period, both old and new paths may exist. That means dual-run. The legacy system still updates core tables. The new service emits richer events. You compare outputs, detect drift, and slowly move consumers over.
This is not glamorous work. It is spreadsheet architecture with Kafka attached. But it is the difference between migration and wishful thinking.
Enterprise Example
Consider a large insurer modernizing its policy administration estate.
The legacy policy platform is a twenty-year-old package with heavy customization. It owns policy records, endorsements, renewals, cancellations, and premium adjustments. The database is sprawling, with hundreds of tables and status fields whose meaning varies by line of business. The vendor discourages custom changes in core transaction flows, so producing proper application events is unrealistic.
The enterprise wants to build digital servicing APIs, a customer notification platform, and real-time broker integrations. CDC looks ideal.
The naive first attempt
The platform team streams policy table changes into Kafka. Several teams subscribe directly.
- The notification service watches policy status changes.
- The broker portal builds a read model from raw rows.
- A pricing service uses premium update changes to trigger recalculation.
- The analytics team lands everything in the lake.
Within months, trouble appears.
The notification team sends cancellation emails when a policy enters an intermediate suspended state used only during endorsement processing. The broker portal shows transient premium values before adjustment rows settle. The pricing service retriggers on technical corrections and duplicate updates. Every team learns obscure combinations of status code plus effective date plus version number just to infer what actually happened.
Worse, when the insurer starts replacing endorsements with a new microservice, downstream consumers resist change because they are hardwired to the old table model.
The corrected architecture
The insurer introduces a policy semantics layer between raw CDC and operational consumers.
This layer reconstructs policy aggregate state from several source tables, interprets lifecycle transitions, and emits explicit events such as:
- PolicyBound
- EndorsementIssued
- RenewalOffered
- PolicyCancelled
- PremiumAdjusted
It also publishes a curated PolicyView topic containing the latest interpreted state for read models and portals.
The notification service now triggers only on domain events. The portal consumes the curated view. The analytics team still receives raw CDC for forensic and historical analysis. The new endorsements microservice gradually becomes the source of truth for endorsement events, while the semantics layer continues to unify old and new outputs during migration.
What changed architecturally?
Not the transport. Kafka remained. CDC remained. What changed was semantic posture.
The legacy database stopped being treated as an enterprise language. It became a source of evidence. The architecture finally had room for bounded contexts and a strangler migration that did not spread legacy coupling everywhere.
That is the move.
Operational Considerations
A CDC architecture that looks elegant on a whiteboard can still collapse under operational reality. These systems live or die on discipline.
Ordering
Database commit order is not always business order, and Kafka partition order is only per key. If your semantics depend on aggregate transitions, partition by aggregate identifier where possible. Even then, cross-aggregate workflows need tolerance for reordering.
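The partitioning idea can be sketched as follows. Kafka's default partitioner uses murmur2, but any stable hash of the aggregate key gives the same per-aggregate ordering property:

```python
import hashlib

def partition_for(aggregate_id: str, num_partitions: int) -> int:
    """Route every change for one aggregate to the same partition so its
    transitions stay ordered relative to each other."""
    digest = hashlib.md5(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Note what this does not buy you: ordering across different aggregates, which is why cross-aggregate workflows still need tolerance for reordering.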
Idempotency
Downstream consumers must assume duplicates. Translation layers should emit stable event identifiers and version metadata. “Exactly once” is a useful optimization, not a business guarantee.
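One common way to mint stable identifiers is a name-based UUID derived from the aggregate id and version; the namespace URN here is an assumption for illustration:

```python
import uuid

# A fixed namespace for this hypothetical event stream.
EVENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "urn:example:policy-events")

def event_id(aggregate_id: str, version: int) -> str:
    """The same aggregate + version always yields the same id, so a
    consumer can deduplicate replays and redeliveries by id alone."""
    return str(uuid.uuid5(EVENT_NAMESPACE, f"{aggregate_id}:{version}"))
```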
Schema evolution
Raw schemas will change. Legacy teams add columns, split codes, or repurpose fields with alarming confidence. Contracts in the semantic layer need compatibility policies, versioning, and consumer communication. Treat event schemas like APIs.
Backfills and replay
You will replay: to onboard new consumers, correct bugs, or rebuild projections. Design replay modes explicitly. Replaying raw CDC through a translator can accidentally regenerate operational events unless you distinguish historical rebuild from live propagation.
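A minimal way to keep rebuild and live propagation from sharing side effects is an explicit mode switch; the handler shape below is illustrative:

```python
from enum import Enum

class Mode(Enum):
    LIVE = "live"        # normal propagation: side effects allowed
    REBUILD = "rebuild"  # historical replay: update projections only

def handle(change: dict, mode: Mode, projection: dict, notifications: list):
    """Apply a change to the read model; trigger side effects only when live."""
    projection[change["id"]] = change["state"]
    if mode is Mode.LIVE:
        notifications.append(f"notify:{change['id']}")
```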
Observability
Instrument lag, throughput, poison messages, translation failures, schema mismatches, reconciliation drift, and consumer offsets. But also add domain-level observability: counts of inferred lifecycle transitions, suppressed noisy updates, and unresolved semantic ambiguities.
Data quality
Raw source data is often dirty. Nulls where there should be values. Out-of-range codes. Retroactive corrections. Translation layers must decide whether to drop, quarantine, infer, or propagate such cases. Every choice is a business policy masquerading as code.
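Making those policies explicit can be as simple as a triage function; the rules below are placeholders for real business decisions:

```python
def triage(record: dict):
    """Classify a dirty source record. Each branch is a business policy
    masquerading as code, so keep it explicit and reviewable."""
    if record.get("status_code") is None:
        return ("quarantine", "missing status")
    if record["status_code"] not in range(0, 20):
        return ("drop", "out-of-range code")
    return ("propagate", None)
```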
Security and privacy
CDC can expose columns never intended for broad distribution. If you publish raw tables, you may leak PII, financial details, or regulated attributes into the wider estate. Apply field-level controls and minimization early.
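Field-level minimization can be sketched as an allow-list applied before publication; the topic and field names here are hypothetical:

```python
# Allow-list per published topic: everything else is dropped before the
# record leaves the capture boundary.
ALLOWED_FIELDS = {
    "policy-view": {"policy_id", "status", "renewal_date"},
}

def minimize(topic: str, record: dict) -> dict:
    """Strip columns (PII, financials) never meant for broad distribution."""
    allowed = ALLOWED_FIELDS.get(topic, set())
    return {k: v for k, v in record.items() if k in allowed}
```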
Reconciliation topology
Too many architectures omit this loop entirely. The reconciliation engine is not admitting failure. It is accepting distributed truth.
Tradeoffs
Good architecture is often choosing which pain you want on purpose.
Benefits
CDC reduces source-system change risk. It accelerates integration. Kafka-based pipelines provide fan-out, replay, and operational decoupling. A semantic layer allows progressive modernization and protects new services from legacy table design.
Costs
You are adding more moving parts: connectors, streams, translators, schema management, observability, reconciliation, and ownership boundaries. Translation logic can become complex, especially when source data is noisy or spread across many tables.
The core tradeoff
The more faithfully you mirror source changes, the easier the pipeline is to build and the harder the migration becomes.
The more aggressively you translate into domain semantics, the better the long-term architecture and the more judgment you must encode up front.
That is the honest trade. Enterprises often pretend they can avoid it with “generic events.” They cannot.
Another subtle tradeoff
Curated semantic events are excellent for operational use, but they can hide useful source detail needed for audit, forensic analysis, or data science. That is why replication topology and semantic topology should coexist rather than fight.
Failure Modes
This pattern fails in very recognizable ways.
1. Raw CDC is presented as business events
The team renames table topics with business-sounding names and tells everyone to subscribe. Consumers then bake in source joins, status mappings, and timing assumptions. Migration becomes a hostage situation.
2. The translator becomes a new monolith
If one central team owns all semantics for every domain, the anti-corruption layer turns into a semantic bottleneck. Bounded contexts disappear under a shared “enterprise model.” Keep translators aligned to capabilities, not just platforms.
3. Hidden source dependencies survive migration
A capability appears migrated, but critical consumers still rely on old CDC feeds for edge cases or reporting. The old semantics remain mandatory, so retirement stalls.
4. Reconciliation is skipped
The system works in demos and fails in quarter-end close, policy renewal season, or bulk correction weekends. Drift accumulates. Nobody can prove correctness. Confidence evaporates.
5. Event storms from technical churn
Legacy systems often rewrite rows frequently for non-business reasons. If every mutation triggers downstream behavior, you create event storms, duplicate work, and false business signals.
6. Source schema changes become enterprise incidents
When downstream services consume raw tables directly, an innocuous database alteration becomes a company-wide outage. That is not agility. That is dependency multiplication.
7. Semantic mismatches are discovered too late
A service team builds against “customer active” only to discover six months later that the field meant “billable and not soft-deleted unless under review.” Legacy semantics are full of these traps.
When Not To Use
CDC-driven change propagation is not universal medicine.
Do not use it as the primary operational integration pattern when you control the source application and can emit proper domain events from the transaction boundary. If you can publish intent directly from the business action, do that. It is cleaner.
Do not use raw CDC as the enterprise event contract for greenfield microservices. That would be importing legacy habits into new code.
Do not use CDC when domain correctness depends on application-level decisions not visible in persistence changes. Some business actions are assembled from workflows, validations, external calls, and conditional rules that never map cleanly to row mutations.
Do not use it for low-latency transactional orchestration that requires strong consistency between participants. CDC is asynchronous. It is a poor substitute for a properly designed synchronous interaction or a transactional outbox in a system you own.
Do not use it if your organization is unwilling to fund semantics ownership, reconciliation, and contract governance. In that case, you are not building architecture. You are building drift.
Related Patterns
Several patterns sit near this one and are worth distinguishing.
Transactional Outbox
If you own the source service, the outbox pattern is generally better than CDC from the database log because it captures application intent at commit time. It is not always available with legacy systems, but it is the cleaner pattern for new services.
Anti-Corruption Layer
This is the essential companion to CDC in modernization programs. It translates not just formats, but meaning. Without it, CDC is just semantic leakage at scale.
Event Carried State Transfer
Curated state topics are often a better fit than fine-grained events for read models and portals. Not every consumer needs every transition.
Strangler Fig Pattern
CDC can provide the observational seam for strangler migration. But the migration succeeds only when semantic authority moves to the new services over time.
CQRS Read Models
CDC plus semantic translation can feed query-side projections effectively. Just do not confuse a read model feed with a business event stream.
Data Mesh / Data Products
Raw CDC topics can support analytical data products, but operational services should still prefer context-shaped contracts. Shared transport does not imply shared semantics.
Summary
CDC pipelines are powerful because they make change visible. They are dangerous because they make old meaning portable.
That is the architectural truth at the center of change propagation topology. A database mutation is not a business event. A Kafka topic is not a bounded context. And streaming legacy tables into the enterprise does not count as modernization, no matter how fast the messages move.
Used well, CDC is a practical extraction seam. It lets you observe legacy behavior, feed low-level replication use cases, and build a progressive strangler migration when the source cannot be changed. Used badly, it freezes legacy semantics in amber and distributes them to every new service you hoped would escape the past.
The design move is straightforward, though not cheap: capture raw changes faithfully, translate them through domain-aware anti-corruption layers, publish context-appropriate contracts, and back the whole thing with reconciliation. Keep replication topology separate from semantic topology. Let bounded contexts own their language. Retire raw dependencies as migration advances.
If you remember one line, make it this: move the data if you must, but never let the old model become the future’s vocabulary.
That is how CDC helps you modernize instead of merely accelerating your inheritance.