Most integration problems do not begin with technology. They begin with timing.
One system knows something before another system does. A customer changes address in billing, but shipping still sees the old one. Finance closes the books on one version of a transaction while analytics reports another. Operations asks a painfully simple question — “what changed, when, and who knows about it now?” — and the architecture replies with a shrug.
That shrug is where Change Data Capture earns its keep.
CDC is one of those patterns that sounds tactical and turns out to be deeply architectural. On the surface, it is just a way to observe inserts, updates, and deletes and move them somewhere else. In practice, it is a decision about how an enterprise treats truth in motion. It sits between transaction processing and analytical consumption, between bounded contexts and enterprise reporting, between legacy systems that were never designed to cooperate and modern event-driven platforms that demand they do.
Done well, CDC becomes a disciplined bridge: operational systems remain focused on transactions, downstream platforms consume ordered facts about change, and the enterprise gains a more honest picture of what happened. Done badly, CDC becomes a hall of mirrors: duplicate events, semantic confusion, runaway latency, accidental coupling, and data teams forced to reverse-engineer business intent from rows flickering in a transaction log.
So this is not just an article about moving changed rows from A to B. It is about how to design with CDC without letting the data platform become a crime scene.
Context
Most enterprises did not wake up one morning and decide to build a neat event-driven data architecture. They accumulated one.
There is usually a core transaction estate — ERP, CRM, order management, billing, warehouse systems, line-of-business applications — each optimized for its own workflow, often with its own data model and its own sense of what a “customer” or an “order” means. Around that estate sit reporting platforms, data lakes, operational dashboards, machine learning pipelines, and microservices that need slices of the same business reality.
Historically, enterprises solved this with batch extraction. Nightly ETL was the great peacemaker. It was also a liar. Nightly jobs tell you what was true at some convenient moment, not what changed as the business moved. For many use cases that is fine. For fraud detection, inventory visibility, customer notifications, compliance trails, and near-real-time process orchestration, it is not.
CDC appears when the business asks for fresher data without rewriting every core system.
The appeal is obvious:
- avoid invasive changes to source applications
- capture changes near the source of truth
- distribute data incrementally rather than full reloads
- feed Kafka, data lakes, warehouses, caches, and services
- support migration away from tightly coupled point-to-point integrations
But the architectural importance of CDC is not freshness alone. It is selective propagation of state transitions. That phrase matters. Architecturally, CDC is not just replication. It is a way to turn operational persistence into a stream of facts that other bounded contexts can consume.
And there is the catch. A database row changing from status = P to status = S may mean “payment settled,” “shipment staged,” or “back-office workaround applied by a human at 3:14 PM.” The log sees bytes. The business sees meaning. Good CDC architecture lives in that gap.
Problem
The core problem CDC addresses is simple to describe and awkward to solve:
How do we propagate data changes from operational systems to downstream consumers reliably, quickly, and without turning source systems into integration engines?
That problem breaks into several sub-problems:
- Timeliness
Downstream systems need changes faster than batch allows.
- Load
Repeated full extracts punish production databases.
- Coupling
Application teams should not have to embed bespoke outbound integrations for every consumer.
- History
Consumers often need a record of changes, not only the latest state.
- Migration
Enterprises need a path from legacy systems toward event-driven or service-based models without a big bang rewrite.
- Consistency
Consumers need confidence that what they receive is complete, ordered enough for the use case, and reconcilable against source reality.
The naive answer is “just stream the database changes.” That works exactly until it does not.
Because the real problem is not simply transporting changes. It is preserving operational integrity while exposing data movement in a way that aligns with domain semantics. Rows are not business events. Table updates are not contracts. A transaction log is not a ubiquitous language.
Forces
CDC sits in a field of competing forces. Architecture gets interesting when good things collide.
Operational autonomy vs enterprise visibility
Source systems should own their write models and transaction boundaries. Yet the enterprise wants broad visibility into customer, order, account, product, and shipment changes.
The more directly you expose source schemas, the more downstream consumers inherit the source system’s internal design. That is efficient in the short term and corrosive in the long term.
Low latency vs correctness
Near-real-time data movement is attractive. But low latency without reconciliation is just fast wrongness.
A CDC pipeline may appear healthy while silently dropping events during schema changes, connector failover, transaction log retention issues, or consumer deserialization errors. Architects should be suspicious of architectures that optimize only for speed.
Generic replication vs domain semantics
Database logs capture technical changes. Businesses operate on semantic changes.
A single order placement may touch ten tables. A compensation workflow may update the same row several times in one transaction. A consumer that treats each row mutation as a business event will build nonsense quickly.
Domain-driven design matters here. Bounded contexts define meaning. CDC supplies raw signals; domain services or stream processors often need to shape those signals into business-relevant events.
Decoupling vs accidental dependency
CDC is often sold as decoupling. Sometimes it is. Sometimes it is just hidden coupling with better tooling.
If thirty downstream consumers bind themselves to a legacy table structure emitted through Kafka, the architecture has not become more flexible. It has simply moved the blast radius from JDBC to topics.
Migration speed vs architecture hygiene
CDC is a powerful migration tool because it lets new platforms observe old systems without forcing immediate source changes. But temporary bridges have a habit of becoming permanent roads.
An architect has to ask: is this CDC feed a transitional seam, a durable integration product, or both? The answer changes how much normalization, governance, and semantic enrichment you should invest in.
Solution
The practical solution is to treat CDC as a layered pattern, not a single tool.
At its core, a CDC architecture usually has four stages:
- Capture changes from the source system
This may use transaction logs, redo logs, WAL, binlog, trigger tables, timestamps, or application outbox tables.
- Transport those changes to a durable streaming or messaging backbone
Kafka is common because it gives partitioned, replayable logs and broad ecosystem support.
- Shape the raw changes into useful data products
This may include schema normalization, enrichment, deduplication, transaction boundary handling, keying, masking, and transformation from row-level change events into domain events or materialized state views.
- Consume them in analytics, integration, search, caches, microservices, or synchronization targets.
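To make stage one concrete, here is a minimal sketch of what a raw row-level change event might look like and how a consumer could fold it into a current-state view. The field names (`op`, `before`, `after`, `key`) loosely follow the Debezium envelope convention but are illustrative, not a definitive wire format:

```python
# Hypothetical row-level change event, loosely modeled on the Debezium
# envelope: op is "c" (create), "u" (update), or "d" (delete).
def apply_change(state: dict, event: dict) -> dict:
    """Fold one change event into a current-state view keyed by primary key."""
    key = event["key"]
    if event["op"] == "d":
        state.pop(key, None)          # a delete becomes a tombstone downstream
    else:
        state[key] = event["after"]   # creates and updates carry the after-image
    return state

events = [
    {"key": 42, "op": "c", "before": None,
     "after": {"order_id": 42, "status": "PENDING"}},
    {"key": 42, "op": "u", "before": {"order_id": 42, "status": "PENDING"},
     "after": {"order_id": 42, "status": "PAID"}},
]

state = {}
for e in events:
    state = apply_change(state, e)
# state now holds the latest image of order 42
```

Note how the before/after images matter: a consumer that only sees the after-image cannot tell a meaningful status transition from an incidental column touch.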
That layered view is essential because not all consumers want the same thing.
- A data lake may want raw immutable change records.
- A warehouse may want compacted current-state tables plus history.
- A notification service may want domain events such as OrderPaid.
- A search index may want idempotent upserts.
- A customer service application may want a denormalized read model.
CDC provides the source pulse. It should not force every consumer to listen to the heartbeat directly.
Common CDC implementation patterns
Log-based CDC
This is usually the best pattern when available. The connector reads database transaction logs instead of querying base tables. It minimizes source impact and captures committed changes in order at the log level.
Typical technologies include Debezium, Oracle GoldenGate, SQL Server CDC, and native cloud database change streams.
Best for:
- high-volume systems
- low source impact
- broad downstream fan-out
- replay and recovery support
Limitations:
- source-specific operational complexity
- schema evolution handling required
- row-level changes still need semantic interpretation
Trigger-based CDC
Database triggers write changes to audit or queue tables. This works when log access is unavailable or governance prevents log-based tools.
Best for:
- constrained environments
- simpler use cases
- targeted table capture
Limitations:
- increases source transaction overhead
- operationally brittle under heavy write load
- easy to entangle business logic and integration logic
Query-based incremental extraction
Changes are detected via timestamps, version columns, or high-water marks.
Best for:
- low criticality data
- systems with no log access
- simple analytical pipelines
Limitations:
- misses deletes unless designed explicitly
- weak ordering guarantees
- difficult with clock drift and late commits
- often unsuitable for event-driven integration
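A minimal sketch of high-water-mark extraction, using an in-memory SQLite table for the demo (table and column names are illustrative). It also makes the weakness visible: a row committed late with an earlier timestamp would be skipped forever, and a hard delete leaves no trace:

```python
import sqlite3

def extract_incremental(conn, high_water_mark):
    """Pull rows changed since the last watermark; return rows and the new mark.
    Misses hard deletes, and can skip late commits with earlier timestamps --
    exactly the limitations listed above."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at",
        (high_water_mark,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else high_water_mark
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "PAID",    "2024-01-01T10:00:00"),
    (2, "PENDING", "2024-01-01T11:00:00"),
])
batch, mark = extract_incremental(conn, "2024-01-01T10:30:00")
# batch contains only order 2; the watermark advances to its timestamp
```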
Outbox pattern
The application writes domain events into an outbox table within the same transaction as the business update. CDC then publishes those outbox records.
This is often the cleanest pattern for microservices because it preserves domain semantics and transactional integrity.
Best for:
- service-owned systems
- explicit domain event publication
- avoiding dual writes
Limitations:
- requires application change
- less useful when source system is legacy and opaque
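The heart of the outbox pattern is that the business row and the domain event commit or roll back together. A sketch using SQLite as the stand-in database (table names and the event shape are illustrative; a CDC connector, not shown, would later publish rows from the outbox table):

```python
import json
import sqlite3

def place_order(conn, order_id, customer_id, total):
    """Write the business row AND its domain event in one local transaction,
    avoiding the dual-write problem."""
    with conn:  # sqlite3's context manager commits, or rolls back on error
        conn.execute(
            "INSERT INTO orders (id, customer_id, total) VALUES (?, ?, ?)",
            (order_id, customer_id, total),
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?)",
            (order_id, "OrderPlaced",
             json.dumps({"order_id": order_id, "total": total})),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE outbox (aggregate_id INTEGER, event_type TEXT, "
             "payload TEXT)")
place_order(conn, 7, 100, 59.90)
events = conn.execute("SELECT event_type FROM outbox").fetchall()
```

Because the event is written by the application, it carries business intent (“OrderPlaced”) rather than raw column deltas — which is precisely why this pattern preserves domain semantics.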
Architecture
A robust CDC architecture separates raw capture from semantic publication. That is the design move that keeps flexibility alive.
That separation hides an important truth: the raw change topic is not the enterprise contract. It is a technical substrate.
That distinction becomes even more important in domain-driven design.
Domain semantics and bounded contexts
In DDD terms, a source database schema is rarely the same thing as a bounded context contract. Databases are implementation artifacts. Bounded contexts are meaning boundaries.
Suppose the orders table emits a status update. In the Order Management context, that may be a normal progression. In Finance, the interesting event is revenue recognition eligibility. In Fulfillment, the relevant fact is pick-pack-ship readiness. One row change can have several valid downstream interpretations.
This is why sophisticated CDC architectures often use a two-step publication model:
- raw data change stream for technical lineage, replay, and data engineering
- semantic event stream for business consumption across services and domains
The first stream says, “column X changed.”
The second says, “Order confirmed” or “Customer credit limit reduced.”
Those are not the same thing, and pretending otherwise is the shortest path to enterprise confusion.
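The translation step can be sketched as a small interpreter per bounded context. The status codes, event names, and context keys below are illustrative assumptions, not a prescribed taxonomy:

```python
def interpret(raw_change, context):
    """Map a raw row-level change to a context-specific domain event.
    Returns None when the mutation is not a business event in that context."""
    before, after = raw_change["before"], raw_change["after"]
    if before.get("status") != "CONFIRMED" and after.get("status") == "CONFIRMED":
        return {
            "order_management": {"type": "OrderConfirmed",
                                 "order_id": after["id"]},
            "finance": {"type": "RevenueRecognitionEligible",
                        "order_id": after["id"]},
        }.get(context)
    return None  # most row mutations are not business events at all

raw = {"table": "orders",
       "before": {"id": 1, "status": "PENDING"},
       "after":  {"id": 1, "status": "CONFIRMED"}}
# The same raw change yields different events in different contexts.
```

One row change, two valid interpretations — and a third context might legitimately ignore it entirely.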
Kafka and microservices
Kafka fits naturally in CDC-heavy estates because it decouples producers from many consumers and gives replayability. But Kafka does not magically solve semantics. It solves transport and retention very well.
In a typical enterprise pattern, the raw topic is table-oriented, keyed by primary key, and contains before/after states. Stream processing then creates better-shaped outputs:
- filtered events
- domain-specific payloads
- reference-data enrichment
- PII masking
- compaction-friendly current-state topics
- reconciliation counters and audit metrics
Architecturally, that middle layer is where you stop database plumbing from leaking into the whole enterprise.
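Two of those shaping steps — PII masking and keying for a compaction-friendly topic — can be sketched in a few lines. The field names and the masking policy are assumptions for illustration; a real pipeline would drive both from governed metadata:

```python
import hashlib

PII_FIELDS = {"email", "phone"}  # illustrative masking policy

def shape(raw_event):
    """Turn a raw before/after change record into a PII-masked record
    keyed by primary key, suitable for a compacted current-state topic."""
    after = raw_event["after"]
    masked = {
        f: hashlib.sha256(v.encode()).hexdigest()[:12] if f in PII_FIELDS else v
        for f, v in after.items()
    }
    return {"key": after["customer_id"], "value": masked}

out = shape({"op": "u",
             "before": {"customer_id": 9, "email": "a@b.com", "tier": "GOLD"},
             "after":  {"customer_id": 9, "email": "a@b.com", "tier": "PLATINUM"}})
# The tier change flows through; the email never leaves the shaping layer in clear.
```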
Ordering and transaction boundaries
One of the first surprises in CDC projects is that “the order of events” is slippery.
Within a single database transaction log, order may be clear enough. Across partitions, topics, services, and source systems, global order disappears. That is normal. Design for local ordering where it matters and idempotency everywhere else.
Questions to settle explicitly:
- Do consumers require per-key ordering or cross-entity ordering?
- How are multi-row transactions represented?
- Are tombstones emitted for deletes?
- Is before/after image available?
- What is the replay strategy?
- What offset or log position defines recovery?
These are not plumbing details. They shape what business processes can safely be built on top.
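Idempotency is the cheapest of these guarantees to build and the most expensive to retrofit. A minimal sketch of a per-key idempotent consumer that deduplicates by log position (the position field stands in for whatever offset or LSN the capture technology exposes):

```python
def make_idempotent_consumer():
    """Per-key dedup: apply an event only if its log position is newer than
    the last position applied for that key."""
    last_seen = {}   # key -> highest position applied
    applied = []

    def consume(event):
        key, pos = event["key"], event["position"]
        if last_seen.get(key, -1) >= pos:
            return False          # duplicate or out-of-order replay; skip
        last_seen[key] = pos
        applied.append(event)
        return True

    return consume, applied

consume, applied = make_idempotent_consumer()
consume({"key": "order-1", "position": 10, "payload": "A"})
consume({"key": "order-1", "position": 10, "payload": "A"})  # redelivery
consume({"key": "order-1", "position": 11, "payload": "B"})
# Only two events are applied despite three deliveries.
```

Note this only enforces per-key ordering — which, per the discussion above, is usually all you should promise.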
Migration Strategy
CDC shines brightest during migration. It gives architects a seam — and seams are gold.
The strongest enterprise use of CDC is not merely feeding analytics. It is enabling progressive strangler migration from legacy systems.
Instead of replacing a monolith all at once, the enterprise uses CDC to observe the monolith’s state changes, build new capabilities around those streams, and gradually move responsibility outward. This is migration with a pulse, not a cliff jump.
Progressive strangler approach
A sensible migration path often looks like this:
- Start with observation
Capture changes from the legacy system and publish them to a durable stream. Build confidence in completeness, latency, and schema understanding.
- Create downstream read models
Use CDC to populate search indexes, customer 360 views, inventory projections, or reporting stores. This delivers business value without disturbing the source system.
- Introduce semantic transformation
Move from raw row changes to domain-oriented events and data products.
- Carve out new capabilities
New microservices consume CDC-fed views or events while commands still originate in the legacy estate.
- Shift write ownership selectively
For chosen subdomains, move command handling to new services. Use anti-corruption layers and reconciliation to manage coexistence.
- Retire legacy tables or functions
Once ownership is clear, stop relying on CDC as a translation crutch for that capability.
The strangler mindset matters. CDC is a migration enabler, not the target architecture itself.
Reconciliation as a first-class concern
Migration stories fail when architects treat reconciliation as an afterthought. They say, “the pipeline is reliable,” when what they mean is, “we have not noticed obvious breakage yet.”
A serious CDC migration includes reconciliation at multiple levels:
- record count checks between source tables and targets
- hash or checksum comparisons for key aggregates
- high-water mark monitoring for lag and completeness
- business-level balances such as order totals, invoice counts, inventory positions
- exception queues for malformed or late events
- replay procedures for targeted backfills
Reconciliation is the architectural equivalent of balancing the books. If you cannot prove the stream reflects reality, you are not integrating — you are hoping.
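The first two checks in that list can be sketched as a simple comparison of counts plus an order-insensitive checksum. A real implementation would run per partition and time window, but the shape is the same:

```python
import hashlib

def reconcile(source_rows, target_rows, tolerance=0):
    """Compare row counts and an order-insensitive checksum of the rows.
    tolerance allows for in-flight lag when comparing live systems."""
    count_ok = abs(len(source_rows) - len(target_rows)) <= tolerance

    def checksum(rows):
        digests = sorted(hashlib.sha256(repr(r).encode()).hexdigest()
                         for r in rows)
        return hashlib.sha256("".join(digests).encode()).hexdigest()

    return {"count_ok": count_ok,
            "checksum_ok": checksum(source_rows) == checksum(target_rows)}

src = [(1, "PAID", 100.0), (2, "PENDING", 50.0)]
tgt = [(2, "PENDING", 50.0), (1, "PAID", 100.0)]  # same rows, different order
result = reconcile(src, tgt)
```

Sorting the per-row digests before hashing makes the comparison insensitive to arrival order — important, because target order rarely matches source order.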
Enterprise Example
Consider a global retailer with an aging ERP platform running order capture, inventory allocation, and store replenishment. The ERP is stable in the way a concrete bunker is stable: hard to move, hard to love, impossible to casually replace.
The business wants three things:
- near-real-time inventory availability in digital channels
- event-driven fulfillment services for click-and-collect
- a modern analytics platform with historical change visibility
A big bang replacement is fantasy. So the enterprise chooses a CDC-centered migration.
Step 1: Capture from ERP
The architecture team uses log-based CDC from the ERP’s relational database, publishing raw changes into Kafka. They begin with order headers, order lines, stock balances, store transfers, and product references.
At first, downstream teams are thrilled. Data is flowing within seconds instead of overnight. Then reality arrives.
The ERP updates stock balances for reasons that have nothing to do with customer-visible inventory. Reservation adjustments, batch corrections, and reconciliation jobs all produce row changes. If every stock mutation is emitted as “inventory changed,” downstream services overreact and customers see phantom availability.
This is where domain thinking saves the design.
Step 2: Shape semantics
A stream processing layer classifies raw changes into business-relevant categories:
- InventoryReserved
- InventoryReleased
- StockReceiptPosted
- StoreTransferCompleted
It also maintains a current availability projection by location, subtracting non-sellable stock and in-transit quantities according to explicit domain rules.
Now click-and-collect services consume the projection, not the raw table updates. Finance analytics still consumes the raw stream for audit purposes. Different consumers, different products, same source pulse.
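The classification step might look like the sketch below. The movement-type codes and record shape are invented for illustration — the real mapping would come from the ERP's documented movement semantics, negotiated with the domain team:

```python
def classify_stock_change(change):
    """Classify a raw stock-balance row change into a business event,
    or None when it should stay out of the business stream."""
    mapping = {
        "RSV": "InventoryReserved",
        "RLS": "InventoryReleased",
        "RCP": "StockReceiptPosted",
        "TRF": "StoreTransferCompleted",
    }
    event = mapping.get(change["after"].get("movement_type"))
    if event is None:
        return None  # batch corrections, reconciliation jobs, etc.
    delta = change["after"]["qty"] - change["before"]["qty"]
    return {"type": event, "sku": change["after"]["sku"], "delta": delta}

raw = {"before": {"sku": "A1", "qty": 10, "movement_type": None},
       "after":  {"sku": "A1", "qty": 8,  "movement_type": "RSV"}}
# A reservation of 2 units becomes an InventoryReserved event with delta -2.
```

The crucial move is the `None` branch: declining to publish is itself a domain decision, and it is what stops phantom availability reaching customers.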
Step 3: Strangle fulfillment
Next, the retailer introduces a new fulfillment service for click-and-collect orchestration. Initially, it is read-only, reacting to CDC-derived events and projections. Later, command routing changes: online reservations are created in the new service first, then synchronized back to ERP through a controlled integration layer.
For a period, both systems participate. This is dangerous territory. Reconciliation dashboards compare reservation counts and stock deltas by store every fifteen minutes. Exceptions above tolerance route to operations. Without that discipline, the migration would collapse under duplicate reservations and misaligned inventory.
Results
The retailer achieves near-real-time inventory visibility, isolates legacy ERP semantics behind cleaner event contracts, and gradually moves fulfillment capabilities without touching the ERP core on day one.
The architecture succeeds not because CDC moved data, but because the enterprise treated semantics, ownership, and reconciliation as serious design elements.
Operational Considerations
CDC projects often fail in operations long after the diagrams looked beautiful.
Schema evolution
Source schemas change. Columns are added, renamed, widened, repurposed, or quietly abused. If downstream consumers bind directly to raw payloads, schema drift becomes enterprise drift.
Use schema registries, compatibility policies, consumer contract tests, and disciplined versioning. Better still, insulate many consumers behind curated semantic topics.
Backpressure and lag
When downstream processing slows, CDC lag grows. For some use cases that is acceptable. For fraud screening or inventory promises, it may not be.
Monitor:
- source log position vs consumer offset
- topic throughput and partition skew
- processing latency by stage
- lag by critical entity
- dead-letter volumes
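The first of those checks — source log position versus consumer offset — reduces to simple arithmetic per partition. A sketch, with illustrative partition names and thresholds (position units are whatever the capture technology exposes, e.g. an LSN or Kafka offset):

```python
def lag_report(source_position, consumer_offsets, alert_threshold):
    """Compute per-partition lag against the source log position and flag
    partitions that exceed the alert threshold."""
    report = {}
    for partition, offset in consumer_offsets.items():
        lag = source_position[partition] - offset
        report[partition] = {"lag": lag, "alert": lag > alert_threshold}
    return report

report = lag_report(
    source_position={"orders-0": 5000, "orders-1": 5000},
    consumer_offsets={"orders-0": 4990, "orders-1": 3200},
    alert_threshold=100,
)
# orders-1 is 1800 positions behind and triggers an alert; orders-0 does not.
```

What counts as an acceptable threshold is a business question: a fraud screen and a nightly warehouse load deserve very different numbers.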
Security and privacy
CDC can leak more than teams realize. Raw database changes may include PII, financial data, internal notes, or regulated attributes never intended for broad distribution.
Mask or filter at the earliest sensible point. Apply domain-based access controls. A raw CDC stream is often too sensitive to be a general-purpose data product.
Replay and recovery
Replay is one of Kafka’s great strengths and one of the fastest ways to create duplicates if idempotency is weak.
Define:
- replay scope
- deduplication keys
- target reinitialization process
- snapshot plus stream bootstrap strategy
- cutover checkpoints for migration events
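The snapshot-plus-stream strategy in that list hinges on one invariant: apply only stream events recorded after the snapshot's log position, or you double-apply changes. A sketch with an invented event shape:

```python
def bootstrap(snapshot_rows, snapshot_position, stream_events):
    """Initialize a target from a snapshot, then apply only stream events
    newer than the snapshot's log position."""
    state = {row["id"]: row for row in snapshot_rows}
    for ev in stream_events:
        if ev["position"] <= snapshot_position:
            continue  # already reflected in the snapshot; replaying it
                      # would double-apply the change
        if ev["op"] == "d":
            state.pop(ev["id"], None)
        else:
            state[ev["id"]] = ev["row"]
    return state

state = bootstrap(
    snapshot_rows=[{"id": 1, "status": "PAID"}],
    snapshot_position=100,
    stream_events=[
        {"position": 90,  "op": "u", "id": 1,
         "row": {"id": 1, "status": "PENDING"}},
        {"position": 120, "op": "c", "id": 2,
         "row": {"id": 2, "status": "NEW"}},
    ],
)
# The event at position 90 is skipped; order 1 stays PAID and order 2 appears.
```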
Data quality observability
Traditional infrastructure monitoring is not enough. You also need semantic observability:
- impossible state transitions
- sudden null spikes in critical attributes
- business volume anomalies
- reconciliation failures
- staleness by domain aggregate
A healthy connector can still feed a sick business process.
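The first item — impossible state transitions — is a check no infrastructure metric will catch, because it requires a domain model. A sketch with an illustrative order lifecycle:

```python
ALLOWED = {  # illustrative order lifecycle; the real one comes from the domain
    "PENDING": {"PAID", "CANCELLED"},
    "PAID": {"SHIPPED", "REFUNDED"},
    "SHIPPED": {"DELIVERED"},
}

def check_transition(before_status, after_status):
    """Return False for state transitions the domain model says cannot
    happen -- e.g. a SHIPPED order jumping back to PENDING."""
    if before_status == after_status:
        return True
    return after_status in ALLOWED.get(before_status, set())
```

Run against the change stream, a spike in `False` results often signals a broken capture path or an undocumented backdoor process writing to the source tables.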
Tradeoffs
CDC is powerful because it avoids invasive source changes. That is also its biggest compromise.
Advantages
- low-friction integration with existing systems
- near-real-time propagation
- reduced load compared with repeated full extraction
- strong support for migration and strangler patterns
- replayable history when paired with log-based streaming
- broad utility across analytics and operational integration
Costs
- row changes lack business intent
- source schema leaks easily into downstream architecture
- eventual consistency is unavoidable
- reconciliation complexity is substantial
- deletes and transaction semantics can be tricky
- operational support requires serious maturity
If you remember one line from this article, let it be this:
CDC is excellent at telling you that data changed. It is mediocre at telling you what the business meant.
That is not a flaw in CDC. It is a reminder to stop asking one pattern to do another pattern’s job.
Failure Modes
The most common CDC failures are not spectacular explosions. They are slow leaks.
Semantic misinterpretation
A team treats table mutations as business events and builds automation on top. Months later, process exceptions pile up because not every status change means what consumers assumed.
Hidden coupling
Downstream consumers depend on source column names, table structures, and internal codes. Source modernization becomes almost impossible because the database schema has become a public API by accident.
Incomplete capture
Deletes are not captured. Certain tables are excluded. Log retention expires before connectors catch up. Bulk maintenance jobs bypass the expected path.
Duplicate processing
Replays, retries, connector failovers, and partition rebalances create duplicate delivery. Consumers that are not idempotent produce double notifications, duplicate invoices, or inflated aggregates.
Dual-write inconsistency during migration
A new service writes one system while a legacy process writes another. CDC is then asked to stitch the truth together after the fact. It rarely ends well without explicit ownership boundaries.
Over-centralized transformation
An enterprise creates one giant “CDC platform team” that owns every transformation for every domain. They become the bottleneck, domain semantics become watered down, and teams bypass the platform with ad hoc feeds.
When Not To Use
CDC is not a universal answer. Sometimes the right move is to resist it.
Do not use CDC as the primary integration pattern when:
- you control the application and can emit proper domain events directly
- the business process requires synchronous validation and immediate consistency
- source data structures are too unstable to expose safely even indirectly
- the use case is low-frequency and batch is good enough
- compliance rules forbid broad dissemination of raw operational changes
- teams lack operational capability for streaming infrastructure and reconciliation
In a well-designed greenfield microservice, the outbox pattern or explicit event publication is often better than generic CDC. If the application can say “CustomerOnboarded” itself, do not force consumers to infer it from three table updates and a nullable flag.
Likewise, if a monthly finance report can tolerate nightly loads, a heavy CDC estate may be architectural theatre. Freshness is not free.
Related Patterns
CDC rarely stands alone. It works best as part of a pattern language.
Outbox pattern
Best for publishing domain events transactionally from service-owned systems.
Event sourcing
Stores domain events as the system of record itself. Much richer semantically than CDC, but also more invasive and demanding.
Strangler fig pattern
Use CDC as an observation and synchronization seam while gradually replacing legacy functions.
Data vault and history-aware warehousing
CDC feeds can be useful for preserving historical change trails and loading hubs, links, and satellites.
Materialized view / CQRS read models
CDC can populate denormalized read models for search, dashboards, and operational query patterns.
Anti-corruption layer
Essential when raw legacy semantics need translation before entering modern bounded contexts.
These patterns complement each other. The mature architecture chooses deliberately instead of forcing everything through one integration shape.
Summary
Change Data Capture is one of the most useful patterns in modern data architecture precisely because it is humble. It does not demand that every source system be rewritten. It meets enterprises where they are: with old databases, messy semantics, urgent reporting needs, microservice ambitions, and migration programs that cannot stop the business.
But humility should not be mistaken for simplicity.
CDC is a bridge, not a destination. A log, not a language. A mechanism for observing change, not a substitute for domain modeling. The best architectures use CDC to expose movement in operational truth, then add the things raw data cannot provide on its own: bounded-context semantics, curated contracts, reconciliation, idempotency, ownership boundaries, and operational discipline.
If you are building a CDC architecture, design the layers clearly:
- capture raw changes reliably
- separate technical streams from business contracts
- align outputs to domain semantics
- reconcile aggressively
- migrate progressively through strangler seams
- avoid turning source schemas into enterprise APIs
In enterprise architecture, the dangerous designs are often the ones that look easiest at first. CDC can be wonderfully pragmatic, but only if you respect what it does not solve.
Rows change all the time. The real architectural work is deciding what those changes mean, who should care, and how much truth in motion your enterprise can responsibly handle.