Streaming Pipelines Do Not Remove Data Modeling

There is a particular fantasy that shows up every few years in enterprise architecture. It usually arrives wearing modern clothes. Once it was the enterprise service bus. Then it was data lakes. Now it often comes wrapped in Kafka, streams, event meshes, and real-time dashboards. The fantasy is simple: if we move data continuously, fast enough, and at scale, we no longer need to think very hard about what the data means.

That fantasy is expensive.

Streaming pipelines are superb at moving facts. They are not good at deciding what a fact is, when it is complete, which business boundary owns it, or what should happen when two systems disagree. A stream can carry an event called CustomerUpdated across twenty services in under a second. It still cannot tell you whether “customer” means legal entity, billing account, household, identity profile, or CRM contact. That is not a transport problem. It is a modeling problem.

This is why schema topology matters. Not merely schema design, but the shape of schemas across a landscape: where they originate, how they change, how many translations sit between domains, where canonical ambitions create bottlenecks, and where local autonomy turns into semantic chaos. In streaming architectures, topology becomes architecture. If you get it wrong, the stream becomes a rumor mill—fast, distributed, and impossible to reconcile.

The core argument is unfashionably old-fashioned: data modeling still matters, perhaps more than ever, in event-driven and streaming systems. Domain-driven design gives us a practical way to reason about semantics, ownership, and boundaries. Migration strategy tells us how to get there without stopping the business. And operational reality reminds us that every event you emit today becomes tomorrow’s contract, lawsuit, audit trail, or incident ticket.

Context

Streaming pipelines became mainstream for good reasons. Businesses want lower latency, fresher decisions, and less batch glue holding together brittle applications. Kafka and similar platforms made it practical to publish high-volume event streams, decouple producers from consumers, and build architectures where many downstream services react independently.

That part is real. The benefits are substantial.

A bank can detect fraud while a card authorization is still in flight. A retailer can update inventory positions within seconds rather than overnight. An insurer can react to claim milestones without polling three separate systems. Streaming changes the tempo of the enterprise.

But tempo is not meaning.

In many organizations, streaming adoption begins in the platform team. They build a cluster, standardize producers, add CDC connectors, define a topic naming convention, and suddenly there is an intoxicating sense of progress. Data is flowing. Teams can subscribe. Dashboards light up. Someone inevitably says, “Now we can get rid of all that painful data modeling and just use events.”

That sentence should make any architect sit up straight.

Events are not anti-model. They are model made public. Every event schema is a declaration of what the business believes happened. Every topic layout is a statement about domain boundaries. Every consumer that interprets a field differently is evidence that your semantic architecture is leaking.

The move from request/response integration to event-driven integration does not remove data modeling. It distributes it. And distributed ambiguity is much harder to repair than centralized ambiguity.

Problem

The practical problem is not that enterprises lack schemas. They have too many of them, with no coherent topology.

A typical streaming estate accumulates at least four classes of schema:

  • operational database schemas exposed through CDC
  • API payload schemas designed for application integration
  • event schemas intended to represent business facts
  • analytical schemas optimized for reporting and machine learning

These are often treated as interchangeable. They are not.

A row-level database change is not the same thing as a business event. An API contract optimized for command handling is not automatically suitable for asynchronous publication. An analytical star schema is not a domain model. When teams flatten these distinctions, they produce streams that are technically valid and semantically unstable.
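To make the distinction concrete, here is a sketch contrasting the two artifacts. The payloads and field names are hypothetical, not any real system's schema; the point is only the difference in what each one tells a consumer.

```python
# Illustrative payloads only; all field names here are hypothetical.

# A CDC record leaks persistence structure: cryptic column names, technical
# status codes, no business vocabulary, no statement of what actually happened.
cdc_record = {
    "table": "ORD_HDR",
    "op": "U",
    "before": {"ord_id": 981, "stat_cd": 7},
    "after": {"ord_id": 981, "stat_cd": 8},
    "ts_ms": 1718000000000,
}

# A domain event states a business fact in the language of the bounded context,
# with explicit identifiers and an occurrence time.
domain_event = {
    "event_type": "OrderConfirmed",
    "event_version": 1,
    "order_id": "ORD-981",
    "customer_id": "CUST-445",
    "occurred_at": "2024-06-10T08:13:20Z",
    "confirmed_total": {"amount": "129.90", "currency": "EUR"},
}

# A consumer of the CDC record must reverse-engineer that stat_cd 7 -> 8 means
# "confirmed"; a consumer of the domain event does not.
```

Both records describe the same moment, but only one of them survives a schema refactor of the source table.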

This usually appears in familiar ways:

  • topics mirror tables rather than business capabilities
  • field names drift across services: customerId, partyId, accountHolderId
  • events contain partial or denormalized snapshots with unclear lifecycle meaning
  • consumers implement local joins and compensating logic to infer domain state
  • “canonical events” become giant lowest-common-denominator payloads
  • schema evolution is handled mechanically, while semantic evolution is ignored

The result is not event-driven architecture. It is distributed confusion at high throughput.

And once many consumers depend on a stream, poor modeling hardens quickly. A bad event is like wet concrete on a busy road: everyone keeps driving over it, and now you cannot change it without causing a pileup.

Forces

A serious architecture decision sits in the tension between forces, not in a tidy pattern catalog. Streaming and schema topology are shaped by several hard forces.

1. Domain autonomy versus enterprise coherence

Domain-driven design rightly tells us to honor bounded contexts. Sales, billing, fulfillment, fraud, claims, and identity each have their own language and logic. They should not be forced into a single canonical model.

But large enterprises still need coherence. Regulators, finance teams, customer service, and data platforms need to reconcile information across domains. The trick is not choosing one side. It is deciding where translation belongs.

Too much autonomy, and every topic becomes a dialect nobody else fully understands. Too much standardization, and every team waits for the central data council to approve a field addition.

2. Event immutability versus business correction

Streaming loves append-only logs. Businesses love corrections.

Orders are amended. Policies are backdated. Customer records are merged. Payments are reversed. Claims are reopened. Product hierarchies are reclassified. If your schema topology assumes linear, clean, immutable progression, the business will break it in the first quarter.

3. Producer convenience versus consumer usefulness

The easiest event for a producer to emit is often a CDC record or a serialized aggregate snapshot. The most useful event for consumers is usually a well-named domain fact with stable semantics and explicit identifiers.

Those are rarely the same artifact.

4. Local optimization versus long-term evolution

A team under delivery pressure will publish whatever gets the feature done. Six months later, ten downstream consumers rely on it, and now semantics cannot change cheaply. Event schemas have a nasty habit of becoming architecture through inertia.

5. Real-time aspiration versus reconciliation reality

No matter how good the stream, some systems of record remain asynchronous, incomplete, or wrong. Enterprises need reconciliation processes—periodic comparison, exception handling, and repair. Architects who ignore reconciliation are designing for demos, not operations.

Solution

The solution is to treat schema topology as a first-class architectural design concern and to root it in domain semantics, not transport mechanics.

That means a few opinionated choices.

First, model events around business facts within bounded contexts. Do not publish database changes and pretend they are domain events. If you want to expose CDC, expose it honestly as technical change data, not as business truth.

Second, design a topology of schemas rather than aiming for one schema to rule them all. Enterprises need multiple schema layers, each with distinct responsibilities.

A practical topology usually looks something like this:

  • source schemas: operational tables, internal service models
  • domain event schemas: business facts owned by a bounded context
  • integration projections: context-translated views for cross-domain use
  • analytical schemas: warehouse or lakehouse models optimized for query and reporting
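One cheap way to make this topology visible is to encode the layer in the topic name itself, so nobody mistakes a CDC stream for a domain event. The prefixes and format below are an assumption for illustration, not a standard.

```python
# Hypothetical naming convention: make the schema layer, owning context, and
# subject explicit in every topic name.
LAYER_PREFIXES = {
    "cdc": "cdc",            # raw source-level change logs
    "domain": "domain",      # business facts owned by a bounded context
    "projection": "proj",    # context-translated views for cross-domain use
    "analytics": "dwh",      # analytical models for query and reporting
}

def topic_name(layer: str, context: str, subject: str) -> str:
    """Build a topic name that encodes layer, owning context, and subject."""
    prefix = LAYER_PREFIXES[layer]  # raises KeyError for unknown layers
    return f"{prefix}.{context}.{subject}".lower()

print(topic_name("domain", "payments", "PaymentAuthorized"))
# -> domain.payments.paymentauthorized
```

A consumer that sees `cdc.` in a topic name knows it is reading a change log, not a business fact, before writing a single line of code.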

Third, make translation explicit. Translation is not architectural failure. It is architecture. Between bounded contexts, anti-corruption layers remain just as important in streaming as they were in service integration.

Fourth, separate semantic governance from centralized ownership. A platform team can run schema registries, compatibility tooling, topic policies, and observability. It should not become the owner of business meaning. Semantics belong to domain teams, with enterprise review only where cross-domain contracts justify it.

Fifth, design for reconciliation from day one. Streams improve propagation. They do not eliminate disagreement. Every important cross-system business process needs a way to detect divergence and repair it.

In short: model locally, publish deliberately, translate explicitly, reconcile routinely.

Architecture

A useful streaming architecture is not “everything emits events.” It is a set of domain-aligned streams with clear ownership, selective projections, and a reconciliation backbone.

Here is the conceptual shape.

[Diagram 1: domain-aligned streams, integration projections, and the reconciliation backbone]

The key distinction is between domain event topics and integration projections.

A domain event topic should express facts meaningful inside a bounded context: PolicyIssued, ClaimRegistered, PaymentAuthorized, ShipmentDispatched. These are not just rows changing. They are business moments. They carry identifiers, occurrence time, version, causation metadata, and enough context to be useful without turning into bloated snapshots.

Integration projections are downstream artifacts built specifically for consumers outside the originating domain. A fraud service may not want all the detail inside PaymentAuthorized; it may need a risk-oriented projection enriched with customer trust tier and merchant category. That is a projection. Call it one. Own it as one.
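The fraud example can be sketched as a single translation function. Everything here is hypothetical (field names, the enrichment lookups); the design point is that enrichment and renaming happen once, at a governed boundary, instead of in every consumer.

```python
# Sketch of an integration projection: translate a payments-domain event into a
# risk-oriented view for the fraud context. All names are illustrative.

def to_fraud_projection(
    payment_event: dict,
    trust_tiers: dict,
    merchant_categories: dict,
) -> dict:
    """Build the fraud team's projection from a PaymentAuthorized domain event.

    The fraud context never sees raw payment internals; it gets only the fields
    it asked for, in its own vocabulary, enriched with the reference data it
    cares about.
    """
    customer_id = payment_event["customer_id"]
    merchant_id = payment_event["merchant_id"]
    return {
        "projection_type": "PaymentRiskObserved",
        "payment_id": payment_event["payment_id"],
        "customer_id": customer_id,
        "amount": payment_event["amount"],
        "currency": payment_event["currency"],
        # Enrichment happens here, once, instead of in every consumer.
        "customer_trust_tier": trust_tiers.get(customer_id, "UNKNOWN"),
        "merchant_category": merchant_categories.get(merchant_id, "UNKNOWN"),
    }

projection = to_fraud_projection(
    {"payment_id": "PAY-1", "customer_id": "CUST-9", "merchant_id": "M-77",
     "amount": "59.00", "currency": "EUR"},
    trust_tiers={"CUST-9": "GOLD"},
    merchant_categories={"M-77": "ELECTRONICS"},
)
```

In production this function would live in a stream processor owned by the payments domain (or a jointly governed translation service), not inside the fraud consumer.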

This is where schema topology starts to matter. If every external consumer reads raw domain events and performs its own interpretation, your architecture quietly creates ten competing semantic models. Better to concentrate those translations where they can be governed and observed.

Domain semantics and bounded contexts

In DDD terms, each bounded context should own its ubiquitous language and event model. “Customer” in identity is not the same as “customer” in billing, and pretending otherwise is the beginning of semantic debt.

The architectural move is simple but often resisted: let contexts publish in their own language, then translate for others.

A bank is a good example. The retail banking domain may publish CurrentAccountOpened. The CRM domain may think in terms of PartyEngaged. The compliance domain cares about KYCVerified. These are related but not equivalent. Forcing them into one canonical CustomerEvent with 120 optional fields creates a schema that offends everyone and clarifies nothing.

You want shared identifiers and traceability, not forced linguistic unity.

Event classes

Not all streams are equal. It helps to classify them.

  • domain events: meaningful business facts
  • technical events: operational signals such as retries, failures, workflow transitions
  • CDC streams: source-level change logs
  • state snapshots: compacted views for current-state consumers

Confusing these causes endless trouble. Teams consume CDC as if it were business truth, or they push snapshots into topics called “events” and later wonder why consumers misread them.

Use naming and governance to make these distinctions visible. Honest labels reduce accidental misuse.

Schema evolution

Schema evolution in streaming is usually treated as a compatibility puzzle: backward, forward, full. Useful, but not enough.

A field can be backward compatible and semantically destructive.

For example, changing status=ACTIVE to mean “commercially active but pending regulatory validation” may keep parsers happy while misleading every consumer. Compatibility tooling catches syntax drift. It does not catch meaning drift. That requires domain review, consumer impact analysis, and, in some cases, a new event version or a new event entirely.

A good rule is this: if the business interpretation changes materially, treat it as a semantic versioning event, not merely a schema edit.
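The gap between the two kinds of check is easy to demonstrate. Below is a toy backward-compatibility check in the spirit of registry tooling (not any real registry's algorithm): it passes a change that quietly redefines an existing field's meaning.

```python
# Toy backward-compatibility check: a change is "backward compatible" if every
# field the old schema had is still present with the same type. This mirrors
# the mechanical spirit of registry checks; it is not a real registry algorithm.

def backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    return all(
        name in new_schema and new_schema[name] == ftype
        for name, ftype in old_schema.items()
    )

v1 = {"account_id": "string", "status": "string"}
v2 = {"account_id": "string", "status": "string", "regulatory_hold": "boolean"}

# The check passes: nothing was removed or retyped.
assert backward_compatible(v1, v2)

# But if status="ACTIVE" now means "active pending regulatory validation" unless
# regulatory_hold is consulted, every v1-era consumer silently misreads the
# stream. No schema tool can see that; only domain review can.
```

The tooling is still worth running; it catches the syntactic breaks. It simply cannot substitute for a human asking "did the meaning of this field change?"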

Here is a practical topology with translation points.

[Diagram 2: schema topology with explicit translation points between bounded contexts]

Notice the anti-corruption layers. They are not a sign that the model failed. They are what stop one domain’s language from colonizing another.

Migration Strategy

Most enterprises do not get to start clean. They inherit packaged applications, legacy databases, nightly feeds, and accidental canonical models buried in ETL jobs. So the right question is not “what is the perfect streaming architecture?” It is “how do we migrate without breaking the business?”

The answer is progressive strangler migration.

Start by identifying a domain where latency matters and semantics are reasonably stable. Do not start with the most politically charged master data entity in the company. Start where there is clear business value and a bounded scope—order lifecycle, card authorization, shipment status, claim intake.

Then apply a layered migration path.

Step 1: expose changes, but label them honestly

CDC is a perfectly valid first step. It gives visibility and lets downstream teams experiment. But do not market CDC topics as domain events. They are source-system change logs. Useful, yes. Semantically complete, no.

Step 2: introduce domain events at the service boundary

As teams understand the business process better, have the owning domain publish explicit business events from application logic or outbox patterns. This is the point where modeling becomes deliberate. You are no longer leaking persistence structure. You are stating business facts.
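The outbox pattern mentioned above can be sketched with sqlite3 standing in for the operational database. The table layout is illustrative; the essential property is that the state change and the event record commit in one transaction, so a relay can publish later without losing facts.

```python
# Minimal outbox sketch. sqlite3 stands in for the operational database; the
# table layout and names are illustrative only.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (event_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)"
)

def confirm_order(order_id: str) -> None:
    # One transaction: the state change and the outbox row commit together,
    # or neither does. No "updated the row but lost the event" window.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO orders (order_id, status)"
            " VALUES (?, 'CONFIRMED')",
            (order_id,),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderConfirmed", json.dumps({"order_id": order_id})),
        )

confirm_order("ORD-981")

# A separate relay process would now read unpublished outbox rows, publish them
# to the broker, and mark them published on success.
pending = conn.execute(
    "SELECT event_type FROM outbox WHERE published = 0"
).fetchall()
```

Tools such as Debezium can also tail the outbox table via CDC, which is the one place where CDC and domain events legitimately meet: the change log carries rows that were already modeled as business facts.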

Step 3: build integration projections for key consumers

Once one or two downstream domains rely on the events, create explicit translation services or stream processors that produce consumer-aligned projections. This reduces semantic reinterpretation at every endpoint.

Step 4: add reconciliation workflows

Before decommissioning legacy batch feeds, establish a reconciliation mechanism between old and new pathways. Count comparisons, key coverage checks, state equivalence rules, exception queues. The migration is not real until you can prove consistency—or at least explain inconsistency.

Step 5: strangle old integrations incrementally

Retire nightly extracts, point-to-point APIs, or direct database reads one consumer at a time. Keep the old path until the new stream and reconciliation controls have proven themselves over a meaningful operational period.

This pattern matters because enterprises rarely fail at the first event publication. They fail during coexistence.

Here is the migration picture.

[Diagram 3: strangler migration with parallel pathways and reconciliation controls]

Reconciliation is not optional

This deserves blunt language. Reconciliation is not a temporary bridge for weak architectures. It is a permanent capability in serious enterprises.

Why? Because streams do not prevent duplicates, out-of-order delivery, retroactive correction, reference data drift, missing events, external system lag, or human intervention. They simply make propagation faster.

Good reconciliation has several layers:

  • technical reconciliation: did all expected messages arrive and process?
  • key reconciliation: do entity populations match across systems?
  • state reconciliation: are the derived business states equivalent?
  • financial or regulatory reconciliation: do totals, balances, and legal records align?
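The second layer, key reconciliation, is the easiest to automate and catches a surprising share of divergence. A sketch under obvious assumptions: the inputs are sets of business keys, which in practice would come from extracts or state stores on each pathway.

```python
# Key-reconciliation sketch: compare entity populations between the legacy
# pathway and the new stream-derived store. Names are illustrative.

def reconcile_keys(legacy_keys: set[str], stream_keys: set[str]) -> dict:
    """Report which entities each pathway is missing, and how many match."""
    return {
        "missing_in_stream": sorted(legacy_keys - stream_keys),
        "missing_in_legacy": sorted(stream_keys - legacy_keys),
        "matched": len(legacy_keys & stream_keys),
    }

report = reconcile_keys(
    legacy_keys={"ORD-1", "ORD-2", "ORD-3"},
    stream_keys={"ORD-2", "ORD-3", "ORD-4"},
)
# The report feeds an exception queue; unresolved divergence should block
# decommissioning of the legacy feed.
```

State and financial reconciliation build on the same skeleton: instead of comparing key sets, compare derived states or totals per key and emit the mismatches.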

Architects who skip this usually end up with “eventual consistency” as a slogan and spreadsheet-based repair as the operating model.

Enterprise Example

Consider a multinational retailer modernizing order management across e-commerce, stores, warehouse management, and finance.

The legacy situation is painfully common. E-commerce writes orders into a central Oracle schema. Store systems submit updates through APIs. The warehouse reads order tables directly. Finance receives nightly files. Customer service uses a CRM that keeps its own interpretation of order and customer status. Reporting teams built a separate order mart. Nobody agrees on what “fulfilled” means.

The modernization program introduces Kafka and decomposes capabilities into services: Order, Payment, Inventory, Fulfillment, Customer Notification, Returns. The platform team initially proposes publishing CDC from the order tables and letting everyone subscribe.

That would be quick. It would also be wrong.

Why? Because the order tables contain technical statuses reflecting UI workflow and persistence choices, not stable business facts. A row changing from alloc_state=7 to alloc_state=8 is meaningful to the monolith. It is useless to finance and misleading to customer notification.

A better design looks like this:

  • Order domain publishes OrderPlaced, OrderConfirmed, OrderAmended, OrderCancelled
  • Payment domain publishes PaymentAuthorized, PaymentCaptured, RefundIssued
  • Inventory domain publishes InventoryReserved, InventoryReleased
  • Fulfillment publishes ShipmentCreated, ShipmentDispatched, ShipmentDelivered, DeliveryFailed

Finance does not subscribe to raw order internals. It consumes an integration projection expressing invoice-relevant events and line-level financial attributes. Customer notification consumes a communication-oriented projection with customer-preferred channel and localized messaging context. The warehouse consumes fulfillment and reservation views, not CRM customer updates.

Now the difficult part: “fulfilled.” In store pickup, an order may be commercially fulfilled when the package is ready for collection. In home delivery, it may mean delivered. In finance, revenue recognition may occur at a different milestone. There is no single universal “fulfilled.” DDD helps here. Each bounded context uses terms that match its decisions. Cross-domain reporting uses translated, reconciled measures rather than pretending the domains are identical.

During migration, nightly finance files remain in place while event-driven projections are introduced. A reconciliation service compares order line populations, shipment totals, tax amounts, and refund events between old and new pipelines. Exceptions are triaged daily for three months. Only then are batch feeds retired.

That is real enterprise work. Less glamorous than drawing topics on a whiteboard, far more likely to survive quarter close.

Operational Considerations

Streaming architectures fail operationally long before they fail conceptually. If you want your schema topology to hold, operations must enforce it.

Ownership and stewardship

Every event topic needs a clear owner. Not the Kafka team. The business-aligned domain team. Ownership means schema decisions, documentation, compatibility review, deprecation policy, and consumer communication.

Platform teams own the rails. Domain teams own the meaning.

Contract governance

Use a schema registry, yes. Automate compatibility checks, yes. But also require semantic review for high-value contracts. A tiny field change can ripple through fraud rules, revenue calculations, and audit processes.

Good governance is selective. Review the contracts that matter. Do not create a central committee for every optional field.

Observability

Observe streams at three levels:

  • infrastructure health: lag, throughput, partition skew, retention, consumer errors
  • contract health: version adoption, invalid payloads, dead-letter volume
  • business health: event counts, lifecycle completion, missing milestones, reconciliation rates

The last one matters most. A healthy cluster can still carry unhealthy business semantics.

Retention and replay

One of Kafka’s great strengths is replay. But replay is only useful when event meaning remains interpretable over time. If schemas evolved carelessly, replay becomes archaeology.

Keep event metadata rich enough for replay and audit: event id, entity id, occurrence time, producer version, correlation id, causation id, source system, and if needed, business effective date.
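A cheap guardrail is to reject or flag events that lack this metadata before they reach the broker. The required-field list below mirrors the text; the names are conventions assumed for illustration.

```python
# Sketch of a pre-publish metadata guardrail. Field names are conventions
# assumed from the text, not a standard.
REQUIRED_METADATA = {
    "event_id", "entity_id", "occurred_at",
    "producer_version", "correlation_id", "source_system",
}

def missing_metadata(event: dict) -> set[str]:
    """Return the required metadata fields absent from an event envelope."""
    return REQUIRED_METADATA - event.keys()

evt = {
    "event_id": "e-1",
    "entity_id": "ORD-9",
    "occurred_at": "2024-06-10T08:00:00Z",
    "producer_version": "order-service 3.2",
    "source_system": "order-service",
}
gaps = missing_metadata(evt)  # here: {"correlation_id"} -> reject or flag
```

Enforcing this at publish time is far cheaper than discovering, mid-replay or mid-audit, that a year of events cannot be correlated.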

Data quality and reference data

Streams often depend on shared reference data: product hierarchies, branch codes, risk categories, tax rules. If reference data is inconsistent across domains, your event consumers will derive different truths from identical events.

Reference data governance sounds boring. That is because it is essential.

Tradeoffs

There is no free architecture here.

The schema-topology approach increases upfront design effort. Teams must think harder about domain semantics, event boundaries, and translation layers. This can feel slower than just publishing table changes and moving on.

It also introduces more artifacts: domain events, projections, anti-corruption layers, reconciliation jobs, schema review practices. Architects should be honest about this. You are trading local simplicity for systemic clarity.

The upside is substantial:

  • clearer ownership
  • safer evolution
  • less semantic drift
  • easier onboarding of consumers
  • better auditability
  • reduced long-term integration sprawl

The downside is equally real:

  • more design discipline required
  • more translation components to operate
  • possible duplication of data in projections
  • pressure from teams that want a canonical shortcut
  • migration periods with parallel pathways and extra reconciliation cost

A common executive complaint is, “Why do we have three versions of customer in the stream?” The answer is that the business has three meanings of customer. The cost of pretending otherwise shows up later as rework, incidents, and endless “data quality” programs that are really modeling failures.

Failure Modes

The most common failure mode is CDC masquerading as domain architecture. Teams publish change streams from operational tables, call them business events, and hope consumers can figure it out. They cannot, at least not consistently.

Another is the canonical event trap. A central team creates enterprise-wide schemas for Customer, Order, Product, and Payment, trying to satisfy all consumers. These schemas become bloated, politically negotiated, and semantically diluted. Change slows to a crawl. Teams route around the standard. Shadow topics appear. Governance loses credibility.

A third failure mode is consumer-defined semantics. Producers emit vague events like OrderUpdated, and each consumer interprets field combinations differently. This works until audit, financial discrepancy, or regulatory review reveals five versions of the truth.

Then there is reconciliation denial. Architects declare eventual consistency and omit control processes. Operational teams later discover missing events, duplicate captures, delayed corrections, and impossible balances. They build ad hoc repair scripts and spreadsheet reports. Congratulations: you have reinvented manual batch operations on top of a streaming platform.

Finally, semantic version negligence is deadly. Teams make “compatible” changes that alter business meaning without changing event names or versions. Consumers continue processing successfully while making wrong decisions. These are the nastiest failures because monitoring rarely catches them early.

When Not To Use

Do not use a rich streaming schema-topology approach everywhere.

If you have a simple, tightly coupled system with one producer and one consumer inside a single team, a full event-contract discipline may be overkill. A synchronous API or even direct persistence integration might be simpler and perfectly adequate.

Do not force streaming where business latency requirements are measured in days and the reconciliation cost outweighs the benefit. Plenty of enterprise processes are still better served by well-designed batch, especially where source systems cannot emit trustworthy real-time events.

Do not invest heavily in domain event modeling if the domain itself is unstable and not yet understood. In very early product discovery, over-modeling can fossilize immature concepts. Learn first. Formalize once the language settles.

And do not use streaming as camouflage for unresolved ownership. If nobody can answer who owns customer identity versus customer billing relationship, adding Kafka will not help. It will merely distribute the confusion more efficiently.

Related Patterns

Several patterns complement this approach:

  • outbox pattern: reliable domain event publication from transactional systems
  • anti-corruption layer: translating between bounded contexts without leaking one model into another
  • event-carried state transfer: for consumers that need richer state snapshots, used carefully and labeled honestly
  • CQRS projections: read models tuned to consumer use cases
  • data mesh: domains owning analytical data products, provided this is grounded in actual domain semantics and not hand-waving about decentralization
  • master data management: where legal or enterprise identity resolution is genuinely required, kept separate from the fantasy of a universal canonical event model

These patterns help, but none replace the central discipline: understand the domain, decide ownership, model meaning, and make translation explicit.

Summary

Streaming pipelines are powerful. They reduce latency, decouple systems, and create a nervous system for the modern enterprise. But nerves do not create meaning. They transmit signals. The body still needs organs, boundaries, and a brain.

Data modeling does not disappear in streaming architecture. It becomes more important because every schema is now a contract in motion, every event a public statement of business truth, and every consumer a potential amplifier of ambiguity.

The practical answer is schema topology: a deliberate arrangement of source schemas, domain events, integration projections, and analytical models, all grounded in domain-driven design. Publish business facts, not table accidents. Let bounded contexts speak their own language. Translate explicitly between them. Migrate progressively with strangler patterns. Reconcile relentlessly.

If that sounds less magical than “just put it on Kafka,” good. Enterprise architecture should be suspicious of magic. Real systems are built from decisions about meaning, ownership, and failure.

The stream is only the river.

You still need a map.

Frequently Asked Questions

What is event-driven architecture?

Event-driven architecture (EDA) decouples services by having producers publish events to a broker like Kafka, while consumers subscribe independently. This reduces direct coupling, improves resilience, and allows new consumers to be added without modifying producers.

When should you use Kafka vs a message queue?

Use Kafka when you need event replay, high throughput, long retention, or multiple independent consumers reading the same stream. Use a traditional message queue (RabbitMQ, SQS) when you need simple point-to-point delivery, low latency, or complex routing logic per message.

How do you model event-driven architecture in ArchiMate?

In ArchiMate, the Kafka platform is usually modeled as System Software (or a Node) realizing a Technology Service. Topics can be represented as Data Objects at the application layer, or as Artifacts at the technology layer. Producer and consumer services are Application Components connected via Flow relationships. This makes the event topology explicit and queryable.