There is a particular kind of optimism that appears a few months after a Kafka rollout.
The platform team has done the hard yards. Brokers are clustered. Topics are provisioned. Schemata are in a registry. Microservices are producing and consuming messages. Dashboards glow reassuringly green. Someone, usually in a steering committee, says: “We’re event-driven now.”
That sentence is almost always wrong.
Kafka is excellent technology. It is durable, fast, scalable, and operationally proven. It can become the backbone of a modern data and integration platform. But a backbone is not a nervous system, and a message log is not a domain model. If you take a tangled estate of services, databases, and competing business definitions and connect them with Kafka, you do not magically get event-driven architecture. You get a distributed system that can now spread confusion at higher throughput.
This is the central mistake. Teams confuse asynchronous transport with event-driven design. They treat topics as architecture, publish every table change as an “event,” and call the result decoupled. Meanwhile topic sprawl appears, semantics drift across bounded contexts, consumers code against accidental details, and the enterprise ends up with a shinier version of the same integration mess it had with ESBs and shared databases.
An event-driven architecture is not defined by Kafka. It is defined by business facts, domain boundaries, ownership, temporal thinking, and explicit contracts about what happened and why it matters. Kafka can support that beautifully. It can also undermine it if used as a glorified firehose.
That distinction matters because the bill always arrives later. Not in the demo. In year two. When a dozen teams have built assumptions into consumers. When “CustomerUpdated” means five different things. When reconciliation jobs outnumber business services. When audit asks why downstream systems disagree. When a domain event turns out to be a CRUD notification wearing better clothes.
This article is about that bill: why it appears, how to avoid it, and what an enterprise-grade path looks like if you want event-driven architecture rather than just event-shaped plumbing.
Context
Most enterprises do not start clean.
They have a core platform—ERP, policy admin, order management, payments, CRM, warehouse, whatever keeps the lights on. Around it sit line-of-business applications, reporting stores, digital channels, partner integrations, and a growing constellation of microservices. The initial pressure is usually sensible: reduce coupling, integrate faster, avoid brittle point-to-point APIs, support streaming use cases, maybe modernize the data estate. Kafka enters the scene because it solves several very real problems at once.
And then the architecture slides.
Instead of asking, “What business events exist in this domain, who owns them, and what decisions should they enable?” teams ask, “What should we put on a topic?” That is an implementation-first question. It sounds practical. It usually produces semantic debt.
Domain-driven design gives us a better lens. A business event is meaningful inside a bounded context. It expresses something that the domain considers noteworthy: OrderPlaced, PaymentAuthorized, PolicyBound, ShipmentDispatched. It is not merely “row changed in table X.” The event carries domain meaning, comes from an authoritative source, and should be stable enough that other teams can build on it without reverse-engineering internals.
That is the standard. Most Kafka programs fall below it at first.
Not because teams are foolish. Because they are under delivery pressure, and infrastructure is easier to standardize than meaning. Brokers can be installed in a quarter. Shared language takes longer.
Problem
The failure pattern usually has three symptoms.
First, topic sprawl. Every team creates topics for its own convenience. Some are domain events, some are CDC feeds, some are commands disguised as events, some are integration messages, some are analytics extracts, some are retry channels nobody owns. Naming conventions decay. Discovery becomes tribal. Governance responds with forms and committees, which slows things down without fixing semantics.
Second, semantic drift. Producers evolve messages according to local needs. A field called status starts as a business lifecycle state, then becomes a process checkpoint, then a UI convenience flag. Consumers infer meaning from patterns, not contracts. Different bounded contexts attach different business significance to the same event name. Before long, the enterprise has a distributed thesaurus of partial truths.
Third, accidental coupling through streams. Teams think they are decoupled because they do not call each other synchronously. They are not. They are coupled through schemas, delivery timing, ordering assumptions, retention policies, replay behavior, and undocumented interpretations. In many estates that coupling is harder to see and therefore harder to govern than REST dependencies.
A Kafka-based estate can become the new shared database. Only noisier.
Topic sprawl and semantic drift
The irony is that Kafka makes this easier to do, not harder. A low-friction platform amplifies both good design and bad design. If your domain language is clear, Kafka scales it. If your semantics are muddy, Kafka industrializes the mud.
Forces
Good architecture is forged between competing forces, not by slogans. Here the forces are strong and they pull in different directions.
Speed versus semantic quality
Teams want to integrate now. The business often rewards visible movement over conceptual clarity. Publishing table changes via CDC is fast. Designing stable domain events across bounded contexts is slower. The shortcut is attractive because it works immediately. It also externalizes the producer’s internal model and leaves consumers to absorb the mess.
Autonomy versus coherence
Microservices promise team autonomy. Domain-driven design encourages local models. That is healthy—up to a point. Enterprises still need coherent enterprise semantics for cross-cutting concepts like customer, order, invoice, claim, shipment, and payment. Too much local freedom creates translation chaos. Too much centralization creates an architecture review board masquerading as progress.
Historical truth versus current state
Event-driven systems are temporal. The time at which something happened matters. Yet many enterprise applications are state-centric. They know what is true now, not what changed and why. Teams end up emitting synthetic events from state differences. Sometimes that is acceptable. Often it produces fragile approximations that break under replay, reordering, or correction scenarios.
Reliability versus simplicity
Consumers want exactly-once semantics, strict ordering, no duplicates, no gaps, infinite retention, instant consistency, and zero operational burden. Reality does not oblige. Kafka can support strong guarantees in narrow cases, but enterprise workflows still need idempotency, reconciliation, compensations, and explicit handling of out-of-order and missing events. Simplicity is often purchased by pretending failure modes will not happen.
Local optimization versus enterprise value
A team may publish a stream perfect for its immediate consumers but harmful for the larger landscape. For instance, exposing every account balance recalculation as a fine-grained event may overwhelm downstream systems that actually need business milestones like InvoiceSettled or CreditLimitBreached. Event-driven architecture is not “publish everything.” It is “publish what matters.”
Solution
My advice is blunt: design events as domain contracts, not transport payloads.
Start with bounded contexts. Identify where a business fact becomes authoritative. Ask what happened in business terms, who owns that meaning, and which downstream decisions legitimately depend on it. That gives you candidate domain events. Then separate those from other message types you may still need:
- Domain events: meaningful business facts from a bounded context.
- Integration events: messages shaped for cross-context consumption, sometimes derived from domain events.
- CDC events: technical change notifications from persistence.
- Commands: requests for another component to do something.
- Processing events: internal workflow or orchestration signals.
This taxonomy matters. Many estates get into trouble because everything is called an event. Once all messages wear the same uniform, nobody can tell who they serve or how stable they should be.
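One way to stop everything wearing the same uniform is to tag each message with its kind at publication time, so tooling and reviewers can see what stability promise it carries. A minimal sketch; the names and promise strings are illustrative, not a standard:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

class MessageKind(Enum):
    DOMAIN_EVENT = "domain_event"            # meaningful business fact from a bounded context
    INTEGRATION_EVENT = "integration_event"  # curated for cross-context consumption
    CDC_EVENT = "cdc_event"                  # technical change notification from persistence
    COMMAND = "command"                      # request for another component to act
    PROCESSING_EVENT = "processing_event"    # internal workflow/orchestration signal

@dataclass(frozen=True)
class Message:
    kind: MessageKind
    name: str                                # e.g. "OrderPlaced" or "orders_table.cdc"
    payload: dict[str, Any] = field(default_factory=dict)

def stability_expectation(kind: MessageKind) -> str:
    """Different kinds carry different compatibility promises."""
    if kind in (MessageKind.DOMAIN_EVENT, MessageKind.INTEGRATION_EVENT):
        return "stable contract: versioned, backward compatible"
    if kind is MessageKind.CDC_EVENT:
        return "unstable: mirrors internal persistence, may change with the schema"
    return "private: owned by the producing team, no external consumers"
```

The point is not the enum itself but that the classification is machine-readable: a catalog or CI check can then refuse, for example, a consumer outside the owning team subscribing to a `PROCESSING_EVENT` stream.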
A practical enterprise approach looks like this:
- Establish event ownership by bounded context.
Sales owns OrderPlaced; Payments owns PaymentAuthorized; Fulfillment owns ShipmentDispatched. Nobody else publishes “their version” of those facts.
- Define semantic contracts explicitly.
Event names, intent, invariants, key identifiers, timestamps, causation/correlation metadata, and compatibility rules must be documented and versioned. Schema registry helps, but schemas alone do not define meaning.
- Use CDC deliberately, not aspirationally.
CDC is an excellent migration and data propagation tool. It is not automatically a domain event stream. Treat it as a lower-level feed unless you have done the work to map it into business semantics.
- Create translation where contexts differ.
A customer in CRM may be a party in MDM and a policyholder in underwriting. Forcing one universal event usually creates nonsense. Use anti-corruption layers and translators between contexts.
- Design for replay and correction.
Events are rarely perfect. Late data arrives. Facts are corrected. Systems miss messages. Reprocessing and reconciliation are not edge concerns; they are the architecture.
- Govern by catalog and stewardship, not bureaucracy.
You need discoverability, lineage, ownership, and standards. You do not need a central committee naming every topic. Good governance enables local delivery while preventing semantic landfill.
The heart of the solution is domain-driven design thinking. Kafka is the pipe. The architecture lives in the language and boundaries.
Architecture
A sound enterprise event architecture is layered. Not everything should be published raw and consumed directly by everybody.
A few principles make this work.
Separate domain streams from public integration streams
Inside a bounded context, the producing team may have richer internal events than the rest of the enterprise should see. That is fine. Publish a curated integration stream for broader use. This protects internal evolution and reduces accidental coupling. It is the event equivalent of not exposing your database schema as a public API.
Treat keys and ordering as business decisions
Ordering is not free, and global ordering is fantasy at enterprise scale. Partitioning should align with business identity and consistency needs. If consumers require all events for an OrderId in sequence, key by OrderId. If they really need a broader process order across payment and shipment, they may not want a stream at all—they may need a process manager, a saga, or a reconciled read model.
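To see why keying is a business decision: Kafka's producer maps each record key to a partition by hashing it, and ordering is guaranteed only within a partition. A simplified model of that property (Kafka itself uses murmur2 on the serialized key; a stable stdlib hash stands in here purely to illustrate the behavior that matters):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable key -> partition mapping, mimicking keyed partitioning.

    The property being illustrated: the same key always lands on the
    same partition, so all events for one business identity share a
    single, ordered sequence.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one order share a partition, so their relative order
# is preserved for any single consumer of that partition.
order_events = ["OrderPlaced", "PaymentAuthorized", "ShipmentDispatched"]
partitions = {partition_for("order-42", 12) for _ in order_events}
assert len(partitions) == 1
```

The corollary is the one in the paragraph above: if a consumer needs ordering across keys, no choice of key will give it to them, and the honest answer is a process manager or a reconciled read model, not a stream.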
Include lineage metadata
Every serious event should carry event time, source system, producer version, correlation ID, causation ID, and ideally a business identifier meaningful to downstream users. This is not paperwork. It is what makes debugging possible at 2 a.m.
Expect polysemy and model translations
“Customer” is a dangerous word in an enterprise. It often means account holder, legal entity, contact, policyholder, subscriber, buyer, or party. Do not flatten such concepts into one stream because the word looks shared. Preserve bounded context language and translate deliberately.
Build reconciliation paths
Eventually something will drift. A consumer falls behind. A message is malformed. A producer deploys a bug. A downstream system misses an update. If your only recovery mechanism is “replay the topic and hope,” you are underdesigned. You need authoritative snapshots, point-in-time extracts, repair jobs, and mismatch detection between systems of record and derived views.
Migration Strategy
Nobody gets from legacy integration to sound event-driven architecture in one leap. The right migration is progressive and frankly a little untidy. That is normal.
The best pattern here is a progressive strangler. You do not begin by declaring the enterprise event-driven. You begin by introducing event semantics around valuable seams.
Stage 1: Make existing changes visible
Use CDC or application hooks to expose changes from core systems. Do this with honesty: label these feeds as technical change streams, not polished domain events. They are useful for analytics, cache invalidation, synchronization, and learning where dependencies exist.
Stage 2: Identify high-value business events
Study consumer behavior. Where are teams repeatedly inferring the same business fact from low-level changes? That is where a first-class domain or integration event should emerge. Replace consumer guesswork with an explicit contract.
Stage 3: Introduce translation layers
Build event translators or integration services that convert technical feeds into semantic streams. This is where anti-corruption layers earn their keep. They stop a legacy model from infecting the target architecture.
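Such a translator is usually a small, deliberately boring service: it consumes a technical change record and emits a semantic event only when a business-meaningful transition occurred. A sketch under assumed column and event names from a hypothetical policy table:

```python
def translate_policy_cdc(before: dict, after: dict) -> list[dict]:
    """Anti-corruption layer: map a CDC row change on a hypothetical
    policy table into domain events. Column names, status values, and
    event names are illustrative, not taken from any real system."""
    events = []
    # A status transition is a business fact; a mere row update is not.
    if before.get("status") != "BOUND" and after.get("status") == "BOUND":
        events.append({"event_type": "PolicyBound",
                       "policy_id": after["policy_id"]})
    if before.get("status") != "CANCELLED" and after.get("status") == "CANCELLED":
        events.append({"event_type": "PolicyCancelled",
                       "policy_id": after["policy_id"]})
    # Everything else (audit columns, timestamps, reformatting) is noise
    # from the domain's point of view: emit nothing.
    return events
```

Note what the translator absorbs: the legacy status codes, the row shape, and the fact that most updates mean nothing. None of that leaks into the target architecture.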
Stage 4: Move source ownership upstream
Once confidence grows, shift event publication closer to the application service or domain model where the business decision is made. This usually produces cleaner timing, richer intent, and less leakage of database design.
Stage 5: Retire redundant topics and consumers
Migration fails when every old and new stream survives forever. Topic retirement must be explicit: deprecation notices, usage discovery, migration windows, and kill dates.
This migration approach has a crucial advantage: it accepts that semantics mature. Early streams teach you where the real business facts are. The mistake is pretending stage 1 is stage 5.
Reconciliation during migration
Migration creates periods where multiple representations coexist: the old batch interface, the new event stream, a newly built read model, perhaps a surviving API. During this phase, reconciliation is not optional.
You need to compare counts, states, and critical business invariants across old and new paths. For example:
- orders created in source versus orders observed in event stream
- payments authorized versus invoices marked payable
- shipment events versus warehouse status changes
- customer consent changes versus downstream marketing eligibility
Without reconciliation, teams discover semantic mismatches only after business users do. And business users are much less charitable.
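A first discrepancy report can be nothing more than a keyed set comparison between the system of record and what the stream delivered. The shapes below are assumed for illustration:

```python
def reconcile(source_ids: set[str], observed_ids: set[str]) -> dict:
    """Compare business keys in the source system against keys observed
    downstream via the event stream, and surface the breaks to investigate."""
    return {
        "missing_downstream": sorted(source_ids - observed_ids),  # never arrived
        "unknown_upstream": sorted(observed_ids - source_ids),    # phantom or stale
        "matched": len(source_ids & observed_ids),
    }

report = reconcile(
    source_ids={"ORD-1", "ORD-2", "ORD-3"},   # orders created in source
    observed_ids={"ORD-1", "ORD-3", "ORD-9"}, # orders seen on the stream
)
```

Counts come first; state and invariant comparison (is the order that exists in both places in the same lifecycle state?) comes next, and is where the semantic mismatches actually surface.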
A mature migration plan includes:
- dual-run periods
- automated discrepancy reports
- replayable backfill mechanisms
- operational thresholds for acceptable lag and mismatch
- explicit ownership for investigating breaks
That sounds expensive. It is cheaper than discovering after cutover that “completed” meant one thing in the source and another in the consumer.
Enterprise Example
Consider a global insurer modernizing claims and policy servicing.
The estate is familiar: a policy administration platform, a claims system, CRM, document management, payment platform, data warehouse, and a growing set of digital microservices for broker portals and customer self-service. The firm introduces Kafka to decouple integrations and enable near real-time updates.
The first year looks successful. Teams publish topics like PolicyUpdated, ClaimUpdated, CustomerUpdated, and PaymentUpdated. Consumers multiply. The portal team uses these streams to refresh views. The analytics team ingests them into a lakehouse. Finance builds controls around payment flows.
Then the cracks appear.
PolicyUpdated is emitted for premium recalculation, broker reassignment, endorsement issue, document regeneration, and address change. Some consumers treat it as “policy materially changed.” Others treat it as “refresh your copy.” The claims team expects policy coverage changes to arrive before first-notice-of-loss adjudication. Sometimes they do not. The CRM team enriches CustomerUpdated with contact preferences that are not authoritative in the policy context. Downstream marketing uses them anyway. Finance notices payment reversal scenarios are not represented clearly enough to reconstruct ledger state. Replay causes duplicate correspondence because the document service assumed events were always new.
None of this is a Kafka failure. It is a semantic failure.
The insurer resets the architecture around bounded contexts.
- Policy context now publishes explicit events such as PolicyBound, PolicyEndorsed, CoverageChanged, and PolicyCancelled.
- Claims context publishes ClaimRegistered, ClaimAccepted, ClaimRejected, and ClaimSettled.
- Payments context publishes ClaimPaymentAuthorized, ClaimPaymentIssued, and ClaimPaymentReversed.
- CDC feeds from old systems remain, but are reclassified as technical streams and hidden from most consumers.
- A curated event catalog records definitions, examples, ownership, compatibility rules, and intended consumers.
- Read models for the portal are rebuilt from semantic events plus periodic snapshot reconciliation.
- A reconciliation service compares policy state in the administration system with downstream policy views, flagging divergence by business key.
The result is not perfection. It is clarity. Fewer consumers depend directly on opaque update streams. New teams discover the right event instead of inventing their own. Most importantly, business conversations improve. Architects and domain experts can point to events and say, “That is a meaningful fact,” rather than “That usually means the row changed in a way we think matters.”
That is what good event-driven architecture feels like. Less clever. More trustworthy.
Operational Considerations
Operational discipline is where lofty architecture survives contact with production.
Schema evolution is necessary but insufficient
Backward and forward compatibility matter. So do required fields, defaults, enum changes, and version strategy. But schema compatibility does not guarantee semantic compatibility. Adding an optional field can still break consumers if it changes the implied meaning of an event. Platform tooling catches structure. Stewardship must catch meaning.
Consumer idempotency is table stakes
Duplicates happen. Replays happen. Retries happen. If a consumer sends an email, posts a payment, creates a case, or triggers a workflow, it must be idempotent or explicitly deduplicated. “Kafka guarantees” is not a strategy.
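The minimal shape of an idempotent consumer is a durable record of processed event IDs, checked before the side effect fires. Here the store is an in-memory set standing in for a database table or compacted topic:

```python
class IdempotentConsumer:
    """Skips events it has already processed, so replays and redeliveries
    do not repeat side effects. The processed-ID store is in memory for
    illustration; in production it must be durable and updated atomically
    with the side effect itself."""

    def __init__(self, side_effect):
        self._processed: set[str] = set()
        self._side_effect = side_effect

    def handle(self, event_id: str, payload: dict) -> bool:
        if event_id in self._processed:
            return False            # duplicate: acknowledge, do nothing
        self._side_effect(payload)  # e.g. send email, post payment, open case
        self._processed.add(event_id)
        return True

sent = []
consumer = IdempotentConsumer(side_effect=sent.append)
consumer.handle("evt-1", {"to": "a@example.com"})
consumer.handle("evt-1", {"to": "a@example.com"})  # broker redelivery
assert len(sent) == 1  # the email went out exactly once
```

The hard part in practice is the "atomically with the side effect" comment: if the effect and the dedup record can commit independently, a crash between them reintroduces duplicates.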
Lag is a business metric, not just a platform metric
Consumer lag in partitions is useful. But enterprises should also measure business lag: time from OrderPlaced to visibility in fulfillment, from PaymentAuthorized to financial control update, from consent change to downstream enforcement. Platforms move bytes; businesses care about consequences.
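Business lag is simply the delta between the fact's event time and the moment its downstream consequence became visible; a sketch, with the event names assumed from earlier in the article:

```python
from datetime import datetime, timezone

def business_lag_seconds(event_time: datetime, visible_at: datetime) -> float:
    """Lag between when a business fact happened (e.g. OrderPlaced) and
    when its consequence appeared downstream (e.g. visibility in
    fulfillment). Alert on this, not only on partition offsets."""
    return (visible_at - event_time).total_seconds()

placed = datetime(2024, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
seen_in_fulfillment = datetime(2024, 3, 1, 12, 0, 42, tzinfo=timezone.utc)
lag = business_lag_seconds(placed, seen_in_fulfillment)
```

This only works if `event_time` is genuinely business time, which is one more reason the envelope contract has to distinguish it from publish time.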
Dead-letter topics need ownership
A dead-letter queue without an operating model is where messages go to be forgotten. Every DLQ should have triage ownership, retention policy, replay procedure, and clear classification of technical versus semantic errors.
Replay is both power and danger
Replaying a topic can rebuild a read model. It can also resend side effects, violate assumptions, or corrupt downstream systems that are not replay-safe. Mark side-effecting consumers clearly and separate derivation consumers from action consumers where possible.
Retention and compaction are design choices
Retaining event history supports audit and rebuilds. Compaction supports latest-state use cases. Many enterprises need both, but on different streams. Deciding this late often forces awkward compromises.
Tradeoffs
There is no free architecture here.
A strongly semantic event model improves clarity and resilience, but it takes longer to design and requires tighter collaboration with domain experts. Teams used to publishing arbitrary payloads will feel constrained.
Using curated integration events reduces coupling, but it introduces translation layers and can delay access to low-level changes some consumers legitimately need.
CDC accelerates migration and broad data availability, but if overused it leaks implementation details and invites consumers to depend on unstable internals.
Rich governance improves discoverability and trust, but heavy centralized control can become a bottleneck and drive teams into shadow topics.
Reconciliation and repair mechanisms increase operational cost, but pretending they are unnecessary only shifts cost into outages and manual correction later.
This is the real tradeoff line: you can pay in design and governance now, or pay in ambiguity and repair later. Large enterprises often try to avoid both. They never do.
Failure Modes
Some failure modes are common enough to deserve names.
Event-shaped CRUD
A service emits CustomerCreated, CustomerUpdated, CustomerDeleted, but these are just wrappers around persistence operations. Consumers cannot tell what business significance changed, so they inspect fields and invent rules. The event model becomes a remote table API.
Canonical model fantasy
An enterprise attempts one universal event schema for concepts like customer or order across all domains. The result is either bloated and meaningless or so abstract that nobody can use it cleanly. Canonical models are often political compromises pretending to be architecture.
Topic per team, no stewardship
Teams create topics freely, ownership is unclear, names are inconsistent, and consumer discovery is tribal. Reuse collapses. Duplicate streams proliferate. The platform becomes a marketplace of near-synonyms.
Hidden synchronous dependency
A consumer treats an event stream as if it were a request-response API with strict timing guarantees. When lag or temporary unavailability occurs, business processes fail because asynchronous design was accepted in infrastructure but rejected in process design.
No correction model
Events are assumed to be final. Then the business needs corrections, reversals, or late-arriving data. Teams mutate topics, issue opaque “update” events, or manually patch databases. Auditability and consumer logic both deteriorate.
Replay catastrophe
A topic is replayed to rebuild a projection, but downstream side-effecting consumers are not isolated. Duplicate notifications, duplicate payments, and duplicate case creation follow. The platform is blamed; the real issue is missing replay boundaries.
When Not To Use
You should not force event-driven architecture everywhere.
Do not use Kafka as a substitute for straightforward synchronous collaboration when the business process requires immediate validation and simple request-response semantics. Not every interaction improves when placed on a log.
Do not expose domain events from systems that cannot reliably determine business facts and only know state snapshots. In such cases, an API, batch export, or CDC feed may be more honest than pretend events.
Do not build a semantic event mesh for domains that are still in rapid conceptual churn with no stable language, unless you can absorb frequent contract changes. Sometimes the domain needs to settle before the event model should harden.
Do not put side-effect-heavy workflows on streams unless you are prepared for idempotency, compensation, replay safety, and observability. If your operating model is not ready, a simpler orchestration approach may be safer.
And do not use Kafka because the company wants to say it has an event-driven architecture. Architecture by branding has a short shelf life.
Related Patterns
Several patterns fit naturally around this approach.
Outbox pattern for reliable event publication from transactional systems. Essential when domain events must be emitted with data changes without dual-write risk.
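The mechanics of the outbox pattern: the event row commits in the same database transaction as the state change, and a separate relay later publishes unsent rows to the broker. A sketch using SQLite to stand in for the transactional store, with table and column names chosen for illustration:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def place_order(order_id: str) -> None:
    # State change and event row commit atomically: no dual-write risk.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("OrderPlaced", json.dumps({"order_id": order_id})))

def relay_once() -> list:
    """A separate relay process would poll unpublished rows and hand them
    to Kafka, marking them published only after the broker acknowledges."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for seq, _, _ in rows:
        conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    conn.commit()
    return rows

place_order("ORD-7")
published = relay_once()
```

Because the relay may crash between publishing and marking, delivery is at-least-once, which is exactly why the consumer idempotency discussed earlier is table stakes.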
Change Data Capture for migration, legacy visibility, and broad propagation of state changes. Valuable, but not equivalent to semantic eventing.
Anti-corruption layer for translating between legacy schemas and target domain language. This is often the key to progressive migration.
Saga / process manager when business processes span multiple services and need coordination rather than naïve assumptions about eventual ordering.
CQRS and read models for building derived views optimized for consumers, especially portals and analytics, using event streams plus snapshot/reconciliation strategies.
Event sourcing, occasionally. Useful when the domain genuinely benefits from an event-native model and temporal audit is central. Not a prerequisite for event-driven architecture, and often overused.
Summary
Kafka is a formidable piece of engineering. But it does not make your architecture event-driven any more than buying a violin makes you a musician.
Event-driven architecture begins with business meaning. It depends on bounded contexts, explicit ownership, stable semantics, and a willingness to model time, correction, and disagreement between systems. Kafka can carry those events elegantly. It can also carry nonsense at scale.
The enterprise danger is not technical failure first. It is semantic decay: topic sprawl, event-shaped CRUD, duplicate meanings, accidental coupling, and consumers coding against implied behavior instead of domain contracts. By the time this becomes visible, the platform is successful enough that undoing it is painful.
The remedy is not less Kafka. It is more architecture.
Use domain-driven design to decide what constitutes an event. Separate technical change feeds from semantic contracts. Introduce curated integration streams. Migrate progressively with a strangler approach. Build reconciliation into the plan, not as an afterthought. Govern by ownership and catalogs rather than committees and folklore. Be explicit about tradeoffs and honest about failure modes.
Most of all, remember this: an event is not something that changed in a database; it is something the business cares happened.
If you hold that line, Kafka becomes a powerful enabler.
If you do not, it becomes the fastest way in the building to distribute ambiguity.
Frequently Asked Questions
What is event-driven architecture?
Event-driven architecture (EDA) decouples services by having producers publish events to a broker like Kafka, while consumers subscribe independently. This reduces direct coupling, improves resilience, and allows new consumers to be added without modifying producers.
When should you use Kafka vs a message queue?
Use Kafka when you need event replay, high throughput, long retention, or multiple independent consumers reading the same stream. Use a traditional message queue (RabbitMQ, SQS) when you need simple point-to-point delivery, low latency, or complex routing logic per message.
How do you model event-driven architecture in ArchiMate?
In ArchiMate, the Kafka broker is a Technology Service or Application Component. Topics are Data Objects or Application Services. Producer/consumer services are Application Components connected via Flow relationships. This makes the event topology explicit and queryable.