The Hard Part of Data Platforms Is Semantics

Most data platform failures do not begin with bad technology. They begin with polite lies.

A sales system calls something a customer. Billing calls it an account. Support calls it a tenant. Marketing insists it is a lead until some campaign threshold is crossed, after which it becomes a contact, except in Europe where the consent model changes the rules again. Then the data platform arrives with noble intentions and a large budget, scoops all of this into a lake, and announces that the enterprise now has a “single source of truth.”

It doesn’t. It has a larger container for ambiguity.

That is the hard part of data platforms. Not storage. Not distributed compute. Not whether Kafka is better than batch for this workload or whether your warehouse should be federated. The hard part is semantics: what things mean, who gets to define that meaning, and how those meanings travel across an organization without dissolving into a fog of overloaded fields and dashboards nobody trusts.

If you get semantics wrong, your platform becomes a translation machine that has forgotten all its languages. Data moves beautifully and means nothing reliably. Teams stop arguing in architecture reviews and start arguing in finance meetings. That is when the bill arrives.

The fix is not another canonical model imposed from a central tower. That fantasy has failed in every large enterprise I have seen. The real answer is domain mapping topology: a deliberate architecture that recognizes domain boundaries, models meaning where it originates, and makes translation explicit rather than accidental. This is domain-driven design applied to the data platform, with all the messiness that implies.

A good data platform is not a giant database. It is a semantic contract system with plumbing attached.

Context

For twenty years, enterprises built data estates in cycles. First came operational systems, each optimized for a line of business. Then came the warehouse to bring order. Then came the data lake to absorb everything the warehouse could not. Then stream platforms arrived, often Kafka at the center, promising real-time integration and event-driven elegance. More recently, data mesh and data products reframed the conversation around ownership and domain accountability.

Each wave solved a real problem. None solved semantics by itself.

The modern enterprise now runs a mixed economy of data: transactional databases in microservices, legacy ERP platforms, SaaS applications, event streams, MDM hubs, warehouses, lakehouses, and reverse ETL pipelines pushing data back into tools that should probably not be systems of record but are treated as such anyway. Architecture diagrams look impressive. Meaning still leaks at every seam.

This is why domain-driven design matters here. DDD was never just about microservices. It is a way of controlling meaning in large systems. Bounded contexts are not a coding trick; they are a survival mechanism. They acknowledge that “customer” can validly mean different things in different contexts, and that forcing universal agreement often causes more damage than difference itself.

A data platform that ignores bounded contexts becomes an empire of accidental coupling. Every shared table becomes a treaty nobody signed.

Problem

The central problem is simple to state and brutal to solve:

How do you create enterprise-wide data interoperability without flattening the domain semantics that make systems useful in the first place?

Most organizations answer this poorly in one of two ways.

The first is semantic centralization. A central data team defines canonical entities, often with admirable rigor, and tells every system to publish or conform to these shared definitions. This sounds tidy. In practice, it creates endless negotiation, diluted models, and a canonical schema so abstract that it stops representing any real business process. The model becomes politically acceptable because it is operationally useless.

The second is semantic anarchy. Every team publishes whatever they have, labels it a data product, and trusts catalogs, lineage tools, and tribal knowledge to sort out the rest. This scales autonomy but not comprehension. Consumers end up reverse-engineering producer intent from field names and Slack threads. The platform becomes discoverable but not intelligible.

Both fail because they confuse data movement with meaning management.

What is needed is not one enterprise model and not pure decentralization. It is a topology of domain mappings: a set of explicit relationships between bounded contexts, with clear ownership of source semantics, translation rules, and reconciliation logic.

That phrase sounds theoretical. It is not. It is the difference between an order event that tells you “an order was placed” and an order event that can actually be reconciled with invoicing, fulfillment, returns, and finance close.

Forces

Several forces make this hard in the real world.

1. Domains are inherently local

Meaning emerges from process. Sales qualifies leads differently from underwriting. Claims adjudication thinks in incidents and policies, not households and campaigns. Finance cares about legal entity, posting period, and auditability. These are not naming disagreements. They are distinct worldviews shaped by work.

A data platform must preserve that locality or it destroys the very fidelity decision-makers need.

2. Enterprises still need cross-domain views

The board does not care that support and billing use different customer concepts. They want churn, revenue, profitability, risk exposure, and service performance at enterprise scope. Regulatory reporting, customer 360, and operational optimization all require integration across domains.

So the architecture must support enterprise answers without pretending enterprise meaning originates in one place.

3. Change is constant

Product teams evolve workflows. Microservices split and merge. Kafka topics are versioned. Acquisitions introduce entirely new data vocabularies. Regulatory shifts redefine key business concepts. A semantic model that cannot tolerate change becomes a museum.

4. Legacy systems are stubborn

The neatest domain model in your event-driven architecture still depends on SAP, Oracle, mainframes, spreadsheets, and SaaS APIs with awkward semantics. Migration is never greenfield. It is archaeology with SLAs.

5. Reconciliation is unavoidable

Different systems describe the same business reality from different angles and at different times. They disagree. Sometimes they should. Sometimes they must not. Knowing which is which is architecture work, not cleanup.

6. Incentives are misaligned

Producer teams optimize for local delivery. Consumer teams optimize for broad usability. Central data teams optimize for standardization. Governance optimizes for control. Every one of these is rational. Together they can produce paralysis.

This is why semantics is hard. Not because the concepts are obscure, but because the organization is.

Solution

The pattern I recommend is domain mapping topology.

At its heart are five principles.

1. Model data around bounded contexts, not enterprise fantasies

Each domain owns its semantic model and publishes data from that context. Sales publishes sales concepts. Billing publishes billing concepts. Fulfillment publishes fulfillment concepts. They do not pretend to speak for the rest of the enterprise.

This is classic DDD. The source context owns the language of its events and data products.

2. Make translations explicit

Cross-domain integration happens through mapping layers, not by smuggling universal meaning into source schemas. A domain map defines correspondences between concepts, identifiers, lifecycle states, and quality expectations.

For example, Sales CustomerProspect, Billing Account, and Support Tenant may map into an enterprise analytical concept such as CommercialRelationship, but that concept is not pushed back as the “real truth” into all operational systems.
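To make the shape of such a mapping layer concrete, here is a minimal sketch in Python. The concept names (CustomerProspect, Account, Tenant, CommercialRelationship) come from the example above; the fields, identifier formats, and mapping rules are illustrative assumptions. The point is that translation logic lives outside the source schemas and keeps lineage to its originating context.

```python
from dataclasses import dataclass

# Source-context concepts, each owned by its own bounded context.
# Field names here are assumptions for illustration only.

@dataclass(frozen=True)
class SalesCustomerProspect:
    prospect_id: str
    legal_name: str
    stage: str            # e.g. "qualified", "converted"

@dataclass(frozen=True)
class BillingAccount:
    account_id: str
    holder_name: str
    status: str           # e.g. "active", "suspended"

@dataclass(frozen=True)
class CommercialRelationship:
    """Analytical projection assembled from several contexts.

    It carries explicit source lineage and never claims to be the
    operational truth of any one system.
    """
    relationship_id: str
    party_name: str
    source_context: str   # which bounded context this view came from

def from_sales(p: SalesCustomerProspect) -> CommercialRelationship:
    # The mapping rule lives here, not in the Sales schema itself.
    return CommercialRelationship(
        relationship_id=f"sales:{p.prospect_id}",
        party_name=p.legal_name,
        source_context="sales",
    )

def from_billing(a: BillingAccount) -> CommercialRelationship:
    return CommercialRelationship(
        relationship_id=f"billing:{a.account_id}",
        party_name=a.holder_name,
        source_context="billing",
    )
```

Note that nothing is pushed back upstream: Sales and Billing keep their own language, and the enterprise concept exists only in the mapping layer.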

3. Separate source truth from reconciled truth

Operational truth belongs to source systems within their bounded context. Enterprise truth is often reconciled truth: fit for a purpose, assembled from multiple contexts, and governed by explicit precedence and timing rules.

That distinction matters. Teams get into trouble when they claim a reconciled model is a universal master. It is usually a projection for a use case.

4. Put topology before tooling

Kafka, catalogs, schema registries, lakehouses, dbt, MDM, graph stores—useful tools, all of them. But the architecture starts with identifying domains, upstream and downstream relationships, anti-corruption layers, and reconciliation zones. Tooling should reflect the semantic topology, not define it.

5. Migrate progressively with a strangler approach

No enterprise rewrites semantics in one move. You carve away integration from brittle shared databases and point-to-point ETL piece by piece. New domain-aligned streams and data products coexist with legacy feeds until confidence, coverage, and reconciliation maturity are sufficient.

The migration path is part of the design. If the target architecture only works after full adoption, it is probably a slide deck, not an architecture.

Architecture

A practical architecture has four semantic layers.

  1. Source domain layer: Operational systems and microservices publish events or change data in their own language.
  2. Mapping and translation layer: Anti-corruption logic maps domain terms, identifiers, and lifecycle transitions into downstream contracts. This can be implemented via stream processing, integration services, or transformation pipelines.
  3. Reconciliation layer: Conflicts, duplicates, late arrivals, and cross-domain assembly are resolved for specific enterprise views. This is where survivorship, temporal alignment, and business rules live.
  4. Consumption layer: Analytical models, ML features, APIs, and operational read models consume either source-aligned or reconciled data depending on need.

Here is the topology at a high level.

Diagram 1: Architecture overview

A few things are worth saying plainly.

First, the reconciliation zone is not a junk drawer. It is where enterprise-level semantics are assembled intentionally. It should be small in number, high in discipline, and aligned to important business use cases.

Second, not every consumer needs reconciled data. Many teams are better served by source-aligned products with clear semantics. Forcing all data through central reconciliation adds latency, cost, and distortion.

Third, Kafka fits naturally as the backbone for domain event propagation, but Kafka does not solve semantic mismatch. A topic named customer-events with Avro schemas is still semantically ambiguous if nobody agrees what a customer is in that context. Schema is not meaning. It is the silhouette of meaning.

Domain contracts and mapping contracts

I prefer to distinguish between two contract types:

  • Domain contracts: published by the owning domain; define source semantics.
  • Mapping contracts: owned by the integration or consuming capability; define how source concepts are translated for a target use case.

This avoids one of the great sins of enterprise integration: asking source teams to own everyone else’s interpretation.
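The distinction can be expressed as data structures. This is a sketch under assumed field names, not a standard format: the essential property it encodes is that the mapping contract is owned by the consuming side, while the domain contract stays with the source team.

```python
from dataclasses import dataclass

@dataclass
class DomainContract:
    """Published by the owning domain; defines source semantics."""
    domain: str          # owning bounded context
    entity: str          # concept name in the domain's own language
    owner_team: str
    glossary: dict       # field -> business definition
    version: str

@dataclass
class MappingContract:
    """Owned by the integration or consuming capability; defines
    how source concepts are translated for one target use case."""
    source: DomainContract
    target_concept: str
    owner_team: str      # deliberately NOT the source team
    field_mappings: dict # source field -> target field
    version: str

# Illustrative instances; names and versions are assumptions.
billing_account = DomainContract(
    domain="billing",
    entity="Account",
    owner_team="billing-platform",
    glossary={"status": "Billing lifecycle state, not service state"},
    version="2.1.0",
)

exposure_mapping = MappingContract(
    source=billing_account,
    target_concept="CommercialRelationship",
    owner_team="enterprise-analytics",  # consumer owns the interpretation
    field_mappings={"holder_name": "party_name"},
    version="1.0.0",
)
```

Versioning the two contracts independently lets a consumer change its interpretation without forcing a release on the source domain, and vice versa.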

Diagram 2: Domain contracts and mapping contracts

This sequence looks straightforward. In practice, the architecture work is in the boxes labeled “Map” and “Resolve.” That is where politics, policy, and process are encoded.

Reconciliation as a first-class capability

Reconciliation deserves more respect than it usually gets. Teams often treat it as either data quality cleanup or MDM. It is broader than both.

Reconciliation answers questions like:

  • If Sales says a conversion happened yesterday, Billing says the account became active today, and Support has no tenant yet, what is the lifecycle state for enterprise reporting?
  • If two acquired businesses use different customer identifiers, when do we merge identities and when do we preserve legal separation?
  • If revenue is recognized by Finance on one schedule and invoiced by Billing on another, which timeline is correct for which report?

These are semantic questions with technical consequences.

A sound reconciliation capability usually includes:

  • identity resolution
  • temporal alignment
  • survivorship rules
  • lineage back to source context
  • confidence or quality scoring
  • exception workflows for unresolved conflicts

Without those, your enterprise views become smooth lies.
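As one illustration of how several of those capabilities combine, here is a hedged sketch of a survivorship rule for the lifecycle question above. The precedence order, confidence formula, and record shape are all use-case-specific assumptions; the structural point is that the reconciled result carries lineage, a confidence signal, and an explicit list of unresolved contexts for the exception workflow.

```python
from datetime import date

# Hypothetical precedence: billing activation outranks sales conversion
# for this particular enterprise view. A different view could invert it.
PRECEDENCE = ["billing", "sales", "support"]

def reconcile_lifecycle(observations: dict) -> dict:
    """observations maps context -> (state, effective_date) or None."""
    for context in PRECEDENCE:
        obs = observations.get(context)
        if obs is not None:
            state, effective = obs
            missing = [c for c in PRECEDENCE if observations.get(c) is None]
            return {
                "state": state,
                "effective_date": effective,
                "decided_by": context,           # lineage back to source
                "confidence": 1.0 - 0.2 * len(missing),  # assumed scoring
                "unresolved_contexts": missing,  # feeds exception workflow
            }
    # No silent drop: unresolvable records go to a human queue.
    raise ValueError("no observations to reconcile")

# The scenario from the text: Sales converted yesterday, Billing
# activated today, Support has no tenant yet.
result = reconcile_lifecycle({
    "sales": ("converted", date(2024, 5, 1)),
    "billing": ("active", date(2024, 5, 2)),
    "support": None,
})
```

Even this toy version makes the precedence decision reviewable and testable, which is precisely what ad hoc cleanup scripts fail to do.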

Migration Strategy

The migration strategy should be progressive strangler, not big bang replacement.

Big bang semantic redesign fails for the same reason big bang ERP programs fail: the business keeps moving while architecture is trying to freeze the map. By the time the canonical model is approved, reality has changed.

A strangler migration works differently.

Step 1: Identify high-friction semantic seams

Find the places where meaning mismatch creates real cost: revenue reporting, customer 360, order-to-cash, claims-to-payment, product profitability. Start where bad semantics hurts decisions, compliance, or money.

Step 2: Draw bounded contexts and existing flows

Do not begin with target schemas. Begin with language. Which systems define what? Where are terms overloaded? Which identifiers are local versus enterprise-scoped? Which integrations are actually hidden semantic transforms?

Step 3: Publish source-aligned data products

Expose domain events or CDC feeds from source systems with clear ownership, lineage, and glossary definitions. Resist the urge to pre-normalize into a central enterprise format.

Step 4: Build anti-corruption mappings for one use case at a time

Create mapping pipelines for a specific enterprise capability, such as churn analytics or revenue reconciliation. Make transformation logic visible and testable.

Step 5: Introduce a reconciliation zone

Persist reconciled entities and facts alongside source lineage. Measure disagreement rates, latency, and confidence before expanding scope.

Step 6: Strangle legacy integrations

As consumers move to mapped or reconciled products, retire point-to-point ETL, shared reporting tables, and brittle batch extracts. Keep both paths during transition long enough to compare outputs and find semantic drift.
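Comparing the two paths during the parallel run can be as simple as a keyed diff with a tolerance. This sketch assumes both pipelines emit numeric outputs keyed by the same identifier; the record shape and thresholds are illustrative, not prescriptive.

```python
# Quantify disagreement between the legacy path and the new
# domain-aligned path before retiring the old integration.

def semantic_drift(legacy: dict, modern: dict, tolerance: float = 0.0) -> dict:
    """Compare keyed numeric outputs from both pipelines."""
    all_keys = legacy.keys() | modern.keys()
    mismatches = {}
    for key in sorted(all_keys):
        old, new = legacy.get(key), modern.get(key)
        if old is None or new is None:
            mismatches[key] = ("missing", old, new)   # coverage gap
        elif abs(old - new) > tolerance:
            mismatches[key] = ("value", old, new)     # semantic drift
    return {
        "compared": len(all_keys),
        "mismatch_rate": len(mismatches) / len(all_keys),
        "mismatches": mismatches,
    }

report = semantic_drift(
    legacy={"acct-1": 100.0, "acct-2": 250.0},
    modern={"acct-1": 100.0, "acct-2": 249.0, "acct-3": 75.0},
    tolerance=0.5,
)
```

The mismatch rate over time is the signal that tells you when the legacy feed is safe to strangle; a flat, explained residual is the goal, not zero.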

Step 7: Institutionalize semantic governance

Not governance as committee theater. Governance as product management for meaning: glossaries tied to domains, contract reviews, versioning policy, and ownership of mappings.

Here is what that progressive migration often looks like.

Diagram 3: Progressive strangler migration

Why this migration works

It creates value early. It limits blast radius. It lets the enterprise compare old and new outputs. Most importantly, it acknowledges that semantics are discovered in use, not declared upfront with complete certainty.

That matters. Architects often want semantic closure too soon. But in migration, some ambiguity must be surfaced before it can be resolved. Parallel runs are not just for confidence; they are for learning what the business really means under pressure.

Enterprise Example

Consider a global telecom company. Three business units had grown through acquisition. Consumer mobile, enterprise connectivity, and digital services all had separate CRMs, billing stacks, support tooling, and data warehouses. Leadership wanted an enterprise customer view and cross-sell analytics. The first attempt was textbook centralization: define a canonical customer model, mandate conformance, and feed a central lake.

It failed in eighteen months.

Why? Because “customer” meant at least four different things:

  • a legal contracting party in enterprise sales
  • a billing account holder in consumer mobile
  • a technical service tenant in digital services
  • a support relationship attached to one or more subscriptions

The canonical model blurred these distinctions. Enterprise contracts got flattened into consumer-like accounts. Technical tenancy was treated as a subtype instead of its own context. Reports began disagreeing with source systems. Trust collapsed. Teams started exporting local extracts to rebuild the views they recognized.

The second attempt was better because it was humbler.

The architecture team redrew the landscape into bounded contexts: Prospect Management, Contracting, Billing, Service Provisioning, Support Case Management, and Finance. Kafka was introduced as the event backbone for new and modernized systems, while CDC captured changes from older platforms. Each domain published source-aligned events and relational snapshots.

Then a separate mapping capability was built for two enterprise use cases only:

  1. group-level customer exposure
  2. order-to-cash revenue reconciliation

The team did not create a universal Customer. They created:

  • CommercialPartyView for exposure and relationship analytics
  • RevenueObligationView for finance and operational reconciliation

Those were reconciled projections, not universal truths.

Identity resolution linked legal entities, billing accounts, subscriptions, and technical tenants through confidence-scored relationships. Support cases remained in their own semantic context but could be associated with reconciled commercial relationships when needed.

The result was not elegance in the abstract. It was usefulness in the concrete. Finance close variance dropped because order, invoice, and activation timing could be explained in one model. Cross-sell analytics improved because the platform distinguished legal party from service footprint. Support metrics became more credible because tenant-level outages were no longer naively rolled up to account-level churn models.

Most importantly, teams stopped fighting the platform. They could see themselves in it.

That is usually the mark of good enterprise architecture. Not universal admiration. Reduced evasive behavior.

Operational Considerations

A semantics-first platform has operational consequences.

Data product ownership

Every domain product needs a real owner with both business and technical accountability. If ownership sits only with the central platform, semantics will drift away from source reality. If it sits only with engineering, business meaning will be under-specified.

Schema evolution

Versioning strategy must distinguish additive technical change from semantic change. A field added for convenience is one thing. A lifecycle state redefinition is another. Consumers need a way to detect meaning shifts, not just schema compatibility.
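One way to operationalize that distinction is a classifier in the contract review pipeline. This is a sketch under assumptions: schemas are modeled as flat field-to-type dictionaries, and each domain contract declares which fields carry semantic weight.

```python
# Classify a schema change so the version policy can react:
# semantic redefinitions force a major version; additive technical
# changes only bump the minor version. semantic_fields is an
# assumed declaration a real contract would carry explicitly.

def classify_change(old: dict, new: dict, semantic_fields: set) -> str:
    removed = old.keys() - new.keys()
    changed = {f for f in old.keys() & new.keys() if old[f] != new[f]}
    if (removed | changed) & semantic_fields:
        return "semantic-breaking"    # meaning shifted: major version
    if removed or changed:
        return "technical-breaking"   # breaking, but meaning intact
    if new.keys() - old.keys():
        return "additive"             # minor version bump
    return "none"

v1 = {"account_id": "string", "status": "enum:active|closed"}
v2 = {"account_id": "string", "status": "enum:active|closed|dormant",
      "region": "string"}

# Extending the status enum changes a lifecycle definition, so it is
# flagged semantic even though it is schema-compatible.
kind = classify_change(v1, v2, semantic_fields={"status"})
```

Schema-registry compatibility checks would pass the v1-to-v2 change; the semantic classifier is what catches that a lifecycle state was redefined.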

Metadata and lineage

Catalogs matter, but they must capture semantic lineage, not just pipeline lineage. Consumers need to know not only where a field came from, but what translation and reconciliation rules touched it.

Testing

Traditional data quality tests are not enough. You need semantic tests:

  • does the mapped lifecycle obey business invariants?
  • do reconciled totals tie back to source within tolerance?
  • are identity merges explainable?
  • what percentage of records are unresolved or low confidence?
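Those semantic tests can run as ordinary assertions in the delivery pipeline. The record shapes, tolerance, and invariant below are illustrative assumptions; the point is that meaning-level checks are executable, not just documented.

```python
# Semantic tests as executable checks, alongside the usual
# null/row-count data quality tests.

def check_lifecycle_invariant(records):
    """Business invariant: no account is 'active' before it converted.
    Returns the violating records for triage."""
    return [r for r in records
            if r["state"] == "active" and r["activated"] < r["converted"]]

def check_totals_tie_back(source_total, reconciled_total, tolerance=0.01):
    """Reconciled revenue must tie back to source within tolerance."""
    return abs(source_total - reconciled_total) <= tolerance * source_total

def unresolved_rate(records):
    """Share of records reconciliation could not resolve."""
    return sum(1 for r in records if r.get("unresolved")) / len(records)

# Timestamps simplified to integers for the sketch.
records = [
    {"state": "active", "converted": 1, "activated": 2, "unresolved": False},
    {"state": "active", "converted": 5, "activated": 3, "unresolved": True},
]
violations = check_lifecycle_invariant(records)
```

Wiring checks like these into CI for mapping and reconciliation code is what keeps governance inside build pipelines rather than policy documents.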

Latency and consistency

Real-time integration via Kafka is attractive, but not every reconciled view can or should be real time. Some require event-time correction, late-arriving data handling, or period-close controls. Fast wrong answers are still wrong.

Exception handling

Some records cannot be reconciled automatically. Good architecture plans for human workflow, not silent drops. The unresolved queue is part of the system.

Tradeoffs

There is no free lunch here. Domain mapping topology trades one kind of complexity for another.

What you gain

  • clearer ownership of meaning
  • less semantic pollution in source systems
  • more resilient cross-domain integration
  • progressive migration path
  • better trust through explicit reconciliation

What you pay

  • more moving parts
  • duplicated concepts across contexts
  • need for stronger metadata discipline
  • higher upfront design effort in mappings
  • governance that cannot be outsourced to tooling

Some leaders resist this because it looks messier than a canonical enterprise model. They are not entirely wrong. It is messier on paper. But it is often cleaner in operation because it mirrors the business more truthfully.

A canonical model can make architecture diagrams look crisp while making delivery and analytics miserable. I would rather have an architecture that admits the enterprise is plural than one that lies with straight lines.

Failure Modes

This pattern can fail, and it often does in predictable ways.

1. The mapping layer becomes a shadow monolith

If every translation for every use case lands in one giant central team and one sprawling transformation codebase, you have simply rebuilt the enterprise canonical layer under a different name.

2. Reconciliation becomes a dumping ground

When unresolved semantic disagreements are pushed downstream “for later,” the reconciliation zone turns into a swamp of ad hoc rules and exceptions nobody understands.

3. Domains publish low-quality products

Autonomy without discipline produces event streams full of internal implementation detail, missing business definitions, and unstable identifiers. Consumers then rebuild semantics privately.

4. Overreliance on MDM

MDM can help with identifiers and survivorship, but it is not a substitute for domain modeling. One golden record rarely captures the many valid perspectives of a complex enterprise.

5. Governance becomes ceremony

If contract reviews, glossaries, and stewardship processes are detached from delivery teams, they become shelfware. Semantics must live in build pipelines and release practices, not just policy documents.

6. The enterprise view is mistaken for operational authority

A reconciled projection built for analytics gets reused in operational processes and slowly starts acting as the real master. This is dangerous. Reconciled views are often lagged, purpose-built, and lossy outside their intended use.

When Not To Use

This approach is not always necessary.

Do not use a full domain mapping topology if:

  • the business domain is simple and semantics are genuinely stable
  • the organization is small enough that one model is still socially manageable
  • there are only a handful of systems with minimal overlap
  • the primary need is straightforward reporting from one dominant source
  • the cost of semantic reconciliation exceeds the business value

A mid-sized company with one ERP, one CRM, and one warehouse may not need this. A clean dimensional model and sensible data contracts could be enough.

Likewise, if your biggest problem is basic data hygiene—missing keys, broken pipelines, no ownership—do not leap to a grand semantic architecture. First learn to keep the pipes from leaking.

Related Patterns

Several related patterns fit naturally alongside this approach.

Data mesh

Useful when interpreted narrowly: domain ownership of data products. Dangerous when interpreted lazily: every team publishes data and the market sorts it out. Mesh needs semantic discipline or it becomes distributed confusion.

Anti-corruption layer

Directly relevant. In DDD terms, the mapping layer between bounded contexts is an anti-corruption layer. It protects local models from foreign concepts while enabling exchange.

Event-driven architecture

Kafka and event streaming work well as transport and decoupling mechanisms. But event-driven systems need semantic governance just as much as batch systems do. A bad event is simply a bad integration that arrives faster.

MDM

Helpful for identity and survivorship in some domains, especially party, product, and location. Not enough on its own for lifecycle semantics or cross-process truth.

Canonical data model

Still useful in narrow integration corridors where semantics are genuinely shared and stable. Bad as a universal ambition. Good servant, terrible religion.

Lakehouse medallion layers

Bronze, silver, gold can be useful physically, but they are not semantic architecture. Many teams mistake layered storage refinement for semantic resolution. It is not the same thing.

Summary

The hard part of data platforms is semantics because semantics is where the enterprise reveals its fractures.

Different domains see the world differently because they do different work. A data platform that tries to erase those differences usually produces a brittle, abstract, untrusted mess. A platform that ignores them produces chaos at scale. The better path is domain mapping topology: bounded contexts at the source, explicit mappings across domains, reconciled truth for specific enterprise purposes, and a progressive strangler migration away from accidental integration.

This is deeply domain-driven design thinking, applied where many enterprises need it most. Not in service boundaries alone, but in the meaning of the data that flows between them.

The architecture is not glamorous. It asks for discipline, humility, and tolerance for plural truths. It requires real reconciliation, not wishful standardization. It demands that Kafka topics, microservices, warehouses, and governance processes all serve semantic clarity rather than perform modernity.

But when it works, something important happens.

People stop asking whether the platform has all the data.

They start trusting what the data means.

And in enterprise architecture, that is about as close to victory as it gets.

Frequently Asked Questions

What is a data mesh?

A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.

What is a data product in architecture terms?

A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.

How does data mesh relate to enterprise architecture?

Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.