Your Data Platform Is a Composition Problem


Most data platforms fail for a boring reason dressed up as a technical one. We call it integration. We buy tools for it. We stand up programs around it. We make slideware with arrows crossing the page like an airline route map. Then, six months later, nobody trusts the numbers, every dashboard has a footnote, and the platform team becomes a clearinghouse for disputes they can’t actually settle.

That is not a storage problem. It is not primarily a pipeline problem. It is not even, in most enterprises, a scale problem.

It is a composition problem.

The hard part of a modern data platform is not moving bytes from systems of record into a lakehouse, warehouse, or Kafka backbone. The hard part is assembling coherent domain meaning from fragmented operational behavior. One service says “customer,” another says “account,” a third says “party,” and finance quietly maintains a spreadsheet because none of them match the contract used for billing. Technology didn’t create that mess. Technology merely made it visible at speed.

This is where domain assembly topology becomes useful. The name sounds grander than it is. It simply means treating the platform as a deliberate composition of domain semantics, ownership boundaries, and reconciliation flows rather than as a generic data ingestion machine. You stop asking, “How do we centralize all data?” and start asking, “Where is meaning created, where is it refined, where is it assembled, and who is allowed to declare it authoritative?”

That shift matters because enterprises do not run on raw events. They run on assembled meaning.

Context

The last decade gave us a familiar sequence. First came the enterprise data warehouse. Then came the data lake. Then lakehouse patterns, data mesh, event streaming, operational analytics, and every vendor variation of “real-time intelligence.” Each wave fixed a real weakness in the previous one. Warehouses were too rigid. Lakes became swamps. Central teams became bottlenecks. Distributed teams created entropy. Streaming promised freshness and often delivered faster inconsistency.

Underneath all that churn was a stable truth: large organizations are made of domains, not datasets.

Sales, claims, billing, logistics, underwriting, retail banking, customer support, manufacturing planning—these are not merely departmental labels. They are domains with their own language, state transitions, incentives, and error tolerance. In Domain-Driven Design terms, they carry bounded contexts. A “policy” in insurance underwriting is not the same thing as a “policy” in compliance. An “order” in e-commerce checkout is not the same as an “order” in warehouse fulfillment. Pretending otherwise creates fragile integration and fake standardization.

A data platform that ignores bounded contexts becomes a giant semantic blender. Everything goes in. Nothing comes out clean.

The better architecture accepts that domain boundaries are not obstacles to integration. They are the only sane foundation for it.

Problem

Many platform programs are still organized around a hidden fantasy: if we can centralize enough data and apply enough transformation, we can derive a single, clean enterprise truth. The fantasy survives because it sounds executive-friendly. It promises harmonization, governance, and lower cost. But in practice it usually produces one of three outcomes.

The first is semantic erosion. As data crosses ingestion pipelines, CDC connectors, ETL jobs, Kafka topics, and warehouse models, it loses the context that gave it meaning. Source fields become opaque attributes. Business states become timestamped rows. Exceptions become nulls. The platform knows more data than the business can explain.

The second is premature canonicalization. An architecture team tries to define an enterprise canonical model before understanding how domains actually behave. The model becomes either too abstract to be useful or too opinionated to survive contact with reality. Teams comply on paper and route around it in code.

The third is centralized bottlenecking. Every cross-domain question lands on a central data engineering team. They become the translator, referee, and emergency plumber for every semantic dispute. Delivery slows. Trust drops. Shadow platforms appear.

The common mistake is treating integration as a plumbing concern when it is really a concern of assembly. Data from operational systems is not enterprise information until someone resolves identity, state, timing, ownership, and business intent.

That resolution is architecture.

Forces

A serious architecture has to account for the forces pulling the platform in different directions. Ignore them and the design will look elegant in diagrams and miserable in production.

1. Domains create meaning locally

The operational truth about a shipment belongs first to logistics. The truth about invoice status belongs first to billing. The truth about customer risk belongs first to risk management. These truths are local before they are enterprise-wide.

This is why domain-oriented data products matter. Not because “data product” is fashionable language, but because local ownership is the only place where semantics can be maintained with discipline.

2. Enterprise use cases demand composition

Executives do not ask domain-local questions. They ask assembled ones.

  • Which customers are profitable after claims cost, support cost, and retention spend?
  • Which suppliers are causing margin leakage across procurement, manufacturing, and returns?
  • Which patients are high risk based on appointments, prescriptions, billing, and care history?

Those are composition questions. They require multiple bounded contexts to be assembled without flattening them beyond recognition.
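
A toy sketch makes the point concrete. All figures, keys, and field names here are invented; the shape is what matters: the enterprise answer is a composition of domain-local facts, and the formula itself needs an owner.

```python
# Hypothetical composition of customer profitability from three domain
# products. Values and customer IDs are illustrative only.
revenue = {"C1": 1200.0, "C2": 300.0}        # from billing
claims_cost = {"C1": 400.0}                   # from claims
support_cost = {"C1": 150.0, "C2": 500.0}     # from support

def profitability(customer_id: str) -> float:
    """Assemble an enterprise answer from domain-local facts."""
    return (revenue.get(customer_id, 0.0)
            - claims_cost.get(customer_id, 0.0)
            - support_cost.get(customer_id, 0.0))

print(profitability("C1"))  # 650.0
print(profitability("C2"))  # -200.0
```

The join is trivial; the decision about which domain supplies which term, and what a missing value means, is the real architecture.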

3. Time is not consistent across systems

Operational systems disagree not just in format but in timing. One system emits events instantly. Another updates nightly. A third backfills corrections once a week. If your topology assumes synchronized truth, it will lie confidently.

4. Microservices increase semantic fragmentation

Microservices can improve team autonomy, but they also splinter what used to be one database worldview into many service-local realities. Kafka helps distribute facts; it does not automatically reconcile them. In fact, event-driven architectures can make inconsistency arrive faster.

5. Governance has to scale without centralizing every decision

Security, privacy, retention, lineage, and policy cannot be optional. But if governance requires every schema change to be approved by a central committee, the business will outrun the platform.

6. Migrations are unavoidable

No enterprise starts clean. You inherit mainframes, ERP packages, SaaS platforms, shadow spreadsheets, MDM tools, cron jobs, and twenty years of naming accidents. Architecture has to explain how to move from there to something better without betting the company on a single cutover.

These forces make one thing clear: the platform must support both distributed semantic ownership and deliberate enterprise assembly.

Solution

The pattern I recommend is domain assembly topology.

In plain language: organize the data platform into layers of semantic responsibility.

  1. Source-aligned domain data products preserve operational meaning close to where it is created.
  2. Assembly services or assembly data products combine domain outputs into enterprise concepts for specific decision areas.
  3. Reconciliation capabilities explicitly manage disagreement, lag, correction, and identity resolution.
  4. Governance and platform services provide shared controls, discoverability, contracts, lineage, and runtime standards without dictating all semantics from the center.

This is not a return to the old central warehouse team with better branding. It is also not a naive interpretation of data mesh where every team publishes whatever they like and hopes the marketplace sorts it out.

The critical idea is that enterprise truth is assembled, not harvested.

A customer 360, for example, is almost never a source system artifact. It is an assembly. It might draw identity from CRM, billing status from finance, service history from support, consent flags from privacy systems, and digital engagement from web analytics. Pretending one source “owns” the enterprise customer truth is usually nonsense. But pretending nobody does is equally bad. The assembled concept needs an owner, a contract, and reconciliation rules.

That owner should not be a generic platform team. It should be the team responsible for the business capability that depends on the assembled concept.

That is domain-driven design applied to the data platform: align semantic models and ownership with business capabilities, not technology layers.

Architecture

At its core, domain assembly topology has four layers.


1. Operational systems and event sources

These include transactional databases, SaaS systems, legacy applications, and service events. In a microservices estate, Kafka often sits here as the event backbone, carrying facts from service boundaries into the broader platform. CDC may also be used where event publication is weak or absent.

The rule: ingest with minimal semantic distortion. Preserve source context, timestamps, keys, change types, and provenance.
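
One way to honor that rule is an ingestion envelope that wraps the source record without reshaping it. This is a minimal sketch; the class and field names are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass(frozen=True)
class IngestedRecord:
    """Carries a source record plus the provenance the platform must preserve."""
    source_system: str        # e.g. "billing-db" (illustrative)
    source_key: str           # the source's own primary key
    change_type: str          # "insert" | "update" | "delete"
    event_time: datetime      # when the fact occurred at the source
    ingested_at: datetime     # when the platform received it
    payload: dict[str, Any] = field(default_factory=dict)  # untouched source fields

rec = IngestedRecord(
    source_system="billing-db",
    source_key="INV-1001",
    change_type="update",
    event_time=datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc),
    ingested_at=datetime.now(timezone.utc),
    payload={"status": "PAID", "amount": 120.0},
)
print(rec.payload["status"])  # PAID
```

The payload stays in the source's own vocabulary; translation happens later, in the domain product, where the semantics are owned.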

2. Domain data products

These are curated, source-aligned representations managed by domain teams or platform teams embedded with them. They are not raw dumps, and they are not enterprise-wide canonical models. They should expose meaningful business entities, events, and states as understood within that bounded context.

Examples:

  • Billing exposes invoice lifecycle and payment status.
  • Claims exposes claim registration, adjudication, reserve changes, and settlement.
  • Logistics exposes shipment milestones and exceptions.
  • Customer support exposes case lifecycle and sentiment annotations.

A domain product should be opinionated enough to be useful and narrow enough to remain truthful.

3. Assembly products and services

This is where cross-domain meaning is constructed. An assembly layer might produce:

  • Customer profitability
  • Policy risk exposure
  • Supplier performance scorecards
  • Omnichannel order health
  • Enterprise inventory availability

An assembly is not just a SQL join with a nicer name. It requires explicit business logic for identity matching, conflict resolution, latency handling, state derivation, and exception management.
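
To see the difference between a join and an assembly, consider state derivation. A hypothetical sketch, assuming an order whose local states come from checkout, fulfillment, and returns; the precedence list is an invented business rule, not a standard:

```python
# Collapse conflicting local states into one enterprise state using
# explicit precedence, not a silent join. Precedence is illustrative.
STATE_PRECEDENCE = ["CANCELLED", "RETURNED", "DELIVERED", "SHIPPED", "PLACED"]

def enterprise_order_state(local_states: list[str]) -> str:
    """First matching state in precedence order wins; unknown states fail loudly."""
    for state in STATE_PRECEDENCE:
        if state in local_states:
            return state
    raise ValueError(f"unmapped states: {local_states}")

# checkout says PLACED, fulfillment says SHIPPED, returns says RETURNED
print(enterprise_order_state(["PLACED", "SHIPPED", "RETURNED"]))  # RETURNED
```

The precedence list is the point: it is business logic someone must own and defend, not an accident of join order.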

In some cases this is best implemented as data models in the warehouse or lakehouse. In others, especially when operational consumers need current state, it belongs in stream processing or a dedicated service using Kafka, Flink, Spark Structured Streaming, or similar tools. The point is not the tool. The point is the semantic responsibility.

4. Shared platform capabilities

A platform still matters. A lot. But its job changes. It should provide:

  • storage and compute
  • event transport
  • schema registry and contract validation
  • data catalog and lineage
  • policy enforcement
  • quality rules frameworks
  • observability
  • identity and access controls
  • lifecycle tooling

What it should not do is quietly become the owner of every enterprise concept.

Here is a more detailed view.

[Diagram: shared platform capabilities]

Domain semantics and reconciliation

This is the part architects often hand-wave. Don’t.

Cross-domain assembly fails not because joins are difficult, but because meanings diverge.

Take something as simple as “active customer.” Sales may define it as any account with an open opportunity in the last 12 months. Billing may define it as any party with an invoice in the last 90 days. Support may define it as anyone with an active entitlement. Marketing may define it as any contact who has engaged digitally.

All of those can be valid inside their own bounded contexts. None of them is automatically wrong. The mistake is forcing them into a single universal definition too early.
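
Assembling without flattening can be as simple as keeping every context's verdict side by side. A sketch with invented thresholds and field names:

```python
from datetime import date

# Hypothetical per-context predicates; windows are illustrative.
def active_for_sales(last_opportunity: date, today: date) -> bool:
    return (today - last_opportunity).days <= 365

def active_for_billing(last_invoice: date, today: date) -> bool:
    return (today - last_invoice).days <= 90

def active_flags(customer: dict, today: date) -> dict:
    """Assemble without flattening: every context keeps its own verdict."""
    return {
        "sales": active_for_sales(customer["last_opportunity"], today),
        "billing": active_for_billing(customer["last_invoice"], today),
    }

today = date(2024, 6, 1)
c = {"last_opportunity": date(2023, 9, 1), "last_invoice": date(2024, 1, 5)}
print(active_flags(c, today))  # {'sales': True, 'billing': False}
```

A customer can be active to sales and dormant to billing at the same time, and the assembled view should say so rather than pick a winner prematurely.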

Reconciliation is how you keep the distinctions visible while still assembling useful enterprise views. It usually includes:

  • Identity reconciliation: deciding whether records from multiple systems represent the same business entity.
  • Temporal reconciliation: aligning facts with different event times, processing times, and correction windows.
  • State reconciliation: deriving enterprise state from conflicting or incomplete local states.
  • Rule reconciliation: applying business precedence, confidence scores, survivorship rules, or manual exception handling.

This deserves explicit architecture. Reconciliation is not a side effect of ETL.
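
As one small illustration of rule reconciliation, here is a survivorship sketch for a contact field. The precedence ranking and confidence formula are assumptions invented for the example; real survivorship rules come from the business, not from code.

```python
# Pick a contact value across domain products using source precedence,
# recording which source won and how much the others agreed.
PRECEDENCE = ["crm", "billing", "digital"]  # assumed business ranking

def survive(candidates: dict) -> dict:
    """candidates maps source name -> proposed value (or None)."""
    for source in PRECEDENCE:
        value = candidates.get(source)
        if value:
            others = [v for s, v in candidates.items() if s != source and v]
            agree = sum(1 for v in others if v == value)
            confidence = (1 + agree) / (1 + len(others)) if others else 1.0
            return {"value": value, "source": source, "confidence": confidence}
    return {"value": None, "source": None, "confidence": 0.0}

result = survive({"crm": "a@x.com", "billing": "a@x.com", "digital": "b@x.com"})
print(result["value"], result["source"], round(result["confidence"], 2))  # a@x.com crm 0.67
```

Note what the output carries besides the value: the winning source and a confidence score. That lineage is what lets a consumer ask "why this number?" later.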

If you use Kafka, this often means separating domain event streams from assembly streams. Let billing publish billing facts. Let support publish support facts. Then let an assembly capability consume them, correlate them, and publish an assembled view with clear lineage. Do not ask every producing service to emit an enterprise-ready customer truth. That is how service boundaries get polluted and teams lose autonomy.
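
The correlation step in the middle can be sketched without any broker. Here the domain streams are plain lists standing in for Kafka topics; all event shapes and field names are invented for illustration:

```python
# Sketch of an assembly step consuming separate domain streams and
# producing an assembled view with lineage. In practice the inputs
# might be Kafka topics; lists stand in for them here.
billing_events = [
    {"customer_id": "C1", "invoice_status": "PAID"},
    {"customer_id": "C2", "invoice_status": "OVERDUE"},
]
support_events = [
    {"customer_id": "C1", "open_cases": 2},
]

def assemble(billing, support):
    """Correlate by customer_id; record which domains contributed."""
    view = {}
    for e in billing:
        row = view.setdefault(e["customer_id"], {"sources": set()})
        row["invoice_status"] = e["invoice_status"]
        row["sources"].add("billing")
    for e in support:
        row = view.setdefault(e["customer_id"], {"sources": set()})
        row["open_cases"] = e["open_cases"]
        row["sources"].add("support")
    return view

assembled = assemble(billing_events, support_events)
print(sorted(assembled["C1"]["sources"]))  # ['billing', 'support']
```

The producing services never learn about each other; the assembly capability carries the correlation logic and publishes the result under its own contract.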

Migration Strategy

Most firms cannot jump straight into a domain assembly topology. They have a legacy warehouse, brittle ETL, duplicated semantics, and a dozen “gold” tables that are only gold because nobody dares touch them.

So migrate progressively. Use a strangler approach.

Start by identifying one high-value assembled concept where the current platform is painful and politically visible. Customer 360 is common. Claims exposure in insurance. Revenue leakage in telecom. Product availability in retail.

Then:

  1. Map current semantic sources. Document where the concept is partially defined today. Not just systems, but reports, spreadsheets, operational workarounds, and manual reconciliations. This is where the real truth is hiding.

  2. Establish source-aligned domain products. For the participating domains, create stable, governed outputs that preserve local semantics and provenance.

  3. Build a parallel assembly. Stand up the new assembly product beside the legacy warehouse model. Do not cut over immediately. Run both.

  4. Measure divergence. Reconcile the new assembly with the old outputs. Expect discrepancies. Those discrepancies are architecture feedback, not project failure.

  5. Introduce exception workflows. Some conflicts cannot be algorithmically resolved. Route them to business stewards where needed. You need a humane operating model, not just code.

  6. Shift consumers gradually. Move dashboards, analytical products, and downstream integrations one by one. Keep lineage visible so people know what changed.

  7. Retire legacy transformations in slices. Remove old jobs only once consumers have moved and reconciliation confidence is acceptable.
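
The divergence measurement can start embarrassingly simple. A sketch, assuming both the legacy output and the new assembly can be keyed the same way; keys, values, and tolerance are illustrative:

```python
# Compare the new assembly to the legacy output row by row and report
# a discrepancy rate. A rising rate is feedback, not failure.
legacy = {"C1": 650.0, "C2": -200.0, "C3": 80.0}
new_assembly = {"C1": 650.0, "C2": -180.0, "C3": 80.0}

def divergence(old: dict, new: dict, tolerance: float = 0.01) -> dict:
    keys = set(old) | set(new)
    mismatched = [k for k in keys
                  if k not in old or k not in new
                  or abs(old[k] - new[k]) > tolerance]
    return {"checked": len(keys),
            "mismatched": sorted(mismatched),
            "rate": len(mismatched) / len(keys)}

report = divergence(legacy, new_assembly)
print(report["mismatched"], round(report["rate"], 2))  # ['C2'] 0.33
```

Each mismatch then gets a disposition: legacy bug, new-assembly bug, or legitimate semantic difference. The third category is where the learning happens.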

A migration diagram makes the pattern clearer.

[Diagram 3: staged migration, with legacy and new assemblies running in parallel]

The key is progressive replacement, not heroic rewrite. A platform migration that demands semantic perfection before use will stall. One that embraces staged reconciliation can move while learning.

Enterprise Example

Consider a large insurer operating across personal auto, home, and commercial lines. Over the years it accumulated:

  • a policy administration platform for each line of business
  • a central claims system
  • separate billing engines due to acquisitions
  • CRM for agents and service reps
  • a customer portal with its own profile store
  • Kafka-based microservices for digital quote and bind journeys
  • a warehouse feeding finance and actuarial reports

Leadership wanted a “customer and policy 360.” They already had one in name. In reality they had five competing versions:

  • underwriting’s version centered on named insureds
  • billing’s version centered on account holders
  • claims’ version centered on claimants
  • CRM’s version centered on households and contacts
  • digital’s version centered on authenticated portal identities

The old instinct was to create a canonical customer model in the enterprise warehouse and force all feeds to map into it. They tried that. It failed in the usual way. Every domain had exceptions. Household relationships changed by product line. Commercial policies had entities that were not natural persons. Claims involved third parties with no billing relationship. The canonical model became an argument preserved in DDL.

A better approach emerged.

Each domain published a source-aligned data product:

  • policy domain exposed policy lifecycle, insured parties, coverages, and agents
  • claims domain exposed claim lifecycle, participants, reserves, and settlements
  • billing domain exposed accounts, invoices, delinquencies, and payment behavior
  • CRM exposed contacts, households, preferences, and interaction history
  • digital domain exposed authenticated sessions, portal actions, and consent records

A separate assembly team owned customer relationship assembly as a business capability, not as a generic data function. They built:

  • identity resolution across party, household, account, and portal identities
  • a temporal model handling late-arriving claims and billing corrections
  • survivorship rules for contact details
  • confidence scoring for matches
  • exception queues for ambiguous commercial relationships

Kafka was used to stream high-value updates from digital, billing, and policy services, while batch and CDC fed slower legacy systems. The assembled customer relationship view was published both into the analytical platform and as an operational API for call-center applications.

What changed was not just architecture. It was the conversation.

Instead of endless debate over “the one true customer table,” teams could ask:

  • which domain owns this fact?
  • how is it reconciled into the assembly?
  • what confidence do we have in the match?
  • what is the latency and correction window?
  • which consumer needs local truth versus assembled truth?

That is a healthier enterprise language. It lowers the temperature and raises the precision.

Operational Considerations

Good architecture dies in operations if you ignore the mechanics.

Data contracts

Every domain product needs explicit contracts: schema, semantics, SLAs, quality thresholds, retention, and change policy. Event contracts in Kafka matter even more because casual schema evolution can break downstream assembly logic in subtle ways.
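
In its smallest form, a contract check is just an explicit expectation the producer and consumer both see. Real platforms might delegate this to a schema registry with Avro or JSON Schema; this sketch, with invented fields and statuses, only illustrates the shape.

```python
# Minimal contract check for a billing domain product record.
CONTRACT = {
    "invoice_id": str,
    "status": str,
    "amount": float,
}
ALLOWED_STATUS = {"OPEN", "PAID", "OVERDUE"}

def validate(record: dict) -> list[str]:
    """Return a list of contract violations; empty means compliant."""
    errors = []
    for field_name, field_type in CONTRACT.items():
        if field_name not in record:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], field_type):
            errors.append(f"wrong type for {field_name}")
    if record.get("status") not in ALLOWED_STATUS:
        errors.append(f"unknown status: {record.get('status')}")
    return errors

print(validate({"invoice_id": "INV-1", "status": "PAID", "amount": 10.0}))  # []
print(validate({"invoice_id": "INV-2", "status": "LOST"}))
```

The enumerated statuses matter as much as the types: the day a producer adds a new status, the contract should fail loudly at the boundary instead of silently corrupting assembly logic downstream.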

Observability

You need pipeline and semantic observability. Not just whether the job ran, but whether reconciliation rates changed, match confidence dropped, nulls spiked, or latency breached acceptable windows.
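
A semantic check can be as blunt as comparing today's reconciliation metrics against a baseline. The metric names and thresholds below are assumptions for illustration:

```python
# Alert when match confidence drops or null rates spike relative to a
# baseline. Thresholds are illustrative and should be tuned per product.
def semantic_alerts(metrics: dict, baseline: dict,
                    confidence_drop: float = 0.05,
                    null_spike: float = 0.02) -> list[str]:
    alerts = []
    if baseline["match_confidence"] - metrics["match_confidence"] > confidence_drop:
        alerts.append("match confidence dropped")
    if metrics["null_rate"] - baseline["null_rate"] > null_spike:
        alerts.append("null rate spiked")
    return alerts

baseline = {"match_confidence": 0.92, "null_rate": 0.01}
todays = {"match_confidence": 0.84, "null_rate": 0.05}
print(semantic_alerts(todays, baseline))
```

A job that "succeeded" while match confidence fell eight points is a semantic incident, even though no pipeline alarm fired.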

Lineage

Lineage should show not only physical data movement but semantic derivation. Consumers must be able to trace an assembled field back to source domains and reconciliation rules.

Quality and exception handling

Not every discrepancy is a defect. Some are valid business disagreement. Quality rules should distinguish malformed data from unresolved semantics. Exception queues are not glamorous, but they are often where trust is won.

Security and privacy

Assembled products often create new privacy risk because they concentrate sensitive information. Consent, purpose limitation, masking, and row-level access matter more in the assembly layer than in isolated source products.

Cost management

Assembly products can multiply storage and compute if every use case creates its own derivative copy. Be disciplined. Assemble for durable business capabilities, not every temporary analysis.

Tradeoffs

No architecture worth using comes without tradeoffs.

The biggest gain here is semantic clarity and scalable ownership. The cost is that you no longer get to pretend integration is simple. You have to model reconciliation explicitly. You need stronger product ownership. You need governance that is federated but real.

There is also a tension between domain autonomy and enterprise consistency. Push too far toward autonomy and your platform becomes a bazaar of incompatible products. Push too far toward central consistency and you recreate a warehouse dictatorship with a modern UI.

Another tradeoff is speed versus confidence. Streaming assembly through Kafka can produce fast enterprise views, but some domains only settle truth after batch corrections or human review. Real-time is useful; fake certainty is not.

And there is a modeling tradeoff: source-aligned products preserve local truth, while assembled products increase enterprise usability. You need both. Too much emphasis on source alignment leaves business users doing the integration themselves. Too much emphasis on assembly risks hiding nuance and overfitting to current use cases.

Failure Modes

I see the same failure modes repeatedly.

1. Canonical model overreach

Teams define an enterprise-wide canonical model before they understand the domains. It becomes either an empty abstraction or a bureaucratic straitjacket.

2. Platform team semantic capture

The central platform team starts owning enterprise concepts because “someone has to.” Eventually they become the bottleneck and the semantic debt sink.

3. Event-driven confusion

Kafka topics are mistaken for business truth. Raw events are useful, but they are not assembled meaning. Consumers then reinvent reconciliation in ten different ways.

4. Hidden manual reconciliation

The formal platform says one thing; operations teams quietly maintain spreadsheets and service desk procedures to correct reality. If you don’t surface those manual steps, your architecture is fiction.

5. No exception model

Architects assume every conflict can be solved by deterministic rules. It can’t. Ambiguity exists. Your design needs a place for it to live.

6. Consumer-blind assembly

Assembly products are created without a clear decision context. The result is either too generic to help or too custom to reuse.

A useful test is simple: when numbers disagree, can your architecture explain why in business terms? If not, it is not mature.

When Not To Use

Domain assembly topology is not a universal hammer.

Do not use it if your landscape is small, your semantics are simple, and a conventional analytical model will do. A mid-sized SaaS company with one main application database and a handful of reporting needs probably does not need a formal assembly layer.

Do not over-engineer it for exploratory analytics. Some questions are temporary and should remain temporary. Not every cross-domain query deserves a durable product.

Avoid it when domain ownership is fictional. If the organization has not actually empowered teams to own data semantics and quality, then a distributed model will devolve into finger-pointing. In that case, first fix operating model and accountability.

And do not force assembly for highly standardized domains where canonical standards already work well, such as some financial ledger structures or regulated reporting taxonomies. There the semantics may be stable enough that a shared model is the right answer.

The pattern is most valuable where multiple bounded contexts must be composed repeatedly for meaningful enterprise decisions.

Related Patterns

This approach sits near several other useful patterns.

  • Data mesh: domain data products and federated governance are aligned with this topology, but domain assembly adds a clearer stance on cross-domain composition.
  • CQRS and materialized views: assembly products often behave like enterprise materialized views derived from multiple command-side domains.
  • Event sourcing: useful when reconstructing state histories, but not sufficient by itself for cross-domain reconciliation.
  • MDM: master data management can help with identity and survivorship, but it should serve assembly, not replace all domain semantics.
  • Strangler fig migration: the right migration approach for moving off monolithic warehouse logic and brittle integration layers.
  • Bounded contexts from Domain-Driven Design: the conceptual foundation. Without them, assembly becomes accidental and political.

If you like a cleaner slogan, here it is: bounded contexts produce facts; assembly produces enterprise meaning.

Summary

A modern data platform should not be built as a giant central bucket with transformation attached. That model collapses under the weight of enterprise semantics. Nor should it devolve into a free-for-all of independently published data products with no serious composition logic.

The better path is to recognize that the platform’s main job is assembly.

Use domain-driven design to preserve local meaning inside bounded contexts. Create source-aligned domain data products owned by the teams closest to the business reality. Build explicit assembly products for enterprise concepts that matter to decisions. Treat reconciliation as a first-class architectural capability, not an ETL afterthought. Use Kafka and microservices where they help propagate facts quickly, but do not confuse velocity with truth. Migrate with a strangler approach, running parallel assemblies and measuring divergence until confidence justifies cutover.

The memorable line is this: your enterprise does not run on data; it runs on composed meaning.

Architect for that, and the platform becomes a system of understanding.

Ignore it, and you will keep building faster pipes into semantic chaos.

Frequently Asked Questions

What is a data mesh?

A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.

What is a data product in architecture terms?

A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.

How does data mesh relate to enterprise architecture?

Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.