A data product becomes real the moment somebody makes a decision with it and suffers when it is wrong.
That is the line many organizations miss.
They launch a “customer 360” dataset, a “gold” sales mart, a “trusted” inventory feed, and speak about them with the language of platforms and products. Yet ask a simple operational question — what happens if this is late, partial, duplicated, stale, or semantically inconsistent? — and the room goes quiet. There is no service level objective, no ownership model, no failure budget, no reconciliation discipline, no explicit topology of dependencies. There is, in truth, no product. There is only a promise.
And enterprise landscapes are full of such promises. A Kafka topic said to be canonical but with no schema governance. A lakehouse table consumed by finance and supply chain with no freshness commitment. A microservice exposing an API that is “usually correct” because it joins four upstream systems and hopes they agree. We call these data products because the term sounds modern. But if the reliability contract is absent, the architecture is decorative. The business still runs on rumor.
This is where reliability topology matters. Reliability is not a nice property attached to a single dataset or endpoint. It is a shape. It flows through domains, dependencies, event streams, APIs, batch windows, reference data, reconciliation loops, and human operating procedures. If you do not model that shape explicitly, the organization will discover it the hard way: in quarter close, in a customer compensation event, in a regulatory filing, or in the warehouse at 3 a.m. when inventory says one thing and the shelves say another.
The central argument is simple and opinionated: a data product should be treated as a domain asset with explicit semantics and explicit service guarantees, designed within a reliability topology that makes dependency risk visible and governable. Without that, “data mesh”, “streaming platform”, and “self-serve analytics” are just better packaging for old ambiguity.
Context
Most enterprises now operate with some mix of transactional systems, microservices, event streams, analytical stores, and machine learning features. They also carry history. ERP platforms from another era. CRM customizations no one wants to touch. Mainframe records that remain, stubbornly, the legal system of record. Around this, teams build new services and data pipelines, often with Kafka as the event backbone and cloud warehouses or lakehouses as the analytical substrate.
In this environment, the phrase data product emerged for good reason. It was a reaction against the anonymous swamp of centrally owned datasets with unclear meaning and no accountable team. Domain-driven design influenced the conversation. If a domain team owns Order, Customer, Inventory, Policy, Claim, Account, or Shipment as a bounded context, then perhaps it should also own the operational and analytical data assets that represent those concepts. The idea is healthy. It aligns ownership with semantics.
But ownership alone is not enough.
A bounded context tells you what a term means inside a model. It does not tell you what reliability a downstream consumer can expect. “Order Shipped” may be a meaningful event in Fulfillment, but if it arrives out of sequence, without a guarantee on latency, and without a correction protocol when warehouse scans are replayed, then downstream Billing and Customer Service cannot safely automate against it. They can consume it, certainly. They just cannot trust it.
That gap between semantic ownership and operational trust is the architecture problem.
Problem
Organizations often build data products as if publication were the finish line.
A team produces a topic, a table, or an API. It has fields. It may even have documentation. There is often a schema registry, perhaps some lineage metadata, and optimistic language about discoverability. What it usually lacks is a service contract robust enough for enterprise use.
Several symptoms appear again and again:
- Freshness is implied, not committed.
- Completeness is unknown.
- Semantics drift between producer and consumer.
- Backfills create duplicates or reorder events.
- Reconciliation is an emergency activity rather than a design feature.
- Multi-hop dependency chains make the downstream SLA impossible to infer.
- Incidents are treated as pipeline failures rather than product failures.
This is especially dangerous with event-driven architectures. Kafka is excellent at moving facts quickly. It is less excellent at magically creating shared meaning. Teams publish events named after business concepts, but many are really internal state transitions in disguise. Consumers assume business finality where only process progress exists. An event called InvoicePosted may mean “accepted by billing service,” while Finance hears “booked and final.” This is not a technical bug. It is a domain semantics bug, and those are the expensive ones.
A second problem is topology blindness. A data product that depends on one transactional service, one master-data feed, two derived Kafka streams, and a nightly correction job is not a single thing. It is a chain of failure opportunities. Yet many catalogs present it as a neat endpoint with an owner and a description. The consumer sees a product page. The operator sees a Rube Goldberg machine. The architecture should show the latter, not hide it.
Forces
A serious architecture has to work with the forces that shape the system, not the ones we wish we had.
1. Domain semantics are local; enterprise decisions are cross-domain
DDD teaches us to respect bounded contexts. Good. It prevents the false comfort of one giant enterprise data model. But executives, finance, operations, and regulators do not live within one bounded context. They ask cross-domain questions: margin by fulfilled order, stock on hand by promised shipment date, claims paid by policy status. Data products must preserve local semantics while still being safe to compose.
That means the interfaces between domains matter more than the internal elegance of any one model.
2. Reliability degrades across hops
If Customer Profile depends on CRM events, identity resolution, consent data, and preferences from a separate service, its actual reliability is constrained by the weakest important upstream, not the confidence of the publishing team. Latency compounds. Missing records compound. Ambiguity compounds.
A product with five upstream dependencies and no explicit decomposition of critical paths is a confidence trick.
3. Batch and streaming coexist
Most enterprises are not pure streaming shops. Nor should they be. Some domains need sub-second events. Others need daily completeness and carefully controlled reconciliation. A good reliability topology accepts both. It does not pretend every business truth should be represented as a Kafka topic. Sometimes a batch close is the right semantic boundary.
4. Corrections are normal
Late-arriving facts, source-system reversals, duplicate messages, and reference data changes are not edge cases. They are the weather. Architectures that treat correction flows as exceptions become brittle and political. Teams spend more time debating “source of truth” than designing correction paths.
5. The business funds outcomes, not elegance
You can build beautiful event streams and model every domain with care. If nobody can say whether the finance dashboard is complete by 7 a.m., or whether shipment ETA data is within 15 minutes for 95% of orders, then the architecture has failed in business terms.
Solution
The solution is to define data products as domain-owned, reliability-specified interfaces that sit within an explicit reliability topology.
That sounds formal. It should. This is where architecture earns its keep.
A proper data product needs five things:
- Clear domain semantics
  - What business concept does this represent?
  - In which bounded context is that meaning authoritative?
  - What does each key event or attribute actually mean?
- Consumer-oriented service guarantees
  - Freshness, availability, completeness, quality thresholds, and correction behavior.
  - Not all products need aggressive SLAs. But all serious products need explicit expectations.
- Topology visibility
  - Which upstream systems, streams, and jobs materially affect reliability?
  - Which dependencies are critical path versus enrichments?
- Reconciliation design
  - How are drift, duplicates, and late facts detected and repaired?
  - What is the periodic truth-restoring mechanism?
- Operational ownership
  - A named team with production duties, not just publishing rights.
  - If consumers page no one, they trust no one.
In practice, I recommend thinking of data products in three reliability classes:
- Operational products: power automation or customer-facing workflows. Need strict latency and correctness controls.
- Decision-support products: inform planning, reporting, dashboards. Need completeness and predictability more than ultra-low latency.
- Exploratory products: useful for analysis, experimentation, feature discovery. Can have weaker guarantees, but should say so plainly.
This classification is liberating. It prevents overengineering where a loose promise is acceptable, and underengineering where a business process is at stake.
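The five elements and the three reliability classes can be captured as a single declarative contract. A minimal Python sketch — every name, SLO string, and dependency identifier here is illustrative, not a real framework:

```python
from dataclasses import dataclass, field
from enum import Enum

class ReliabilityClass(Enum):
    OPERATIONAL = "operational"            # strict latency and correctness
    DECISION_SUPPORT = "decision-support"  # completeness and predictability first
    EXPLORATORY = "exploratory"            # best-effort, and says so plainly

@dataclass
class DataProductContract:
    """The five elements of a serious data product, as one record."""
    name: str
    owning_team: str                  # operational ownership: someone to page
    bounded_context: str              # where the semantics are authoritative
    reliability_class: ReliabilityClass
    freshness_slo: str                # consumer-oriented service guarantees
    completeness_slo: str
    correction_policy: str            # reconciliation design
    hard_dependencies: list = field(default_factory=list)  # topology visibility
    soft_dependencies: list = field(default_factory=list)

order_status = DataProductContract(
    name="order-status",
    owning_team="fulfillment-data",
    bounded_context="Fulfillment",
    reliability_class=ReliabilityClass.OPERATIONAL,
    freshness_slo="95% of updates visible within 5 minutes",
    completeness_slo="99.9% of source orders represented by 06:00",
    correction_policy="late facts back-applied within 24 hours",
    hard_dependencies=["oms.order-events", "wms.fulfillment-confirmations"],
    soft_dependencies=["crm.customer-segments"],
)
```

Whether this lives in code, a catalog, or a YAML file matters less than the fact that every field is filled in and reviewed.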
Architecture
A reliability topology is a map of how trust is produced, not just how data flows.
At its core, the pattern separates domain publication from enterprise consumption through contracts, reliability metadata, and reconciliation loops.
The crucial move is that an Order Status Data Product is not merely a joined table or topic. It is a managed interface with:
- semantic contract
- reliability objectives
- known dependency chain
- correction policy
- versioning policy
The producer team is not promising perfection. It is promising behavior under normal conditions and a defined response when reality is messy.
Domain semantics first
Before discussing pipelines, ask semantic questions.
What exactly is an order status? Is it a projection of the ordering process? A downstream best-known state across ordering and fulfillment? A customer-visible state? These are not interchangeable.
If you build an “enterprise order status” by casually combining microservice events, you are creating a new model. That model needs its own bounded context, or at least an explicit published language. Otherwise every consumer will interpret it through its original source meaning, and you will spend months in argument. Domain language decays faster than data.
A good data product specification should include:
- authoritative definition
- business invariants
- key lifecycle states
- acceptable ambiguity
- correction semantics
- known exclusions
Reliability as part of the interface
For each product, define service objectives in terms consumers care about:
- freshness: 95% of updates visible within 5 minutes
- completeness: daily close reaches 99.95% source-record completeness by 06:00
- accuracy proxy: schema-valid and referentially valid records above 99.9%
- correction: late facts are back-applied within 24 hours
- retention and replay guarantees
- incident communication expectations
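The freshness objective above is checkable mechanically. A sketch, assuming each record carries a source event time and a publish time (the names and sample figures are hypothetical):

```python
from datetime import datetime, timedelta

def freshness_slo_met(source_times, publish_times,
                      threshold=timedelta(minutes=5), target=0.95):
    """True if the share of records published within `threshold` of their
    source event time meets the SLO target (e.g. 95% within 5 minutes)."""
    pairs = list(zip(source_times, publish_times))
    within = sum(1 for src, pub in pairs if pub - src <= threshold)
    return within / len(pairs) >= target

base = datetime(2025, 1, 1, 12, 0)
sources = [base] * 4
publishes = [base + timedelta(minutes=m) for m in (1, 2, 3, 40)]
# Only 3 of 4 records land inside the 5-minute window: 75% < 95%
print(freshness_slo_met(sources, publishes))  # False
```

The point is not the arithmetic; it is that the objective is stated in consumer terms and evaluated continuously, not asserted once in a wiki.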
A useful line: availability without correctness is theater. A dashboard can be up while the numbers are wrong. An API can return instantly while being semantically stale. Reliability must cover the integrity of meaning, not just uptime.
Reconciliation is not optional
In enterprise systems, eventually consistent does not mean eventually correct by accident. It means correctness is restored by design.
Reconciliation should exist at multiple layers:
- stream-level duplicate and ordering checks
- aggregate-level count and value balancing
- business-level truth comparison against source systems of record
- exception queues for records needing manual or rule-based resolution
This combination of event flow plus periodic snapshot or CDC-based reconciliation is often the winning pattern. Streaming gives timeliness. Snapshot-based comparison gives confidence. Enterprises need both because business truth is rarely clean enough to trust either one alone.
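Aggregate-level balancing, the second layer above, can be as simple as comparing per-key counts in the published product against a periodic source snapshot. A hedged sketch with invented totals:

```python
def reconcile_counts(product_totals, snapshot_totals, tolerance=0):
    """Aggregate-level balancing: compare per-key record counts in the
    published product against a periodic snapshot of the system of record.
    Returns only the keys whose drift exceeds the tolerance."""
    drift = {}
    for key, expected in snapshot_totals.items():
        actual = product_totals.get(key, 0)
        if abs(actual - expected) > tolerance:
            drift[key] = {"expected": expected, "actual": actual}
    return drift

snapshot = {"orders": 100_000, "returns": 1_200}  # nightly source snapshot
product = {"orders": 99_950, "returns": 1_200}    # streaming product totals
print(reconcile_counts(product, snapshot))
# {'orders': {'expected': 100000, 'actual': 99950}}
```

Count balancing will not catch every semantic defect, which is why the business-level truth comparison sits above it.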
Kafka and microservices, used with discipline
Kafka is particularly valuable when the domain event is stable and meaningful: payment authorized, shipment dispatched, claim registered, account closed. It is less valuable when used to externalize every internal state mutation as if consumers should infer business truth from implementation chatter.
A rule of thumb: if a consumer must understand your microservice internals to interpret your event stream, you did not publish a data product. You leaked a component.
For reliability, keep these practices:
- publish versioned domain events, not ORM deltas
- use schema compatibility governance
- preserve idempotency keys
- support replay with documented side effects
- isolate enrichment dependencies so noncritical enrichments do not fail the core product
- maintain a compacted canonical state stream where appropriate
- separate “fact happened” events from “current projection” products
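The idempotency-key practice deserves a concrete shape. A minimal sketch of a consumer-side projection that survives replays and backfills — the event fields are hypothetical:

```python
class IdempotentProjection:
    """Deduplicate by idempotency key so that replaying or backfilling a
    stream does not double-apply state changes to the projection."""
    def __init__(self):
        self._seen = set()
        self.state = {}

    def handle(self, event):
        key = event["idempotency_key"]
        if key in self._seen:
            return False          # duplicate from a replay; safely ignored
        self._seen.add(key)
        self.state[event["order_id"]] = event["status"]
        return True

proj = IdempotentProjection()
evt = {"idempotency_key": "evt-001", "order_id": "o-1", "status": "shipped"}
proj.handle(evt)   # applied
proj.handle(evt)   # replayed duplicate: ignored, state unchanged
```

In production the seen-key set would live in a store with a retention window, but the contract is the same: replay with documented side effects means replay with no side effects on already-applied facts.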
Reliability topology as an architectural artifact
Document the dependency shape visibly. Distinguish:
- hard dependencies: without them the product is invalid
- soft dependencies: enrichments that can be degraded
- control dependencies: schema registry, orchestration, metadata services
- truth dependencies: sources used for reconciliation or legal finality
This sounds mundane. It is not. Once the topology is explicit, teams can make sensible promises. They can say, for example, that the core order status remains available without customer segmentation enrichment, but not without fulfillment confirmation. That is architecture translating uncertainty into operationally useful shape.
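That hard-versus-soft distinction can be enforced at build time. A sketch, with invented field names, of a product build that fails on missing hard dependencies but degrades visibly on missing enrichments:

```python
def build_order_status(order_facts, fulfillment_facts, segments=None):
    """Hard dependencies (ordering, fulfillment) are required; the segment
    enrichment is soft and degrades to an explicit flag when unavailable."""
    if order_facts is None or fulfillment_facts is None:
        raise RuntimeError("hard dependency unavailable: product is invalid")
    record = {**order_facts, **fulfillment_facts}
    if segments is not None:
        record["segment"] = segments.get(record["customer_id"])
        record["degraded"] = False
    else:
        record["segment"] = None
        record["degraded"] = True  # consumers see the degradation explicitly
    return record

rec = build_order_status(
    {"order_id": "o-1", "customer_id": "c-9", "status": "shipped"},
    {"carrier": "DHL", "eta": "2025-01-03"},
    segments=None,  # enrichment feed is down; the core product is still valid
)
print(rec["degraded"])  # True
```

The `degraded` flag is the operational promise from the text made executable: core order status without segmentation, never without fulfillment confirmation.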
Migration Strategy
Most organizations cannot stop and redesign everything around product SLAs. Nor should they. The right migration is progressive and mildly ruthless.
Use a strangler approach.
Start by identifying high-value, heavily consumed data assets that are currently “promises.” Reporting feeds for finance, supply chain visibility, customer profile marts, or risk indicators are common candidates. Choose one where ambiguity creates visible pain.
Then move in stages.
Stage 1: Product framing
Name the product, assign ownership, define consumers, and write the semantic contract. Do not begin with platform work. Begin with language and responsibility.
Stage 2: Observe the current reliability topology
Map the actual dependencies, including ugly batch jobs and spreadsheet-fed reference tables if necessary. Architects often discover that the “real-time” product relies on a nightly correction script written by someone who left two years ago. Better to know.
Stage 3: Add reliability instrumentation
Measure freshness, lag, completeness, duplicate rates, null anomalies, and reconciliation drift. At this stage you are not improving the product yet; you are making its current truth visible.
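Stage 3 instrumentation can start very small. A sketch that computes two of the listed metrics from a batch of product records — field names are illustrative:

```python
def observe_product(records, critical_fields=("order_id", "status")):
    """Stage 3: measure, don't fix. Returns the duplicate rate and the
    share of records missing critical attributes for a batch."""
    seen, duplicates, incomplete = set(), 0, 0
    for r in records:
        if r["event_id"] in seen:
            duplicates += 1
        seen.add(r["event_id"])
        if any(r.get(f) is None for f in critical_fields):
            incomplete += 1
    n = len(records)
    return {"duplicate_rate": duplicates / n,
            "missing_critical_rate": incomplete / n}

batch = [
    {"event_id": "e1", "order_id": "o-1", "status": "shipped"},
    {"event_id": "e1", "order_id": "o-1", "status": "shipped"},  # duplicate
    {"event_id": "e2", "order_id": "o-2", "status": None},       # incomplete
    {"event_id": "e3", "order_id": "o-3", "status": "picked"},
]
print(observe_product(batch))
# {'duplicate_rate': 0.25, 'missing_critical_rate': 0.25}
```

Publishing these numbers before fixing anything is deliberate: the façade in Stage 4 is only credible if the current truth is already visible.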
Stage 4: Introduce a product façade
Expose a controlled API, topic, or table with versioned contract while retaining the old pipelines behind it. This is the strangler seam. Consumers begin shifting to the façade even while internals remain messy.
Stage 5: Separate core from enrichment
Refactor the build so hard dependencies drive a minimum viable reliable product, and enrichments are layered with degraded modes. This often yields the fastest reliability gains.
Stage 6: Implement reconciliation loops
Bring in periodic snapshots, CDC comparisons, or source-balancing controls. Define correction pathways and exception handling. This is where “eventually consistent” becomes “eventually accountable.”
Stage 7: Replace fragile internals incrementally
Migrate legacy ETL, bespoke joins, and uncontrolled extracts toward event-driven or CDC-fed pipelines, one dependency at a time, without changing the consumer contract.
A migration like this works because it does not ask the enterprise to adopt a new religion. It asks teams to make one important product honest, then another.
Enterprise Example
Consider a global retailer trying to build a near-real-time Available-to-Promise Inventory data product.
This is a classic enterprise trap. Sales channels want one number: how many units can we confidently sell? But the answer spans multiple bounded contexts:
- Inventory management knows stock positions.
- Warehouse management knows picks, packs, and damages.
- Order management knows reservations and cancellations.
- Purchasing knows inbound receipts.
- Merchandising knows product hierarchies and substitutions.
The first attempt is usually a heroic integration. Kafka streams collect stock changes, reservations, and receipts. A microservice aggregates them into an ATP topic and a reporting table. Everyone cheers. Then Black Friday arrives.
Failure appears in familiar ways:
- reservation release events are delayed
- returns are posted late from stores
- warehouse damage adjustments come in bursts
- product substitution rules change mid-day
- one region backfills missed messages and duplicates reservations
The ATP number remains available. It is just wrong enough to hurt.
A better architecture treats ATP as a data product with an explicit reliability class. It is operational. Therefore:
- core ATP excludes noncritical enrichments
- channel-specific confidence flags are added
- freshness SLO is published by region
- reconciliation runs every 30 minutes against warehouse and OMS snapshots
- discrepancies above threshold cause channels to shift from “sell confidently” to “sell conservatively”
- a daily final ATP ledger supports finance and supplier settlement
Notice the design choice: there is no single magical truth. There is an operational truth with service guarantees and a reconciled financial truth with stronger completeness guarantees. Different consumers, different semantics, different reliability contracts. That is not inconsistency. That is mature architecture.
The retailer also used a strangler migration. Existing nightly inventory marts were not ripped out. Instead, the ATP façade first served only digital channels in two regions, while legacy replenishment reports continued using batch figures. Over time, reconciliation confidence improved, upstream event quality improved, and more consumers moved to the product. The migration succeeded because trust expanded gradually. Enterprises do not adopt truth all at once.
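The threshold-driven shift between selling modes can be sketched in a few lines. The 2% threshold and the figures below are invented for illustration:

```python
def atp_sell_mode(streaming_atp, snapshot_atp, drift_threshold=0.02):
    """When reconciliation drift between the streaming ATP figure and the
    latest warehouse/OMS snapshot exceeds the threshold, channels shift
    from selling confidently to selling conservatively on the lower figure."""
    drift = abs(streaming_atp - snapshot_atp) / max(snapshot_atp, 1)
    if drift > drift_threshold:
        return "sell-conservatively", min(streaming_atp, snapshot_atp)
    return "sell-confidently", streaming_atp

mode, units = atp_sell_mode(streaming_atp=1_050, snapshot_atp=1_000)
print(mode, units)  # sell-conservatively 1000  (5% drift > 2% threshold)
```

The design choice is visible in the return value: the system does not pretend to know the true number; it publishes a confidence mode alongside a defensible one.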
Operational Considerations
A product with an SLA changes operations.
First, the owning team must run it like a service. That means alerts on business-level indicators, not just CPU and failed jobs. Monitor:
- end-to-end lag from source fact to published product
- source-to-product record count divergence
- semantic anomalies, like impossible state transitions
- percentage of records missing critical reference attributes
- replay and backfill effects
- correction backlog
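Monitoring for impossible state transitions is the most domain-flavored item on that list. A sketch, using a hypothetical order lifecycle (the real one comes from the domain model):

```python
# Hypothetical order lifecycle; the authoritative version belongs to the domain.
VALID_TRANSITIONS = {
    "created": {"allocated", "cancelled"},
    "allocated": {"picked", "cancelled"},
    "picked": {"shipped"},
    "shipped": {"delivered", "returned"},
}

def impossible_transitions(events):
    """Business-level monitoring: flag state changes the lifecycle does
    not permit, rather than alerting only on CPU and failed jobs."""
    last, anomalies = {}, []
    for e in events:
        prev = last.get(e["order_id"])
        if prev is not None and e["status"] not in VALID_TRANSITIONS.get(prev, set()):
            anomalies.append(e)
        last[e["order_id"]] = e["status"]
    return anomalies

stream = [
    {"order_id": "o-1", "status": "created"},
    {"order_id": "o-1", "status": "shipped"},  # skipped allocation and pick
]
print(impossible_transitions(stream))
# [{'order_id': 'o-1', 'status': 'shipped'}]
```

An alert from this check means something a pipeline metric never can: the data is flowing and still wrong.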
Second, incident handling must be consumer-aware. If a data product breaches freshness but remains complete by daily cut-off, some consumers may tolerate it. Others may not. Runbooks should distinguish degradation modes. “Data delayed but safe” is different from “data current but untrusted.”
Third, versioning matters. Changes to semantics are not mere schema evolution. Renaming a field is easy. Changing what “active customer” means is a business event that deserves versioned publication and migration support.
Fourth, governance should be thin but firm. Central architecture should not own every product. It should set minimum reliability standards, taxonomy, and review for high-impact products. Domain teams should own the actual commitments.
Tradeoffs
This approach is not free.
The first tradeoff is speed versus explicitness. Teams can publish data faster if they avoid semantic debates and SLA commitments. They can also create a larger downstream cost that someone else will pay later. Architecture is often the art of slowing down the first release so the enterprise can move faster for the next five years.
The second tradeoff is autonomy versus consistency. Domain teams should own their products, but without enterprise reliability standards the result is a marketplace of incompatible promises. Too much central control creates bottlenecks. Too little creates entropy.
The third tradeoff is streaming elegance versus reconciliation discipline. Architects sometimes fall in love with pure event-driven flow. Real enterprises need correction loops, snapshots, and periodic balancing. That looks less glamorous on a conference slide and works better in production.
The fourth tradeoff is cost. Rich observability, replay support, data quality checks, exception workflows, and reconciliation stores all add platform and operating expense. If the product is low value or exploratory, that cost may not be justified.
That is exactly why reliability classes matter.
Failure Modes
There are several predictable ways this pattern goes wrong.
Semantic overreach
A team declares its product “enterprise canonical” when it only reflects one bounded context. Consumers infer more universality than the model supports. Misuse follows.
SLA vanity
Teams publish attractive SLAs they cannot sustain because the topology underneath is unstable. This is common when downstream commitments are set before dependency analysis.
Hidden manual controls
The product appears automated, but quality is preserved by undocumented analyst intervention, spreadsheet fixes, or ad hoc SQL. Eventually the humans are unavailable and trust collapses.
Reconciliation theater
A reconciliation process exists, but only as a report no one acts on. Drift is measured and ignored. A reconciliation mechanism without correction ownership is bookkeeping, not reliability.
Overcoupled consumers
Consumers depend on incidental fields, inferred meanings, or producer timing quirks. Even with a good contract, they bind themselves to accidental behavior and blame the platform when it changes.
“Real-time” absolutism
Everything is forced into low-latency pipelines, including data that should settle in batch with stronger completeness. The result is expensive instability pretending to be modernity.
When Not To Use
Do not wrap every dataset in this machinery.
If a dataset is exploratory, internal to one team, or low-consequence, a lighter product model is fine. Document it honestly as best-effort. Not every data asset needs formal SLOs, reconciliation engines, and topological dependency mapping.
Do not use this pattern where semantics are still deeply unsettled. If the business cannot yet agree on what “customer profitability” means, codifying a strict product SLA can calcify the wrong model. In such cases, prototype first, then productize once the language stabilizes.
Do not pretend a data product can solve organizational irresponsibility. If no domain team is willing to own production support, no amount of platform tooling will create trust.
And do not confuse legal system of record with operational data product. Sometimes the right answer is still to query the authoritative transactional system for a final decision, especially in regulated or financial scenarios.
Related Patterns
Several patterns sit naturally alongside this one.
- Domain events: useful when events represent stable business facts.
- CDC pipelines: strong for migration and reconciliation, especially where legacy systems cannot publish clean events.
- CQRS projections: helpful when separating write models from read-oriented product views.
- Data mesh: useful as an organizational model, but only when paired with actual reliability contracts.
- Strangler fig migration: essential for replacing legacy feeds progressively without consumer chaos.
- Outbox pattern: improves event publication reliability from transactional services.
- Golden record / MDM: relevant for certain master domains, though often overused as a universal answer.
- Lakehouse medallion layers: useful internally, but not a substitute for product semantics or SLAs.
A final note on “source of truth.” Enterprises misuse the phrase constantly. Truth is contextual. Authority belongs to bounded contexts for specific meanings. Reliability belongs to interfaces. Reconciliation belongs to operations. If you separate those concerns, the architecture becomes calmer and much more honest.
Summary
A data product without an SLA is not a product. It is a hope wrapped in metadata.
The enterprise architecture task is to turn that hope into a managed interface with clear domain semantics, explicit reliability objectives, visible dependency topology, and built-in reconciliation. That is what makes downstream automation safe, reporting dependable, and cross-domain decision-making sane.
The key ideas are straightforward:
- use domain-driven design to anchor meaning
- classify products by reliability need
- make dependency topology explicit
- separate core facts from enrichments
- design reconciliation as a first-class capability
- migrate progressively using strangler seams
- treat Kafka and microservices as tools, not proof of modernity
In the end, reliability topology is simply the map of where trust comes from and where it can fail. Good architects draw that map before the business trips over it. Bad architects call the result “self-serve” and wait for the incident.
The difference is not technical sophistication. It is whether we are willing to make promises we can actually keep.
Frequently Asked Questions
What is a data mesh?
A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.
What is a data product in architecture terms?
A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.
How does data mesh relate to enterprise architecture?
Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.