Most data platforms do not fail because they lack data. They fail because nobody can find the right data, trust what they find, or understand what it means before the meeting ends and the decision gets made anyway.
That is the dirty secret behind many so-called modern data estates. Companies pour money into lakes, warehouses, catalogs, event streams, semantic layers, AI tooling, and governance programs. Then a product team asks a simple question — which customer data product should I use for churn prediction in Spain? — and the answer arrives as folklore. “Talk to Maria.” “There’s a table in Snowflake.” “Maybe the CRM team publishes that to Kafka.” “Ignore the old dashboard; finance doesn’t trust it.” At that point, architecture has already failed.
In a data mesh, this gets sharper, not softer. A mesh creates the conditions for scale by distributing ownership to domains. But distributed ownership without discoverability is just a polite way of producing chaos at departmental speed. If every domain publishes data products but nobody can reliably discover, evaluate, and adopt them, the mesh becomes a marketplace where every stall has its own sign, currency, and opening hours.
Data product discovery is therefore not a side feature. It is one of the load-bearing walls of data mesh architecture. Not glamorous. Not usually the headline in vendor decks. But essential.
This article looks at data product discovery as an enterprise architecture problem rather than a catalog feature. We will examine the forces behind it, the role of domain semantics, the supporting architecture, migration strategy, operational concerns, and the places where this pattern breaks down. I’ll also argue for a view that many teams resist at first: discovery is not just search. It is the combination of semantics, contracts, trust signals, lineage, access paths, and organizational intent.
If you get this right, teams can move from “where is the data?” to “is this the right product for my use case?” That is a profound shift. The first is scavenging. The second is engineering.
Context
Data mesh emerged as a reaction to the familiar collapse of centralized data platforms. A single team, usually called data engineering or analytics engineering or something equally overloaded, becomes responsible for ingesting, modeling, governing, serving, and explaining data for the whole enterprise. For a while it works. Then the business expands, digital channels multiply, regulations tighten, source systems proliferate, and the central team becomes a ticket queue with a logo.
The mesh idea is compelling because it matches enterprise reality. Sales understands sales data. Claims understands claims. Supply chain understands inventory movements. Fraud understands fraud signals. These are not just datasets. They are domain concepts with histories, business constraints, and political borders. Domain-driven design has taught us for years that language matters, boundaries matter, and ownership matters. Data mesh simply applies that lesson to analytical and operational data at scale.
But there is a catch.
Once domains own and publish data products, the enterprise now has a portfolio of independently evolving products. They may be exposed through tables, files, APIs, Kafka topics, feature stores, semantic views, or governed query endpoints. They may have different service levels, freshness guarantees, schemas, and legal restrictions. Consumers now face abundance instead of scarcity. That sounds better. It often is not.
A central warehouse with poor modeling gives you one type of pain: bottlenecks. A mesh without discovery gives you another: fragmentation. In both cases, delivery slows, trust erodes, and people build spreadsheets in self-defense.
So discovery becomes the connective tissue of the mesh. Not a monolithic portal. A capability.
Problem
Most organizations treat discovery as a metadata catalog problem. They buy a catalog, crawl technical metadata, index some schemas, maybe attach glossary terms, and call it done. Then they wonder why business teams still ask Slack channels for the “real” revenue table.
The reason is simple. Discovery is not the same as inventory.
An inventory tells you what exists. Discovery tells you what is fit for purpose.
To discover a data product, a consumer needs answers to practical questions:
- What business concept does this represent?
- Which domain owns it?
- What is the bounded context?
- Is it authoritative or derived?
- What are the quality guarantees?
- How fresh is it?
- What access method should I use — SQL, event stream, API, file export?
- What policy constraints apply?
- What are the downstream dependencies?
- If this breaks, who gets paged?
- If I use it, what semantics am I inheriting?
Those are not technical footnotes. They are adoption criteria.
Worse, data products rarely fail discovery in obvious ways. The schema may be present. The documentation may exist. The issue is semantic ambiguity. “Customer” means prospect in one domain, active account in another, legal entity in a third, billing parent in a fourth. “Order” could mean cart submission, payment authorization, ERP booking, or shipped fulfillment line. A search engine cannot fix that. A domain model can.
This is why data product discovery lives squarely in the territory of domain-driven design. A mesh succeeds when data products are discoverable through business language grounded in bounded contexts. If not, every domain exports its own worldview and calls it reusable. That is not interoperability. That is semantic leakage.
Forces
A good architecture appears when you face the forces honestly. Here, the forces pull in opposite directions.
Domain autonomy vs enterprise consistency
Domains need the freedom to publish products in ways that suit their workflows and technologies. Sales may expose curated account snapshots in a warehouse. Fulfillment may publish shipment events via Kafka. Risk may serve feature vectors through an online store. But consumers need some consistent way to understand and compare these products across the enterprise.
Too much standardization kills autonomy. Too little creates a bazaar.
Speed of publication vs trustworthiness
You want domains to publish products quickly. But if they can publish without quality metrics, lifecycle state, ownership details, or policy labels, discovery becomes a junk drawer. Search results fill with half-finished artifacts, abandoned tables, and duplicate “gold” datasets. Teams stop trusting the discovery platform and go back to personal networks.
Trust is earned through friction in the right place.
Technical metadata vs business semantics
Automated scanners are good at collecting schemas, lineage, partitions, query history, and freshness. They are terrible at answering whether a product represents “net revenue recognized under IFRS 15” or “customer household for anti-money-laundering exposure.” Yet those are often the decisive questions.
You need machines for scale and people for meaning.
Central governance vs federated ownership
Regulation, security, and audit demand central control points. Data mesh demands federated accountability. Discovery sits right in the middle. It must expose usage policies, retention rules, and sensitivity classifications while still reflecting domain ownership and local semantics.
This is where many mesh programs lose their nerve and quietly rebuild central command.
Real-time discovery vs stable curation
If products are changing daily across topics, tables, contracts, and APIs, discovery information can drift. Real-time metadata sync sounds attractive, but a fully dynamic catalog without curation often turns into noise. Conversely, curated approval workflows can become stale and bureaucratic.
The right answer is usually a layered model: automate collection, human-curate meaning, compute trust continuously.
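That layered model can be sketched in a few lines: collectors fill in evidence automatically, humans contribute the curated description, and a trust score is recomputed continuously from both. Everything below (field names, weights, the scoring formula) is an illustrative assumption, not a real platform API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ProductEvidence:
    """Signals collected automatically, plus one human-curated flag."""
    last_refreshed: datetime
    freshness_slo: timedelta
    quality_checks_passed: int
    quality_checks_total: int
    incidents_last_90_days: int
    has_curated_description: bool  # the human-curated part

def trust_score(e: ProductEvidence, now: datetime) -> float:
    """Blend automated evidence into a 0..1 score, recomputed continuously."""
    fresh = 1.0 if now - e.last_refreshed <= e.freshness_slo else 0.0
    quality = e.quality_checks_passed / max(e.quality_checks_total, 1)
    incident_penalty = min(e.incidents_last_90_days * 0.1, 0.5)
    curation_bonus = 0.2 if e.has_curated_description else 0.0
    raw = 0.4 * fresh + 0.4 * quality + curation_bonus - incident_penalty
    return max(0.0, min(1.0, raw))
```

The exact weights matter less than the shape: trust is derived from evidence and recomputed, not declared once at registration and left to rot.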
Solution
A robust data product discovery architecture combines four ideas:
- Data products as first-class domain artifacts
- A federated discovery plane
- Semantic descriptions aligned to bounded contexts
- Trust signals and policy-aware access embedded in discovery
The core idea is opinionated: do not build discovery around datasets. Build it around data products. A dataset is a storage artifact. A data product is an intentional thing with an owner, consumers, semantics, guarantees, and interfaces.
That means a discoverable product should include, at minimum:
- domain and subdomain
- bounded context
- business description
- product classification: source-aligned, aggregate, consumer-aligned, or shared/master
- interfaces: table, stream, API, file, semantic layer
- schema and contract versions
- quality indicators
- freshness and SLA/SLO
- lineage and dependencies
- security classification and access policy
- lifecycle stage: experimental, active, deprecated, retired
- owner and support channel
- usage examples and known consumers
This is not busywork. This is how the enterprise knows what it is looking at.
A useful mental model is this: discovery is the storefront, but also the receipt, the ingredient label, and the recall notice. If all you have is the storefront, buyers get burned.
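One way to make that field list concrete is a machine-readable descriptor the platform can validate at publication time. A minimal sketch in Python; the field names and lifecycle vocabulary are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass, field

LIFECYCLE_STAGES = {"experimental", "active", "deprecated", "retired"}

@dataclass
class DataProductDescriptor:
    """A data product as an intentional thing, not a storage artifact."""
    name: str
    domain: str
    bounded_context: str
    description: str
    classification: str        # source-aligned, aggregate, consumer-aligned, shared
    interfaces: list           # e.g. ["table", "stream", "api"]
    contract_version: str
    freshness_slo: str         # e.g. "daily by 06:00 UTC"
    sensitivity: str           # e.g. "pii-restricted"
    lifecycle: str
    owner: str
    support_channel: str
    known_consumers: list = field(default_factory=list)

    def __post_init__(self):
        # Reject descriptors with an unknown lifecycle stage at publish time.
        if self.lifecycle not in LIFECYCLE_STAGES:
            raise ValueError(f"unknown lifecycle stage: {self.lifecycle}")
```

The point of the structure is that every item in the list above becomes a required, checkable field rather than a wiki paragraph.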
Discovery as a federated platform capability
The platform team should provide discovery infrastructure, standards, APIs, and UX. Domains should provide content, semantics, and stewardship. That split matters. It mirrors a healthy platform model in data mesh: the platform makes the right thing easy; domains remain responsible for the meaning of what they publish.
Here is a high-level discovery architecture.
Notice what is absent: a giant central team manually maintaining every description. That model dies under enterprise scale. The discovery plane must ingest technical metadata automatically, expose self-service publishing, and require domains to own semantic descriptions and product contracts.
Domain semantics first
This is where domain-driven design earns its keep.
A discovery platform should not merely tag assets with glossary terms. It should expose bounded contexts, domain vocabularies, aliases, canonical terms where appropriate, and explicit semantic differences where terms diverge.
For example:
- In Retail Sales, “customer” may refer to a shopper identity.
- In Billing, “customer” may refer to the invoiced legal entity.
- In Service Operations, “customer” may refer to the account holder tied to support entitlements.
A good discovery experience should not flatten these into one magical enterprise customer concept unless the organization truly has one. Most don’t. Better to show relationships and context boundaries than to fake semantic unity.
That is one of the hard truths in enterprise architecture: pretending the business is simpler than it is will not make the systems simpler. It just makes the misunderstandings more expensive.
Architecture
A practical discovery architecture usually has six layers.
1. Product registration and contracts
Domains register products through APIs or CI/CD pipelines. Registration includes machine-readable contracts for schemas, interfaces, ownership, lifecycle state, and service expectations. For Kafka topics, this may integrate with schema registry and event contract definitions. For warehouse products, it may connect to dbt metadata, table contracts, and semantic models.
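A registration gate in the CI/CD pipeline might look like the following sketch: it returns a list of problems, and an empty list means the product may publish. The required field set and the retirement rule are hypothetical examples of the kind of checks involved.

```python
REQUIRED_FIELDS = {
    "domain", "owner", "description", "interfaces",
    "schema_version", "lifecycle", "sensitivity",
}

def validate_registration(payload: dict) -> list:
    """CI/CD gate: broken registrations never reach the discovery plane.

    Returns a list of problems; an empty list means the product may publish.
    """
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - payload.keys())]
    # Example of a cross-field rule a real gate might enforce.
    if payload.get("lifecycle") == "retired" and payload.get("known_consumers"):
        problems.append("retired product still lists active consumers")
    return problems
```

Running this as a pipeline step, rather than a review meeting, is what keeps publication friction in the right place.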
2. Metadata ingestion
Automated collectors pull technical metadata from warehouses, lakehouses, catalogs, Kafka clusters, API gateways, lineage tools, orchestration systems, and access-control systems. This keeps the discovery plane current without human drudgery.
3. Semantic layer
This is the underbuilt part in most enterprises. A semantic layer for discovery is not only a BI semantic model. It is the set of domain concepts, relationships, business definitions, bounded-context markers, synonyms, and references to canonical business language where it exists. Some organizations use a business glossary; better ones connect glossary terms directly to data products and contracts.
4. Trust and governance layer
Discovery must expose quality scores, freshness, incident history, certification status, policy labels, and approved usage patterns. If a product contains PII restricted to a lawful purpose, consumers should see that before requesting access, not after integrating it into an application.
5. Experience layer
Consumers need multiple paths:
- search by business term
- browse by domain
- browse by business capability
- browse by product type
- browse by event stream vs analytical dataset
- query by policy or region
- lineage-driven discovery from a known report or API
A single search box is not enough.
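Several of those paths can be served from one registry with different filters rather than separate tools. A simplified sketch, with the registry modeled as a list of descriptor dicts (an assumption for illustration):

```python
def find_products(registry, *, domain=None, term=None, region=None):
    """Filter one registry along several axes: browse by domain,
    search by business term, or query by region."""
    results = registry
    if domain:
        results = [p for p in results if p["domain"] == domain]
    if term:
        t = term.lower()
        results = [p for p in results
                   if t in p["description"].lower() or t in p["name"].lower()]
    if region:
        results = [p for p in results if region in p.get("regions", [])]
    return results
```

The design choice worth noting: every browse path is a projection over the same registry, so the paths cannot drift apart.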
6. Feedback and telemetry
A discovery system should learn from usage: failed searches, abandoned access requests, products with high bounce rates, heavily reused assets, duplicate products, and stale documentation. Discovery itself is a product. Products need telemetry.
Here is a more detailed view.
Migration Strategy
No serious enterprise starts with a clean-sheet mesh. There is always a warehouse, a lake, a reporting stack, a patchwork of APIs, and usually several “temporary” data marts old enough to vote. So the real question is not how to design ideal discovery. It is how to get there without stopping the business.
This is a classic strangler problem.
You do not rip out the central catalog and replace it with a mesh-native discovery platform overnight. You progressively wrap, reconcile, and redirect.
Step 1: Inventory what is already acting like a product
Even in centralized platforms, some datasets already behave like products. They have named owners, stable consumers, implicit SLAs, and business recognition. Start there. Product thinking often exists before product architecture does.
Step 2: Introduce a lightweight product descriptor
Create a minimal standard for product registration:
- owner
- domain
- description
- interface
- sensitivity
- lifecycle
- support contact
Make this easy enough that teams can adopt it without a program office.
Step 3: Federate descriptions before federating infrastructure
This is an important migration move. You can keep the current warehouse or lake in place while shifting semantic ownership to domains. Let domains curate product descriptions, glossary mappings, and usage guidance for the assets they already publish. This builds muscle before bigger platform changes.
Step 4: Add contract and trust metadata
Once registration exists, enrich it with schema contracts, freshness metrics, quality indicators, lineage links, and access policies. Consumers need confidence signals early.
Step 5: Redirect discovery traffic
Encourage teams to use the discovery plane as the front door even when the physical data still lives in the old platform. This matters because it decouples how people find data from where data is stored.
Step 6: Strangle legacy catalogs and ad hoc wikis
As domains mature, move searches, access requests, and support paths into the discovery platform. Retire wiki pages that duplicate metadata and decay by lunchtime.
Step 7: Expand to event products and operational interfaces
Many migrations stall because discovery focuses only on analytical tables. In a real mesh, Kafka topics, CDC streams, APIs, and feature-serving endpoints are all discoverable products. Bring them in early enough to avoid a split-brain catalog.
Reconciliation is not optional
During migration, the same business concept may exist in old and new forms. A legacy warehouse table may overlap with a domain-owned product. A Kafka stream may emit transaction events while finance publishes booked revenue snapshots. Discovery must surface these relationships clearly:
- successor/predecessor links
- equivalence or non-equivalence notes
- reconciliation rules
- authoritative-use guidance
- deprecation status
Without reconciliation, migration creates semantic duplicates and consumer confusion.
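Successor links in particular are easy to model and surprisingly valuable: they let the discovery experience answer "what replaced this?" mechanically. A sketch, with hypothetical product names and a deliberately tiny data model:

```python
class ReconciliationLinks:
    """Successor/predecessor links and equivalence notes between products,
    so discovery can explain overlaps instead of hiding them."""

    def __init__(self):
        self.successors = {}  # legacy name -> replacing product
        self.notes = {}       # (legacy, product) -> why they are not equivalent

    def link_successor(self, legacy: str, product: str, note: str = "") -> None:
        self.successors[legacy] = product
        if note:
            self.notes[(legacy, product)] = note

    def current(self, name: str) -> str:
        """Follow successor links to the currently authoritative product."""
        seen = set()
        while name in self.successors and name not in seen:
            seen.add(name)
            name = self.successors[name]
        return name
```

A consumer who searches for the legacy table lands on the product that superseded it, with the non-equivalence note alongside.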
Here is a simple strangler view.
That middle box — the discovery facade — is often the smartest first investment. It lets the enterprise behave in a mesh-like way before the storage and processing architecture fully catches up.
Enterprise Example
Consider a multinational insurer. Classic enterprise terrain: policy administration systems by country, claims platforms by line of business, CRM in one suite, finance in another, and a central data lake that has become the archaeological record of every integration decision made since 2014.
The company decides to adopt data mesh around domains such as Customer, Policy, Claims, Billing, and Fraud. The first enthusiasm is predictable. Domains publish data products. Claims emits events into Kafka for claim lifecycle changes. Billing publishes invoice and payment snapshots into Snowflake. Fraud exposes scored risk features. Customer provides mastered household relationships via API and batch extracts.
Six months later, adoption stalls.
Why? Because the data exists, but nobody can reliably discover what to use. Analysts in Europe cannot tell whether “active policyholder” should come from Customer or Policy. Fraud teams consume claim events directly from Kafka and miss corrections applied later in Claims’ curated analytical product. Finance distrusts payment products because the lineage to ERP reconciliation is unclear. Three teams create separate “gold” customer tables in the warehouse because the “Customer 360” product is semantically overloaded.
The solution is not another central model. It is a disciplined discovery capability.
The insurer introduces a federated product registry. Every product must declare:
- domain
- business purpose
- authoritative scope
- interface type
- update cadence
- geographic coverage
- sensitivity classification
- owner and support rota
- quality score and reconciliation status
Crucially, they add domain semantics. “Policyholder,” “insured party,” “billing account,” and “household” are defined within contexts, not forced into a single universal customer model. Discovery shows related concepts and warns where products should not be substituted.
For Kafka products, event streams are marked as operational facts with replay and ordering properties. For analytical products, curated tables are marked as decision-grade aggregates with reconciliation windows and posting rules. This distinction reduces misuse dramatically. Teams stop treating every topic as reporting-ready.
The insurer also adds reconciliation notes between near-duplicate products:
- Claims event stream: near-real-time operational state transitions
- Claims curated fact product: finance-aligned, corrected, booked claim view
- Billing payment stream: transactional payment attempts
- Billing settled payments product: ledger-reconciled settled payments
This sounds obvious when written down. In practice, it is transformative. The discovery experience shifts from “search all assets” to “choose the right semantic contract for your purpose.”
Within a year, reuse improves, duplicate pipelines decline, and audit conversations get shorter. That last point matters more than most architects admit. Good discovery reduces not just engineering waste but organizational argument.
Operational Considerations
Discovery itself must be run like a platform product.
Freshness of metadata
If lineage is weeks old, quality scores stale, or deprecation flags missing, consumers stop trusting the system. Metadata pipelines need SLOs too.
Ownership hygiene
A product with no active owner is not discoverable in any meaningful sense. Enforce active ownership and escalation paths. Dead owners are a silent platform failure.
Access integration
Discovery without access workflow is tourism. Useful, maybe, but not productive. Integrate policy checks, request flows, entitlements, and approval automation.
Versioning
Products evolve. Discovery must show active versions, compatibility guarantees, and deprecation dates. This is particularly important for Kafka schemas and APIs where downstream breakage can cascade.
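Compatibility is one of the few discovery questions that can be answered fully automatically. A semver-style sketch, assuming minor versions are additive; real product contracts may define compatibility differently.

```python
def is_compatible(consumer_pinned: str, product_version: str) -> bool:
    """Semver-style convention: compatible when major versions match and the
    product's minor version is at least the one the consumer pinned."""
    cmaj, cmin = (int(x) for x in consumer_pinned.split(".")[:2])
    pmaj, pmin = (int(x) for x in product_version.split(".")[:2])
    return cmaj == pmaj and pmin >= cmin
```

Surfacing this check in discovery, next to the deprecation date, turns "will this break me?" from a Slack thread into a lookup.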
Usage analytics
Track:
- top search terms with poor results
- products frequently viewed but rarely used
- duplicated product descriptions
- rising use of deprecated assets
- policy-related access denials by domain
These are not vanity metrics. They tell you where semantics, governance, or platform ergonomics are failing.
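The "frequently viewed but rarely used" signal, for example, falls out of two event streams the platform already has: product page views and granted access requests. A sketch with hypothetical inputs:

```python
from collections import Counter

def viewed_but_unused(views, access_grants, min_views=10):
    """Products people keep looking at but never adopt: a signal that
    semantics, trust, or access friction is failing somewhere."""
    view_counts = Counter(views)        # one entry per product page view
    granted = set(access_grants)        # products with at least one grant
    return sorted(p for p, n in view_counts.items()
                  if n >= min_views and p not in granted)
```
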
Incident visibility
If a heavily used product has repeated data quality incidents, discovery should surface that history. Trust signals must include scars, not just badges.
Tradeoffs
There is no free lunch here. Anyone selling one is probably selling a metadata crawler.
Rich semantics cost money
Capturing domain meaning requires domain experts, stewardship, and disciplined language. This is not solved by AI-generated descriptions alone. Automation helps; accountability still belongs to humans.
Standardization creates friction
Requiring product descriptors, contracts, and lifecycle states slows publication slightly. Good. A mesh without publication standards is a rumor network.
Discovery can become central control in disguise
If the platform team turns discovery into an approval chokepoint, domains will route around it. The architecture must separate required standards from excessive gatekeeping.
Trust scoring can be gamed
If product adoption depends heavily on certification badges or scores, teams will optimize the score rather than the substance. Blend automated evidence with peer review and incident data.
Semantic alignment can become theology
Some enterprises fall into endless debates about canonical definitions. Discovery should support context and relationships, not force universal truth where it does not exist.
Failure Modes
This pattern fails in predictable ways.
Catalog theater
A sleek portal, lots of indexed assets, no real semantics, stale ownership, and poor trust signals. It demos beautifully and changes nothing.
Dataset masquerading as product
Teams register raw tables as products without service commitments, business definitions, or support. Consumers get burned and stop reusing.
Centralized stewardship bottleneck
A small governance team becomes responsible for approving every product description. Publication slows to a crawl. Domains disengage.
Semantic flattening
The enterprise invents one definition for “customer,” “order,” or “revenue” and suppresses contextual variants. Discovery looks neat but misleads consumers.
Real-time everywhere syndrome
Operational event streams in Kafka are treated as universally reusable data products for every analytical need. Then replay semantics, late-arriving events, compaction, and correction logic surprise downstream consumers.
No reconciliation during migration
Legacy and mesh-era products coexist without explicit guidance. Teams compare numbers from different products and conclude the mesh is unreliable. In truth, the architecture failed to explain the difference.
When Not To Use
Data product discovery at this level is not always necessary.
Do not build an elaborate federated discovery plane if:
- you are a small organization with one or two tightly aligned data teams
- your data landscape is modest and semantically stable
- domains are not truly empowered to own products
- you still lack basic data quality, ownership, or access management
- the business has no appetite for federated governance
In those environments, a simpler catalog with clear ownership and glossary support may be enough.
Also, do not pretend to have a data mesh if every important modeling decision still goes through a central team. That is just a hub-and-spoke data platform with better branding. Discovery architecture should match organizational reality, not aspirational slideware.
Related Patterns
A strong discovery capability usually sits alongside a handful of related patterns:
- Data Product Contracts: explicit machine-readable guarantees for schemas, interfaces, and compatibility.
- Federated Computational Governance: policy enforcement distributed through platform mechanisms rather than manual review alone.
- Domain Ontology or Business Glossary: lightweight semantic structure linking domain terms to products and contexts.
- Progressive Strangler Migration: moving from centralized stores and catalogs to federated product ownership incrementally.
- Event-Carried State Transfer and CDC: relevant where Kafka and microservices publish operational data that later becomes discoverable products.
- Reconciliation Pipelines: critical for bridging operational truth, financial truth, and analytical truth during transitions.
- Golden Path Platform Tooling: templates and automation that make compliant product publication easy.
The best architectures do not treat these as isolated practices. They interlock.
Summary
Data mesh without data product discovery is like a city with roads but no addresses. Movement is possible. Delivery is not.
The point of discovery is not to help users search for data in the abstract. It is to help them find a trustworthy, semantically appropriate, policy-compliant data product that they can actually use. That requires more than technical metadata. It requires domain-driven design, ownership, contracts, quality evidence, lineage, access integration, and clear migration thinking.
The most important design choice is to model discoverable things as data products, not just assets. The most important migration choice is to introduce a discovery facade early and use it to strangle legacy find-and-ask behavior over time. And the most important semantic choice is to respect bounded contexts instead of flattening the enterprise into one fake universal vocabulary.
A good discovery platform reduces duplicate pipelines, speeds onboarding, improves governance outcomes, and lowers the social cost of finding trusted data. A bad one becomes another portal people ignore.
In enterprise architecture, that is often the real test. Not whether the capability exists on a diagram, but whether people stop asking Slack who owns the revenue table.
If they do, you are making progress. If they don’t, the mesh is still just wiring.
Frequently Asked Questions
What is a data mesh?
A data mesh is a decentralised data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.
What is a data product in architecture terms?
A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.
How does data mesh relate to enterprise architecture?
Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.