The Hardest Part of Data Platforms Is Trust

Data platforms usually fail long before the storage engine does.

Not because Kafka can’t handle throughput. Not because Snowflake, BigQuery, Databricks, or some lakehouse stack runs out of cleverness. And not because there wasn’t a reference architecture with enough boxes and arrows to impress a steering committee. They fail because the business stops trusting the numbers. Once that happens, every dashboard becomes a negotiation, every KPI gets an asterisk, and every executive meeting turns into forensic accounting.

That is the real architecture problem.

Trust is not a reporting feature. It is not something you sprinkle on with a data catalog, a lineage product, or a governance workstream. Trust is the consequence of a system that preserves meaning as data moves across operational boundaries, analytical pipelines, and organizational politics. In other words: trust is an architectural property.

This is why reconciliation architecture matters so much. Reconciliation is not glamorous. Nobody gets promoted for saying, “we can now prove yesterday’s revenue total matches across six bounded contexts.” But in large enterprises, that sentence is worth more than another machine learning pilot. If you cannot reconcile the platform with the source domains, you do not have a data platform. You have a rumor distribution network.

The hard bit is that reconciliation is not just a technical pattern. It sits at the fault line between domain-driven design, integration architecture, finance controls, operational resilience, and migration strategy. A trustworthy platform has to respect business semantics, not merely transport records at scale. It has to know the difference between an order, a shipment, an invoice, a payment, and recognized revenue. Those are not columns. They are commitments made by different parts of the enterprise at different times for different purposes.

And when those commitments drift, trust dies.

So let’s talk about the architecture that keeps that from happening.

Context

Most enterprises did not design their data estate. They inherited it.

A retailer has an ERP, a CRM, a payments gateway, a returns platform, a warehouse management system, a loyalty engine, and twenty years of spreadsheets with suspicious authority. A bank has customer systems of record, product processors, digital channels, anti-fraud services, and enough nightly batch jobs to recreate the industrial revolution. A manufacturer has plant systems, planning systems, procurement systems, and a dozen vendor platforms all insisting they are the source of truth.

Then someone starts a “modern data platform” program.

The first move is usually sensible: stream events from operational systems, consolidate data in a lake or warehouse, build canonical models, and give teams self-service access. Kafka appears. Microservices are mentioned. A medallion architecture is drawn. Data products are announced. At this stage, everyone feels modern.

Then the questions begin.

Why does sales booked in the finance mart not match the order volume in the customer analytics mart? Why does the daily active customer count differ between marketing and service operations? Why did the refund total change three days after close? Why does one pipeline say a policy is active while another says it is cancelled? Why did a backfill quietly rewrite twelve months of history?

These are not edge cases. They are the center of enterprise reality.

The platform is being asked to unify facts that were produced by different domains with different semantics and different clocks. Operations care about current state. Finance cares about legally defensible history. Customer service cares about what the agent can act on right now. Analytics cares about trendable, stable dimensions. Fraud cares about event order and anomalies. The same real-world thing appears differently in each context because the business itself sees it differently.

That is not bad design. That is the business.

A serious architecture starts there: there is no single universal truth, only domain truths with explicit translations, and trust comes from making those translations visible, testable, and reconcilable.

Problem

Enterprises often confuse integration with agreement.

Moving data from a source system into Kafka and then into a lakehouse does not mean downstream consumers understand what the data means. CDC does not magically align business meaning. Event streams do not erase boundary mismatches. A “customer created” event from one service may not represent the same lifecycle point as “customer onboarded” in another. “Order total” might include tax in one domain, exclude tax in another, and get restated after promotions settle. “Revenue” almost never means what product teams think it means.

This is where data platforms become political. Every team can explain its own number. None can explain why the numbers disagree.

The deeper issue is that most platform designs optimize for movement, not proof. They make it easy to ingest, transform, and serve data. They do not make it easy to answer the hard questions:

  • What is the authoritative source for this business fact?
  • What domain boundary does this metric belong to?
  • What semantic transformations occurred?
  • Can totals be reconciled by day, by entity, by ledger period, by event lineage?
  • What changed after first publication?
  • Which mismatches are expected, and which indicate defect or fraud?

Without these answers, trust becomes personality-driven. People trust whichever system is backed by the loudest executive, the most experienced analyst, or the team with the cleanest dashboard design.

That is not architecture. That is folklore.

Forces

This problem is hard because several forces pull in different directions at once.

Domain semantics are local, but reporting is global

Domain-driven design teaches a lesson many data teams learn too late: meaning lives inside bounded contexts. An Order in commerce is not an Invoice in finance. A Shipment in logistics is not Fulfillment in customer service. If you flatten these distinctions into a universal schema too early, you create fake consistency. The platform looks tidy and the business becomes confused.

At the same time, the enterprise needs cross-domain reporting. The CFO wants order-to-cash visibility. The COO wants fulfillment performance by promise date. The CMO wants customer lifetime value. So architecture must preserve local meanings while enabling global views.

That requires translation, not homogenization.

Event-driven architectures improve latency, but amplify inconsistency

Kafka and microservices are useful here because they expose business events close to the source and reduce dependency on nightly extracts. But event-driven systems also introduce retries, duplicates, out-of-order delivery, schema evolution, and temporal disagreement. Real systems emit events before transactions settle, after compensations occur, or without enough context for downstream correctness.

A fast wrong number is still wrong.

Finance wants finality; operations want immediacy

Operational analytics wants low latency. Finance wants controlled restatement and auditable close. These are not the same requirement. A platform that favors freshness may publish numbers that change later. A platform that favors certainty may be too slow for operational decisions.

Trust depends on making this tension explicit. “Real-time” and “reconciled” are different states.

Migration cannot stop the business

No major enterprise gets to rebuild all systems around a clean event model. Legacy systems remain. Batch interfaces remain. Mainframes remain. Half the critical semantics live in COBOL, ETL jobs, and someone’s head. So the platform must support progressive migration. It has to strangle old pipelines without pretending they can be switched off in one release.

Data quality is not just null checks

Most data quality tools check schema, completeness, uniqueness, and value ranges. Useful, but insufficient. Enterprises fail on semantic quality: missing lifecycle transitions, impossible state combinations, double counting after replay, invalid joins across slowly changing dimensions, and metrics computed before business processes are complete.

The worst defects are usually semantically plausible.

Solution

The solution is to build the platform around reconciled domain facts, not raw centralized truth claims.

That sounds subtle. It is not.

The core idea is this: each domain publishes facts in its own language; the platform preserves those facts with lineage and time semantics; reconciliation services compare, align, and certify cross-domain views; consumers use certified data products appropriate to their purpose.

In other words, stop trying to make every downstream dataset universally authoritative. Instead, make authority contextual and reconciliation explicit.

A practical architecture usually has five layers of responsibility:

  1. Operational source domains. Systems of record and domain microservices emit transactions, events, and snapshots according to their own bounded context.

  2. Ingestion with provenance. CDC, event streaming, and batch ingestion capture source data without losing source identifiers, timestamps, version markers, or extraction lineage.

  3. Domain-aligned persistence. Raw and refined stores preserve source semantics. This is where the platform resists the temptation to prematurely create one giant canonical model.

  4. Reconciliation and certification. Services or jobs compare cross-domain totals, entity states, lifecycle transitions, and ledger-level balances. They produce discrepancy records, explainability artifacts, and certification status.

  5. Purpose-built data products. Finance, operations, product, and customer analytics consume curated views with explicit semantic contracts: preliminary, operational, reconciled, restated, or official.

That certification step is what most platforms miss.

Here is a simple reconciliation architecture view.

[Diagram 1: reconciliation architecture view]

The reconciliation engine can be implemented in several ways: SQL jobs in a warehouse, stream processors over Kafka, dedicated control services, or combinations of all three. The right choice depends on latency needs and operational maturity. But the architectural role stays the same: prove or disprove consistency against business-defined rules.
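
Whatever the implementation, that role can be sketched in a few lines. The following Python is purely illustrative (the names ReconRule and evaluate, and the example figures, are assumptions, not any product's API): a business-defined rule is a named pair of measurements plus an accepted tolerance, and the engine's job is to evaluate it and emit evidence, not just a boolean.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReconRule:
    name: str
    source_value: Callable[[], float]   # e.g. a SQL query against the source mart
    target_value: Callable[[], float]   # e.g. a SQL query against the certified product
    tolerance: float = 0.0              # business-accepted variance, often zero

def evaluate(rule: ReconRule) -> dict:
    """Evaluate one rule and emit an evidence record, not just pass/fail."""
    src, tgt = rule.source_value(), rule.target_value()
    diff = abs(src - tgt)
    return {
        "rule": rule.name,
        "source": src,
        "target": tgt,
        "difference": diff,
        "status": "pass" if diff <= rule.tolerance else "fail",
    }

# Hypothetical daily control: invoice totals, ERP vs certified finance product
result = evaluate(ReconRule(
    name="invoice_total_by_day",
    source_value=lambda: 120_450.00,
    target_value=lambda: 120_449.50,
    tolerance=1.00,
))
assert result["status"] == "pass"
```

The point of returning the full evidence record rather than a boolean is that the discrepancy itself becomes data the control plane can store, trend, and route.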

Domain semantics first

A trustworthy platform starts with explicit domain language.

For example, in order-to-cash:

  • Commerce owns Order Placed
  • Fulfillment owns Shipment Dispatched
  • Billing owns Invoice Issued
  • Payments owns Payment Captured
  • Finance owns Revenue Recognized

These are not phases of one universal entity. They are domain facts with causal relationships. A reconciliation architecture does not collapse them into one “sales” table and hope for the best. It preserves each fact, then defines how they relate and where they are expected to align.

This is classic domain-driven design thinking applied to data architecture. Bounded contexts matter just as much in analytics as in application design. Maybe more.

Reconciliation as a first-class capability

Reconciliation should produce machine-readable outcomes, not just monthly analyst exercises. Typical controls include:

  • event counts by source and target
  • monetary totals by business date and legal entity
  • state transition completeness
  • one-to-one or one-to-many match rates
  • duplicate and replay detection
  • orphan records
  • late-arriving adjustments
  • period-over-period variance thresholds
  • source-to-certified lineage proof

The output should classify records and aggregates, for example:

  • matched
  • matched with timing difference
  • unmatched source
  • unmatched target
  • semantically invalid
  • restated after certification
  • accepted discrepancy

That classification is gold. It turns vague distrust into manageable operational work.
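
A minimal sketch of that classification, assuming a per-key source/target comparison (the function name and flags are illustrative; "restated after certification" is detected from version history rather than a row pair, so it is omitted here):

```python
def classify(source, target, timing_lag_ok=False, accepted=False):
    """Bucket one source/target comparison into the outcome classes above."""
    if source is None:
        return "unmatched_target"      # target fact with no source counterpart
    if target is None:
        return "unmatched_source"      # source fact that never reached the platform
    if source == target:
        return "matched"
    if timing_lag_ok:
        return "matched_with_timing_difference"
    if accepted:
        return "accepted_discrepancy"  # known, signed-off business difference
    return "semantically_invalid"
```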

Architecture

There is no single diagram that captures trust. But there are structures that make it more likely.

Logical architecture

A few opinions are worth stating plainly.

First, keep raw source capture and domain-refined models separate. Raw capture is for provenance and replay. Refined models are for semantics. Mixing the two creates confusion when backfills and schema changes arrive.

Second, do not treat Kafka as your truth layer. Kafka is a transport and temporal log. It is brilliant at moving facts and replaying streams. It is not sufficient on its own for audited trust, especially when retention, compaction, schema drift, and cross-topic joins enter the story.

Third, certified data products should carry status metadata. Consumers need to know if they are seeing provisional, reconciled, or restated data. Hiding that distinction is how trust erodes.
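
One lightweight way to make that status machine-readable is to version data products with an explicit certification state. This is a hedged sketch, not a standard schema; the enum values mirror the states named in the text:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Certification(Enum):
    PROVISIONAL = "provisional"   # published fast, may change
    RECONCILED = "reconciled"     # controls passed for the period
    RESTATED = "restated"         # supersedes an earlier certified version

@dataclass(frozen=True)
class DataProductVersion:
    name: str
    business_date: str
    certification: Certification
    restatement_of: Optional[str] = None  # marker pointing at the superseded version

# An operational dashboard can display this; a finance close job can refuse it:
daily_sales = DataProductVersion("net_sales", "2024-06-01", Certification.PROVISIONAL)
```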

Reconciliation patterns

There are usually three kinds of reconciliation, and mature platforms use all of them.

1. Aggregate reconciliation

Compare totals across systems by business key and period. This is common for finance and control functions.

Example:

  • sum of invoice amounts by legal entity and accounting date
  • count of shipments by warehouse and dispatch day
  • total captured payments vs ledger postings

This is the cheapest place to start and catches more defects than teams expect.
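
An aggregate control like the ones above is essentially a grouped sum on each side followed by a diff. A minimal Python sketch, assuming rows arrive as dicts (in practice this is usually a SQL join in the warehouse):

```python
from collections import defaultdict

def totals_by_key(rows, key_fields, amount_field):
    """Sum an amount by a composite business key, e.g. (legal_entity, date)."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in key_fields)] += row[amount_field]
    return dict(totals)

def aggregate_diff(source_rows, target_rows, key_fields, amount_field, tol=0.01):
    """Return only the keys where source and target totals disagree."""
    src = totals_by_key(source_rows, key_fields, amount_field)
    tgt = totals_by_key(target_rows, key_fields, amount_field)
    return {
        key: round(src.get(key, 0.0) - tgt.get(key, 0.0), 2)
        for key in set(src) | set(tgt)
        if abs(src.get(key, 0.0) - tgt.get(key, 0.0)) > tol
    }
```

Note that the output is keyed by business dimensions, so a failure already says where to look: which entity, which day.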

2. Entity reconciliation

Match individual business entities or lifecycle instances.

Example:

  • every shipped order must map to a shipment event and a dispatch record
  • every invoice should trace back to one or more order lines
  • every captured payment must link to a customer account and settlement batch

This requires stronger keys, better identity resolution, and more domain knowledge.
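
Once keys exist, the core of entity reconciliation is set arithmetic. A sketch, assuming both sides share a resolvable business key (the function name is illustrative):

```python
def entity_match(source_rows, target_rows, key):
    """Set-based entity matching on a shared business key."""
    src_keys = {row[key] for row in source_rows}
    tgt_keys = {row[key] for row in target_rows}
    return {
        "matched": src_keys & tgt_keys,
        "unmatched_source": src_keys - tgt_keys,  # e.g. shipped but never invoiced
        "unmatched_target": tgt_keys - src_keys,  # orphan records in the platform
    }
```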

3. Semantic reconciliation

Prove that business meaning still holds after transformation.

Example:

  • cancelled orders must not contribute to booked sales after a certain state
  • revenue cannot be recognized before fulfillment conditions are met
  • refunds must offset original payment logic in reporting windows

This is the most valuable and the most expensive. It is where architecture stops being plumbing and becomes business design.
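
Semantic rules are most usefully expressed as named predicates over joined business records, so that a violation becomes a discrepancy record rather than a pipeline crash. A hedged sketch of the first two example rules (field names are assumptions about the joined record shape):

```python
def no_cancelled_booked_sales(rec):
    """Cancelled orders must not contribute to booked sales."""
    return not (rec["order_status"] == "cancelled" and rec["in_booked_sales"])

def revenue_after_fulfilment(rec):
    """Revenue cannot be recognized before fulfilment conditions are met."""
    return not (rec["revenue_recognized"] and not rec["fulfilled"])

RULES = [no_cancelled_booked_sales, revenue_after_fulfilment]

def semantic_violations(records):
    """Return (entity, rule) pairs for the discrepancy store."""
    return [(rec["order_id"], rule.__name__)
            for rec in records
            for rule in RULES
            if not rule(rec)]
```

The rule's name travels with the violation, which is what lets exception workflows route it to the right domain owner.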

Control-plane thinking

One useful pattern is to build reconciliation as a control plane over data flows rather than embedding all checks inside pipelines. Pipelines still perform local validations, but a central control capability tracks datasets, expected controls, thresholds, certification state, and exceptions.

That control plane can answer practical questions:

  • Which datasets failed reconciliation today?
  • Which failures block official reporting?
  • Which discrepancies are timing-related vs actual defects?
  • Which downstream products consumed uncertified inputs?

This is how trust scales organizationally.
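
The control plane itself can start very small: a registry of control results that the questions above become queries over. An illustrative sketch, not a product design (in reality this would be backed by a store and an API):

```python
class ControlPlane:
    """Minimal registry of reconciliation outcomes across datasets."""
    def __init__(self):
        self.results = []  # (dataset, control, status, blocks_official_reporting)

    def record(self, dataset, control, status, blocking=False):
        self.results.append((dataset, control, status, blocking))

    def failed_today(self):
        return {d for d, _, status, _ in self.results if status == "fail"}

    def blocking_failures(self):
        return {d for d, _, status, blocking in self.results
                if status == "fail" and blocking}

cp = ControlPlane()
cp.record("net_sales", "invoice_total_by_day", "fail", blocking=True)
cp.record("orders", "event_count_vs_source", "pass")
assert cp.blocking_failures() == {"net_sales"}
```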

Migration Strategy

You cannot big-bang your way into trust. Enterprises that try usually create two bad platforms instead of one good one.

The right migration is a progressive strangler. Start by proving trust around a narrow, high-value business flow, then expand.

Step 1: Pick a business-critical value stream

Choose something painful enough to matter and bounded enough to finish. Order-to-cash is common. Claims-to-payment in insurance is another. Trade capture to settlement in banking. Procure-to-pay in manufacturing.

Avoid “customer 360” as the first target. It is too broad, too political, and too semantically slippery.

Step 2: Preserve source semantics

Ingest legacy extracts, CDC feeds, and domain events without flattening away identifiers, timestamps, or status values. If the old ERP has ugly but meaningful lifecycle codes, keep them. You can translate later. You cannot reconstruct lost semantics.

Step 3: Build domain data products before enterprise marts

Model around bounded contexts first. This sounds slower; it is actually faster because teams stop arguing about imaginary canonical universes.

Step 4: Add reconciliation beside existing reporting

Do not replace official reports on day one. Run the new reconciled products in parallel. Compare old and new. Expose discrepancies. This is where architecture earns political capital.

Step 5: Cut over by certification domain

Retire legacy reports incrementally, not all at once. Finance close might move after three successful periods. Operations reporting might move earlier. Some domains will remain hybrid for years. That is normal.

Here is the migration shape.

[Diagram: migration shape, progressive cutover by certification domain]

Migration reasoning

The strangler pattern works here because trust is empirical. You do not win it with architecture principles. You win it by showing that the new path can explain differences, survive close cycles, and reduce investigation effort.

A good migration KPI is not just latency or query performance. It is things like:

  • days to close
  • number of manual reconciliation spreadsheets
  • unresolved metric disputes
  • time to root-cause discrepancy
  • percentage of certified datasets consumed
  • count of legacy interfaces retired safely

Those are the business signs of trust.

Enterprise Example

Consider a global retailer modernizing its commerce and finance reporting.

The retailer has:

  • a legacy ERP for invoicing and general ledger
  • an e-commerce platform emitting order events
  • a warehouse management system for shipments
  • a payment gateway with settlement files
  • regional returns systems
  • a central data platform built on Kafka and cloud warehouse technology

Leadership wants near-real-time sales reporting and a single version of truth for revenue.

The first attempt goes badly. The platform team builds a canonical “sales fact” table fed by order events, payment captures, and invoice extracts. It looks elegant. It also produces endless disputes.

Why?

Because “sale” meant different things:

  • e-commerce treated order placement as sale intent
  • payments treated authorization as commercial commitment
  • finance recognized sale only after invoicing and delivery rules
  • returns restated net sales days later
  • regional tax treatment changed gross vs net totals

The table was tidy but dishonest.

The second attempt was better because it embraced domain semantics.

The retailer created separate domain data products:

  • Commerce Orders
  • Fulfillment Dispatches
  • Billing Invoices
  • Payments Captures and Settlements
  • Returns Accepted
  • Finance Revenue Recognition

Then they introduced reconciliation controls:

  • order count vs invoice count by region and day
  • payment captured total vs settlement total by processor batch
  • dispatch-to-invoice lag thresholds
  • returns offset logic by accounting period
  • net revenue certification after close adjustments

Operational dashboards were allowed to use preliminary order-based views with explicit labels. Finance used only certified revenue products. Regional analysts could inspect discrepancy stores to see why yesterday’s “sales” looked different depending on purpose.

The result was not one universal number. It was something better: a small set of numbers, each with known semantics, certification state, and reconciliation evidence.

The CFO stopped asking which dashboard was correct. That is architectural success.

Operational Considerations

A reconciliation architecture becomes real in operations, not in the deck.

Metadata matters

Capture source system, extract batch, event offset, processing version, schema version, business effective date, ingestion timestamp, and restatement markers. Without this metadata, discrepancy analysis turns into archaeology.
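
That field list can be pinned down as a provenance record attached to every ingested batch or event. A sketch under the assumption that dates and timestamps are carried as ISO strings; the dataclass name and fields simply mirror the list above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Provenance:
    source_system: str                    # e.g. "erp-eu"
    extract_batch: str                    # batch or file identifier
    event_offset: Optional[int]           # stream position, if applicable
    processing_version: str               # pipeline code version
    schema_version: str
    business_effective_date: str          # when the fact is true for the business
    ingestion_ts: str                     # when the platform received it
    restated_from: Optional[str] = None   # restatement marker

p = Provenance("erp-eu", "batch-42", 1001, "v3.2", "s7",
               "2024-06-01", "2024-06-02T01:00:00Z")
```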

Late data is normal

Treat late-arriving events and corrections as first-class. Design windows, watermark policies, and restatement rules explicitly. Otherwise teams silently overwrite history and call it freshness.
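
One explicit restatement policy, sketched under the assumption of ISO date strings (which compare correctly as text) and a watermark representing the last certified business date: anything landing behind the watermark is appended with a visible restatement marker, never overwritten in place.

```python
def apply_correction(history, correction, watermark_date):
    """Append-only: late corrections become marked restatements, not overwrites."""
    if correction["business_date"] <= watermark_date:
        correction = {**correction, "restatement": True}
    history.append(correction)
    return history

ledger = []
# A refund arrives for a date that is already certified:
apply_correction(ledger, {"business_date": "2024-05-30", "amount": -50.0},
                 watermark_date="2024-05-31")
assert ledger[-1]["restatement"] is True
```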

Exception workflows need ownership

A discrepancy store without operational workflow is just a graveyard. Route exceptions to domain owners. Some mismatches belong to source system defects; some belong to platform transformation logic; some are accepted business timing differences.

Observability must include business controls

Technical monitoring is not enough. Track:

  • reconciliation pass/fail rates
  • certification delay
  • unmatched entity volumes
  • aggregate variance trends
  • restatement frequency
  • rule execution latency

The health of the platform should include semantic health, not just CPU and queue lag.

Security and compliance

Reconciliation often crosses regulated boundaries: finance controls, customer data, payment information, health records. Design access carefully. Investigators may need evidence without direct access to sensitive payloads. Masking, tokenization, and role-based access are not optional.

Tradeoffs

No worthwhile architecture comes free.

More explicit semantics means more modeling effort

Bounded contexts, certification states, and reconciliation rules require real domain work. This is slower than dumping everything into a warehouse and calling it democratization. It is also much more likely to succeed.

Reconciliation can increase latency

If you wait for matching, settlement, or period controls, some data products arrive later. That may frustrate teams used to streaming everything instantly. The answer is not to remove controls. The answer is to publish different products for different certainty levels.

You will duplicate some logic

The enterprise often needs both operational and financial interpretations of similar facts. Purists hate this. Real businesses need it. Some duplication is the price of honesty.

Control planes become products in their own right

Once you centralize certification and discrepancy management, you now own another platform capability. It needs APIs, dashboards, alerting, rule management, and governance. This is operationally heavier than ad hoc SQL checks, but far more sustainable.

Failure Modes

Most trust architectures fail in recognizable ways.

The fake canonical model

Teams invent a universal business schema too early. Everyone maps to it loosely. Semantic edge cases pile up. Downstream consumers think consistency exists when it does not.

Reconciliation without domain ownership

Platform teams build controls, but no business owner accepts responsibility for discrepancy resolution. The exception queue grows, trust shrinks, and people return to spreadsheets.

Treating Kafka as sufficient audit evidence

Offsets and topics are useful. They are not a complete accounting control framework. If you cannot tie certified outputs back to source facts and business periods, streaming elegance will not save you.

Backfills that rewrite history invisibly

An engineer reruns a pipeline. Last quarter’s totals shift. No restatement marker exists. Executives notice before the platform team does. This is how reputations end.

Using only aggregate checks

Aggregate checks can pass while entity matching fails catastrophically. Ten duplicate invoices and ten missing invoices can cancel out numerically. Totals are necessary, not sufficient.
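
The cancellation effect is easy to demonstrate with synthetic invoice IDs: the counts agree, yet entity matching exposes both defects.

```python
# Ten duplicated invoices and ten missing invoices cancel out in a count check:
source = [f"INV-{i}" for i in range(100)]          # 100 distinct invoices
target = ([f"INV-{i}" for i in range(90)]
          + [f"INV-{i}" for i in range(10)])       # 10 missing, 10 duplicated

assert len(source) == len(target)                  # aggregate check: passes
assert len(set(source) - set(target)) == 10        # entity check: 10 missing
assert len(target) - len(set(target)) == 10        # entity check: 10 duplicates
```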

Ignoring timing semantics

Many “mismatches” are just process timing differences. If architecture does not model expected lag and business effective dates, teams chase ghosts.

When Not To Use

This approach is not mandatory for every data problem.

Do not build a full reconciliation architecture if:

  • you are running a small internal product with one operational system and low reporting risk
  • data is exploratory and not used for financial, regulatory, or operational decision rights
  • the domain is changing too quickly to stabilize semantics yet
  • the cost of formal certification outweighs the impact of occasional inconsistencies
  • there is no organizational appetite to assign domain ownership for exceptions

In these cases, lighter patterns may be enough: basic data quality checks, lineage, and simple source-aligned marts.

But once decisions affect money, compliance, customer commitments, inventory, or external reporting, trust becomes expensive to fake. That is the point where reconciliation architecture earns its keep.

Several adjacent patterns matter here.

Data mesh, with discipline

Data mesh gets one thing very right: domains should own their data products. But mesh without reconciliation becomes federated inconsistency. Domain ownership must be paired with cross-domain proof where enterprise measures matter.

Event sourcing

Event sourcing can improve traceability for certain operational domains. It helps preserve business history and temporal correctness. But event sourcing alone does not solve cross-domain semantic alignment.

CQRS

Separating operational write models from read models is often healthy. In data platforms, it reinforces the idea that reporting views are purposeful projections, not raw truth dumps.

Ledger and double-entry patterns

For financial domains, ledger-based modeling is often superior to status-based tables. Reconciliation becomes more robust when changes are additive, immutable, and auditable.

Master data and reference data management

Identity resolution and shared reference data help, especially for customer, product, and organizational hierarchies. But MDM is not a substitute for reconciliation. Shared IDs do not guarantee shared meaning.

Strangler fig migration

This is the right migration pattern for trust platforms because it allows parallel proving, progressive cutover, and retirement by confidence rather than by calendar promise.

Summary

The hardest part of a data platform is not storage, compute, or even integration. It is trust.

Trust appears when business semantics are respected, lineage is preserved, discrepancies are made visible, and cross-domain views are certified rather than merely assembled. That means adopting domain-driven design thinking in the data platform, resisting premature canonical models, and treating reconciliation as a first-class architectural capability.

Kafka helps. Microservices help. Lakehouses help. None of them solve the problem by themselves.

The real work is humbler and more valuable: preserve domain facts, reconcile them deliberately, expose certification state, and migrate progressively through a strangler approach that proves correctness in parallel with the old world.

Enterprises do not need one magic source of truth. They need a system that can explain why truths differ, which version is fit for purpose, and when a number is safe to trust.

That is what good reconciliation architecture buys you.

And in the end, that is what a data platform is for.

Frequently Asked Questions

What is a data mesh?

A data mesh is a decentralized data architecture where domain teams own and serve their data as products. Instead of a central data team, each domain is responsible for data quality, contracts, and discoverability.

What is a data product in architecture terms?

A data product is a self-contained, discoverable, trustworthy dataset exposed by a domain team. It has defined ownership, SLAs, documentation, and versioning — treated like a software product rather than an ETL output.

How does data mesh relate to enterprise architecture?

Data mesh aligns data ownership with business domain boundaries — the same boundaries used in domain-driven design and ArchiMate capability maps. Enterprise architects play a key role in defining the federated governance model that prevents data mesh from becoming data chaos.