Most AI platform programs fail in a surprisingly ordinary way.
Not because the models are weak. Not because GPUs are expensive. Not even because the data scientists picked the wrong framework. They fail because the enterprise treats AI as an application problem when it is, stubbornly and repeatedly, a data architecture problem first.
That sounds obvious. It isn’t. Large organizations still launch “AI platforms” as if they were shiny product layers hovering above the estate: a notebook environment here, a feature store there, some model registry, a vector database, a generous amount of Kubernetes, and a budget line called innovation. Six months later, the platform exists, the demos work, and the business still cannot trust the output. The hard part was never spinning up ML infrastructure. The hard part was understanding what the business means by a customer, an order, a policy, a shipment, a claim, a part, a price, a risk event. AI only amplifies the quality of that understanding. If the semantics are muddy, the model will industrialize confusion.
That is why pipeline topology matters. Not as a diagramming exercise. Not as a technology inventory. Pipeline topology is where the enterprise reveals what it believes about time, ownership, truth, latency, and correction. It shows whether data is treated as a product, as an exhaust stream, or as a political battleground. And when you are building AI platforms, those beliefs become operational constraints.
The essential point is simple: AI platforms should be designed as domain-aligned data systems with explicit semantics, governed flows, reconciliation paths, and migration strategies that respect the existing estate. Everything else is implementation detail.
Context
Enterprises are under pressure to industrialize AI. That pressure comes from everywhere at once: customer support copilots, fraud detection, predictive maintenance, underwriting assistants, document extraction, pricing optimization, internal search, and increasingly agentic workflows that promise to orchestrate work across systems.
So organizations react in the way organizations react. They create an AI platform team.
The team is asked to provide standard tooling for data ingestion, feature engineering, experimentation, training, deployment, observability, guardrails, and governance. This is reasonable. Standardization matters. Shared infrastructure matters. Reuse matters.
But there is a trap here. A platform team can standardize plumbing without improving meaning. It can create consistency at the infrastructure layer while business entities remain incoherent across domains. In that world, “AI platform” becomes a thin layer of central tools sitting on top of fragmented semantics. The result is predictable: duplicate pipelines, endless mapping logic, feature drift, contested definitions, and models that are technically deployable but organizationally untrustworthy.
The modern stack makes this easier to hide. Kafka can move events quickly. Microservices can package logic neatly. Lakehouses can store vast amounts of data cheaply. Vector stores can make retrieval look magical. None of these solve the primary problem. They just move it around faster.
A healthy architecture starts from a less glamorous premise: before you build model pipelines, you must decide where business truth is created, how domain events are expressed, how corrections propagate, how historical versions are preserved, and who owns semantic definitions. This is domain-driven design territory, not because DDD is fashionable, but because AI consumes domain meaning at scale.
Problem
Most AI platform architectures collapse three very different concerns into one:
- Operational transaction processing
- Analytical and historical interpretation
- AI-oriented data preparation and inference context
That collapse creates pipelines that look efficient on slides and behave badly in production.
An order service emits “OrderCreated.” A CRM exports “customer” records. An ERP publishes “invoice” files nightly. A product team builds embeddings over support articles. A fraud model consumes account behavior. A recommendation system wants real-time interaction streams. Somewhere in the middle, a central platform team creates canonical tables and hopes for the best.
The problem is not that these sources differ. The problem is that they differ in meaning, timing, and authority.
A customer in marketing is often a prospect. A customer in billing is a legal account. A customer in service is a contact context. A customer in identity is a principal. If your platform treats these as one thing because the schema says customer_id, your AI architecture is already lying.
This is where many pipeline topologies go wrong. They optimize for movement instead of interpretation. They ask, “How do we get all data into the platform?” when the better question is, “What business event or state transition does this data represent, and who has the authority to define it?”
Without that discipline, AI systems inherit all the classic data platform pathologies:
- training-serving skew
- stale features
- duplicate entity resolution
- hidden batch dependencies
- brittle orchestration
- non-reproducible datasets
- policy violations across jurisdictional boundaries
- impossible root-cause analysis when outputs are challenged
In other words, the platform works right up until somebody important asks, “Why did the model say that?”
Forces
There are several forces pulling the architecture in different directions.
1. Domain autonomy versus enterprise consistency
Business domains want control. They should. A claims domain understands claims better than a central platform team ever will. But the enterprise also needs consistency for cross-domain AI use cases: fraud, churn, pricing, forecasting, risk, planning. You cannot centralize semantics without losing local nuance. You cannot fully decentralize without creating chaos.
This is the classic DDD tension. Bounded contexts matter. So do translation layers.
2. Event speed versus reconciled truth
Kafka and event streaming are excellent for propagating changes and enabling reactive systems. But streams are not automatically truth. Events arrive out of order, get replayed, are missing context, or represent provisional states later corrected by batch processes. Fast data is seductive. Reconciled data is what the auditors, regulators, and senior operators care about.
3. Local optimization versus platform operability
Every team can build a pipeline suited to its needs. Left unchecked, they will. The result is a museum of bespoke jobs, hidden transformations, hand-maintained mapping tables, and heroic tribal knowledge. Operability requires standard patterns: contracts, lineage, replay, quality gates, observability, and retention.
4. Historical fidelity versus simplified consumption
Data scientists want broad, easy access. Business users want understandable outputs. Engineers want minimal duplication. Meanwhile, good AI requires preserving temporal reality: what was known, when it was known, and under what assumptions. Flattening everything into “current state” is a common architectural sin.
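The "what was known, when it was known" requirement is essentially bitemporal record keeping. A minimal sketch, with illustrative names and a simplified two-timestamp model: each fact carries both the date it was true in the business and the date the platform learned it, so a training set built "as of" some date never leaks later restatements.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class BitemporalFact:
    entity_id: str
    value: float
    valid_from: date   # when the fact became true in the business
    recorded_on: date  # when the platform learned about it

def as_known_on(facts, entity_id, knowledge_date):
    """Return the latest value for an entity as it was known on a given date.

    Corrections recorded after `knowledge_date` are ignored, so a dataset
    built for that date reflects what decision-makers could actually see."""
    visible = [f for f in facts
               if f.entity_id == entity_id and f.recorded_on <= knowledge_date]
    if not visible:
        return None
    return max(visible, key=lambda f: (f.valid_from, f.recorded_on)).value

# An invoice amount posted in January, then restated in March.
facts = [
    BitemporalFact("inv-1", 100.0, date(2024, 1, 31), date(2024, 2, 1)),
    BitemporalFact("inv-1", 90.0,  date(2024, 1, 31), date(2024, 3, 15)),
]

assert as_known_on(facts, "inv-1", date(2024, 2, 28)) == 100.0  # pre-restatement
assert as_known_on(facts, "inv-1", date(2024, 3, 31)) == 90.0   # post-restatement
```

Flattening to "current state" would keep only the 90.0 row and make the February view unreproducible.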
5. Central governance versus product delivery speed
Governance has a bad reputation because it often arrives as committee theater. But in AI, governance is not paperwork. It is the mechanism for controlling data rights, PII usage, model provenance, policy boundaries, and explainability. Too much centralization kills delivery. Too little governance creates an expensive incident waiting to happen.
Solution
The strongest pattern I have seen is this: design the AI platform as a set of domain-aligned data products and pipeline topologies with explicit semantic contracts, separated into distinct layers of authority.
Not one giant data swamp. Not one omniscient canonical model. Not a mesh without standards. And not a central AI team that quietly becomes the integration department.
The architecture should distinguish four things clearly:
- Source-of-record operational domains
- Event and change propagation topology
- Reconciled analytical history
- AI-ready serving contexts for specific use cases
That separation is the difference between a durable platform and a pile of pipelines.
A practical principle helps: model data where meaning is born, not where it becomes convenient.
If the underwriting domain defines risk classifications, let that domain publish risk events and state changes in its own bounded context. If finance owns invoice finalization, do not infer “finalized” from a downstream warehouse job. If identity owns authentication events, do not let five teams create slightly different user-session abstractions.
Then build translation, reconciliation, and projection deliberately.
A layered semantic topology
The pipeline topology I recommend has three major semantic transitions:
- Operational semantics: the meaning as expressed by the domain system
- Reconciled enterprise semantics: conformed, time-aware, corrected views used for analysis and control
- Use-case semantics: AI-ready projections shaped for training or inference
These are not cosmetic layers. They exist because different consumers need different truths.
A fraud model may need low-latency behavioral events with probabilistic identity linkage. A finance forecast model may require month-end reconciled facts. A support copilot may need document chunks, customer entitlement state, and active case context. Forcing them into one universal shape is how architectures become both rigid and unreliable.
Notice what is not happening here. We are not feeding models directly from operational systems and hoping observability will save us. We are not assuming Kafka topics are analytically consumable. We are not pretending all AI workloads have the same latency and correctness needs.
Architecture
Let’s make this concrete.
1. Domain-owned source products
Each core business domain publishes data products from its bounded context. In DDD terms, the bounded context is where language is coherent and business rules make sense together. That is where semantics should be authored.
These products can take several forms:
- event streams on Kafka
- CDC feeds from transactional stores
- snapshot APIs
- periodic files where necessary, though usually as a migration compromise
The key is not the transport. The key is the contract.
A claims domain might publish:
- ClaimRegistered
- ClaimAssessmentCompleted
- ClaimPaymentAuthorized
- ClaimClosed
- ClaimReopened
Those are better than dumping a giant mutable claims table into a lake and calling it platform-ready. Events carry intent and time. They tell the business story.
Still, events alone are not enough. Domain products should include:
- schema contracts
- semantic definitions
- quality expectations
- ownership metadata
- sensitivity classification
- retention and replay rules
This is where many microservices estates disappoint. Teams publish events, but not meaning. An event named CustomerUpdated that can mean “address changed,” “privacy preference updated,” or “duplicate record merged” is not a domain contract. It is a cry for help.
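What "publish meaning, not just events" looks like in practice: a contract that pairs each event with a semantic definition, an owner, and a sensitivity classification, and a consumer that refuses events without one. This is a minimal sketch; the event names, fields, and team names are illustrative, not a standard.

```python
from dataclasses import dataclass
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = "public"
    PII = "pii"

@dataclass(frozen=True)
class EventContract:
    name: str
    meaning: str            # human-readable semantic definition
    owner: str              # accountable domain team
    sensitivity: Sensitivity
    schema_version: str

# Instead of one vague CustomerUpdated, publish intent-carrying events,
# each with its own contract.
CONTRACTS = {
    "CustomerAddressChanged": EventContract(
        name="CustomerAddressChanged",
        meaning="The legal correspondence address of a billing account changed.",
        owner="customer-servicing",
        sensitivity=Sensitivity.PII,
        schema_version="2.1",
    ),
    "CustomerRecordsMerged": EventContract(
        name="CustomerRecordsMerged",
        meaning="Two customer records were identified as duplicates and merged; "
                "the surviving identifier is authoritative.",
        owner="customer-servicing",
        sensitivity=Sensitivity.PII,
        schema_version="1.0",
    ),
}

def validate_event(event_name: str) -> EventContract:
    """Reject events that have no published contract."""
    if event_name not in CONTRACTS:
        raise ValueError(f"No contract for event {event_name!r}; refusing to consume.")
    return CONTRACTS[event_name]
```

The point is not the data structure. It is that "CustomerUpdated" cannot pass this gate, because nobody wrote down what it means.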
2. Immutable raw capture
Store raw events and change history immutably. Always.
Why? Because AI asks uncomfortable retrospective questions. You will need to reproduce training sets, investigate model outputs, reprocess after bug fixes, and compare old semantics to new semantics. If your platform only keeps transformed current-state tables, you have thrown away the evidence.
Raw history is not for direct business consumption. It is for lineage, replay, auditability, and recovery. Think of it as the black box recorder of the platform.
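The black-box-recorder idea can be sketched in a few lines: an append-only capture that freezes events at write time, plus a replay path used to rebuild any downstream projection. This is an in-memory stand-in for immutable object storage, with illustrative event names.

```python
import json

class RawCapture:
    """Append-only capture of domain events. Nothing is mutated or deleted;
    downstream products are rebuilt by replaying from here."""

    def __init__(self):
        self._log = []  # stand-in for immutable object storage

    def append(self, event: dict) -> int:
        self._log.append(json.dumps(event, sort_keys=True))  # frozen at write time
        return len(self._log) - 1  # offset, usable in lineage references

    def replay(self):
        """Yield every event in arrival order, e.g. to rebuild a projection
        after a transformation bug fix or a semantic change."""
        for raw in self._log:
            yield json.loads(raw)

# Rebuild a current-state projection purely from raw history.
capture = RawCapture()
capture.append({"type": "ClaimRegistered", "claim_id": "c1", "status": "open"})
capture.append({"type": "ClaimClosed", "claim_id": "c1", "status": "closed"})

state = {}
for event in capture.replay():
    state[event["claim_id"]] = event["status"]

assert state == {"c1": "closed"}
```

If only the `state` table had been kept, there would be no way to prove the claim was ever open, or to rebuild `state` under new rules.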
3. Reconciliation and conformance
This is the layer people underestimate, and it is usually where enterprise reality reappears with a baseball bat.
Reconciliation is where you deal with:
- late-arriving events
- duplicate entities
- corrected transactions
- conflicting identifiers
- reference data alignment
- legal restatements
- regional policy differences
- survivorship rules
This layer is not glamorous, but it is where trust is built.
An AI platform that skips reconciliation will create endless disputes between operational teams and data teams. The model says one thing, finance says another, customer operations says the customer record is wrong, and nobody can trace the semantic path.
Reconciliation should be explicit and versioned. If identity matching rules change, that should create a new semantic version of the reconciled product. If financial postings are restated, downstream consumers should know whether they are using provisional or final facts.
That last point matters. Reconciliation should not become a silent downstream patch factory. If the platform is repeatedly correcting broken source semantics, the architecture is rotting. Push defects back to the source domain where possible.
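A reconciled product that makes finality and rule versions explicit might look like the following sketch. The feeds, field names, and version label are illustrative; the point is that provisional and final facts are labeled, never silently merged.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReconciledPayment:
    payment_id: str
    amount: float
    finality: str          # "provisional" or "final"
    ruleset_version: str   # which matching/survivorship rules produced this row

def reconcile(claims_feed, finance_feed, ruleset_version="matching-v3"):
    """Join claims-side authorizations with finance-side settlements.

    A payment is 'final' only when finance has settled it; until then the
    claims-side figure is published as provisional."""
    settled = {p["payment_id"]: p for p in finance_feed}
    out = []
    for auth in claims_feed:
        if auth["payment_id"] in settled:
            row = settled[auth["payment_id"]]
            out.append(ReconciledPayment(row["payment_id"], row["amount"],
                                         "final", ruleset_version))
        else:
            out.append(ReconciledPayment(auth["payment_id"], auth["amount"],
                                         "provisional", ruleset_version))
    return out

claims = [{"payment_id": "p1", "amount": 500.0}, {"payment_id": "p2", "amount": 80.0}]
finance = [{"payment_id": "p1", "amount": 495.0}]  # finance settled p1 at a different figure

rows = reconcile(claims, finance)
assert [(r.finality, r.amount) for r in rows] == [("final", 495.0), ("provisional", 80.0)]
```

When the matching rules change, a new `ruleset_version` is published instead of overwriting rows, so consumers can tell which semantics they trained against.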
4. AI-ready context products
The final layer consists of products tailored to AI use cases.
Not every AI workload wants “features” in the traditional tabular sense. A support agent may need:
- customer state
- entitlement level
- open case timeline
- recent product telemetry
- document retrieval context
- policy constraints on generated responses
A demand forecast may need:
- corrected order history
- promotional calendar
- inventory state
- lead time distributions
- store closures
- weather enrichment
An underwriting assistant may need:
- structured applicant history
- document-derived entities
- fraud flags
- policy rules
- explanation lineage
These are not all the same shape. That is exactly the point. The enterprise platform should provide standardized ways to build and govern them, not force them into one universal storage abstraction.
5. Online and offline parity
If the same logic defines customer risk score features for both training and inference, the architecture should not duplicate that logic in two unrelated pipelines. This is where feature stores are useful, though often oversold. The valuable part is not the product label. It is the discipline of consistent definitions, point-in-time correctness, and serving parity.
For generative AI, the equivalent concern is context parity. If the runtime prompt assembles entitlement rules differently from the evaluation or test environment, quality will drift and nobody will know why.
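The parity discipline reduces to something simple: one authoritative definition of each feature, parameterized by an `as_of` timestamp, used unchanged by both the training pipeline and the serving path. A minimal sketch with an illustrative feature:

```python
from datetime import datetime, timedelta

def claims_last_90_days(claim_dates, as_of):
    """Single authoritative feature definition: number of claims in the
    90 days before `as_of`. Training passes a historical `as_of` per label;
    serving passes the current timestamp. Same code path, no skew."""
    window_start = as_of - timedelta(days=90)
    return sum(1 for d in claim_dates if window_start <= d < as_of)

history = [datetime(2024, 1, 10), datetime(2024, 3, 1), datetime(2024, 6, 20)]

# Training: point-in-time correct value for a label dated 2024-04-01.
train_value = claims_last_90_days(history, datetime(2024, 4, 1))
# Serving: the same function, evaluated at request time.
serve_value = claims_last_90_days(history, datetime(2024, 7, 1))

assert train_value == 2  # Jan 10 and Mar 1 fall in that window
assert serve_value == 1  # only Jun 20 does
```

The anti-pattern is two implementations of this window, one in SQL for training and one in application code for serving, drifting apart a month after launch.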
6. Metadata, lineage, and policy as first-class architecture
In AI platforms, metadata is not administration. It is control.
You need lineage from source domain event to reconciled product to training dataset to model version to inference endpoint. You need policy evaluation over PII, retention, region, lawful basis, and usage rights. You need semantic catalogs that explain not just columns but business meaning.
If this sounds heavyweight, good. Enterprises confuse heavy with unnecessary. In reality, the lack of this machinery is what makes systems brittle and political.
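The lineage requirement is concrete: from any model or inference endpoint, you should be able to walk upstream to the source domain events. A toy sketch, with hypothetical artifact names, of a lineage graph and an upstream trace:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageNode:
    artifact: str     # e.g. "topic:claims.events" or "model:fraud-v7"
    produced_by: str  # pipeline or job that created it
    inputs: tuple     # upstream artifacts it was built from

def trace(nodes, artifact):
    """Walk lineage upstream from an artifact back to its sources."""
    by_name = {n.artifact: n for n in nodes}
    seen, stack = [], [artifact]
    while stack:
        current = stack.pop()
        seen.append(current)
        node = by_name.get(current)
        if node:
            stack.extend(node.inputs)
    return seen

# Illustrative chain: domain event -> reconciled product -> dataset -> model.
graph = [
    LineageNode("product:reconciled-claims-v4", "reconciliation-v4",
                ("topic:claims.events",)),
    LineageNode("dataset:fraud-train-2024Q2", "training-job-118",
                ("product:reconciled-claims-v4",)),
    LineageNode("model:fraud-v7", "training-job-118",
                ("dataset:fraud-train-2024Q2",)),
]

path = trace(graph, "model:fraud-v7")
assert path[-1] == "topic:claims.events"  # every output traces to a source event
```

When someone important asks "why did the model say that?", this walk is the first half of the answer; the semantic catalog entry for each node is the second.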
Migration Strategy
No serious enterprise gets to start clean. The estate already exists: warehouses, MDM hubs, ESBs that nobody loves but everybody still needs, file transfers, scheduled extracts, APIs with suspicious SLAs, and microservices that publish events of wildly varying quality.
So the migration strategy must be progressive. A strangler approach is the sensible path.
Do not begin by replacing the entire data platform. Begin by introducing a new semantic topology beside the old one, use it for one or two high-value AI use cases, and expand as confidence grows.
Step 1: Pick a high-value bounded context
Choose a domain with clear ownership and urgent demand: claims fraud, service operations, pricing, document automation. Avoid cross-enterprise “360” programs as your first move. They look strategic and die in committee.
Step 2: Capture raw changes without disrupting sources
Use CDC, event subscriptions, and existing extracts if necessary. The point is to establish immutable historical capture first. You can improve semantics later, but you cannot recover time if you never captured it.
Step 3: Define domain contracts with business language
Work with domain owners to clarify event meaning, state transitions, identifiers, and authority. This is where DDD earns its keep. Ubiquitous language is not a workshop artifact; it is the basis for trustworthy AI input.
Step 4: Build reconciliation for one use case
Do not try to reconcile the enterprise in one release. Reconcile only the facts needed for the selected use case. Establish patterns for quality checks, correction handling, and semantic versioning.
Step 5: Publish AI-ready products
Create a training dataset and serving context product from the reconciled layer. Measure not just model performance, but dispute rate: how often business stakeholders challenge the data basis of outputs.
Step 6: Strangle legacy pipelines gradually
As confidence builds, route more consumers to the new reconciled products and retire old ETL chains. Keep a period of dual-run where outputs are compared and explained.
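The dual-run comparison in that step can itself be mechanized. A minimal sketch, assuming both pipelines emit keyed rows with a comparable amount field (names and tolerance are illustrative): every divergence must be explained, as a legacy defect, a new defect, or an intentional semantic change, before the old chain is retired.

```python
def dual_run_diff(legacy_rows, new_rows, key="id", tolerance=0.01):
    """Compare legacy and replacement pipeline outputs during cutover.

    Returns the keys whose values diverge beyond tolerance, including rows
    that exist in only one pipeline."""
    legacy = {r[key]: r["amount"] for r in legacy_rows}
    new = {r[key]: r["amount"] for r in new_rows}
    diverging = []
    for k in sorted(set(legacy) | set(new)):
        a, b = legacy.get(k), new.get(k)
        if a is None or b is None or abs(a - b) > tolerance:
            diverging.append(k)
    return diverging

old_output = [{"id": "c1", "amount": 100.0}, {"id": "c2", "amount": 55.0}]
new_output = [{"id": "c1", "amount": 100.0}, {"id": "c2", "amount": 61.0}]

assert dual_run_diff(old_output, new_output) == ["c2"]  # explain c2 before cutover
```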
Reconciliation during migration
Migration is where reconciliation becomes painfully practical. Legacy systems often contain hidden compensations: hand-maintained adjustment tables, end-of-month corrections, unspoken survivorship rules in BI logic, operational workarounds embedded in spreadsheets. If you rebuild the pipeline naively, the new platform may be “cleaner” and less correct.
The right approach is to make these compensations visible, classify them, and decide which ones are:
- source defects to fix upstream
- enterprise policies to encode centrally
- reporting artifacts to retire
- temporary bridge logic during transition
This is tedious work. It is also the work that separates architecture from drawing software.
Enterprise Example
Consider a multinational insurer building an AI platform for claims operations.
They want three things at once:
- fraud detection on incoming claims
- a claims handler copilot
- portfolio-level loss forecasting
The first instinct is often to create a centralized AI lake with policy, customer, claim, payment, and call center data ingested from everywhere.
That would be a mistake.
The insurer’s claims domain, policy administration domain, customer servicing domain, and finance domain all define key entities differently. A “claim” can be registered, assessed, reserved, litigated, reopened, subrogated, and settled. Those transitions matter. A customer may be an individual, household, broker-represented party, claimant, or policyholder. Finance does not recognize a payment the same way claims operations does. If you flatten all of that prematurely, your fraud model and copilot will make contradictory assertions.
A better architecture is domain-aligned.
- Claims publishes domain events for registration, assessment, reserve changes, handler assignment, payment authorization, closure, and reopening.
- Policy administration publishes coverage and endorsement state transitions.
- Customer servicing publishes contact and communication preference events.
- Finance publishes settled payment facts and restatements.
- Document processing publishes extracted entities with confidence scores, not false certainty.
Kafka is used for event propagation where real-time matters: claim intake, fraud scoring triggers, assignment changes. Historical raw capture lands in immutable storage. Reconciliation builds a conformed claim timeline, linked to policy coverage status and payment finalization status. Identity resolution is probabilistic and versioned because household and claimant relationships are messy.
The fraud model consumes near-real-time behavioral and claim context projections, accepting some provisional truth. The forecasting model consumes reconciled monthly facts only. The claims copilot uses retrieval over policy documents and claim notes, but grounds its responses in reconciled entitlement and claim-state products, not raw note interpretation.
This architecture does not eliminate disagreement. It localizes it. Fraud operations understand they are using provisional, high-velocity data. Finance understands it owns final payment truth. Claims handlers know the copilot’s current claim state comes from the claims domain timeline, with traceable provenance.
That is an enterprise architecture win. Not perfection. Clarity.
Operational Considerations
A few practical concerns deserve blunt treatment.
Data quality must be executable
Dashboards about data quality are not enough. Quality checks should gate promotions between layers. Missing mandatory identifiers, impossible state transitions, broken reference mappings, and policy violations should trigger quarantine, alerts, and issue workflows.
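"Executable" means the checks run inline and gate promotion, not that they feed a dashboard. A minimal sketch, with an illustrative claim lifecycle: records failing a check are quarantined with a reason instead of flowing silently into reconciled or AI-ready products.

```python
VALID_TRANSITIONS = {  # illustrative claim lifecycle, not a real rulebook
    ("registered", "assessed"), ("assessed", "authorized"),
    ("authorized", "closed"), ("closed", "reopened"), ("reopened", "assessed"),
}

def quality_gate(records):
    """Gate promotion to the next layer.

    Returns (promoted, quarantined); quarantined rows carry a reason that
    can drive alerts and issue workflows."""
    promoted, quarantined = [], []
    for r in records:
        if not r.get("claim_id"):
            quarantined.append((r, "missing mandatory identifier"))
        elif (r["from_status"], r["to_status"]) not in VALID_TRANSITIONS:
            quarantined.append((r, "impossible state transition"))
        else:
            promoted.append(r)
    return promoted, quarantined

batch = [
    {"claim_id": "c1", "from_status": "registered", "to_status": "assessed"},
    {"claim_id": "",   "from_status": "registered", "to_status": "assessed"},
    {"claim_id": "c3", "from_status": "closed",     "to_status": "authorized"},
]

promoted, quarantined = quality_gate(batch)
assert len(promoted) == 1
assert [reason for _, reason in quarantined] == [
    "missing mandatory identifier", "impossible state transition"]
```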
Replay is non-negotiable
If you cannot replay streams and rebuild downstream products, your platform is fragile. AI platforms change more often than traditional reporting stacks. New features, new prompt assembly logic, new entity rules, new compliance requirements—reprocessing is routine.
Backfills are architecture, not operations
The moment you start training on historical data, backfills become a first-class concern. Design for them explicitly. Otherwise, every historical rebuild becomes a dangerous one-off.
Cost discipline matters
Streaming everything at low latency is not architecture sophistication. It is often just expensive indecision. Use streaming where business timing requires it. Use batch where it is sufficient and more robust. Plenty of AI workloads are perfectly well served by hourly or daily refresh with strong reconciliation.
Governance needs technical enforcement
Policies about PII, residency, retention, and model usage should be enforced through metadata, access control, data product contracts, and runtime checks. A PDF governance document is not a control.
Tradeoffs
There is no free architecture.
A domain-aligned semantic topology brings clarity and trust, but it also introduces complexity:
- more explicit contracts
- more metadata work
- more upfront domain modeling
- more negotiation about ownership
- more investment in reconciliation
Teams that want instant self-service may complain that this is slower. Sometimes it is. In the short term.
The alternative is fake speed: rapidly producing AI systems that later stall under disputes, rework, and compliance concerns.
There are also tradeoffs in topology choices:
Event-first architecture
- Strong for responsiveness and decoupling
- Weak when teams mistake event streams for complete business truth
Warehouse-first architecture
- Strong for historical analytics and control
- Weak for low-latency operational AI and real-time context assembly
Feature-store-centric design
- Strong for consistent ML features
- Weak if it becomes another central abstraction detached from domain ownership
Data mesh style decentralization
- Strong for domain accountability
- Weak if standards, interoperability, and platform capabilities are too thin
My view is opinionated: combine domain ownership with a strong central platform for contracts, lineage, reconciliation tooling, and policy enforcement. Decentralize semantics to the people who understand them; centralize the machinery needed to make those semantics operable at enterprise scale.
Failure Modes
Architectures fail in recognizable ways. Watch for these.
1. The canonical model fantasy
A central team tries to define one universal enterprise model for customer, order, policy, product, and interaction. It looks elegant and dies in endless exception handling. Bounded contexts exist for a reason.
2. Event theater
Teams publish Kafka topics and declare victory. The topics are poorly defined, inconsistently versioned, and semantically vague. Downstream teams rebuild business meaning through guesswork.
3. Hidden reconciliation
Business-critical corrections happen silently in notebooks, BI layers, or one-off ETL jobs. Nobody can explain why model outputs differ from source systems.
4. Raw-to-model shortcuts
Pressure for speed leads teams to train directly from raw ingested data. It works for a pilot. It fails in production when source quirks, missing corrections, or identifier churn cause instability.
5. Central platform overreach
The platform team begins owning business transformation logic because domain teams are slow or inconsistent. Short-term delivery improves. Long-term accountability collapses.
6. Governance by afterthought
Privacy, retention, residency, and usage rights are bolted on after the first successful use case. Retrofitting controls later is painful and politically costly.
When Not To Use
This approach is not always the right answer.
Do not build this level of architecture if:
- your use case is narrow, low-risk, and can tolerate manual data preparation
- the domain is immature and semantics are changing weekly
- the organization lacks even basic data ownership and operational discipline
- you are a small company with a handful of systems and no serious regulatory burden
- the AI use case is experimental and not yet tied to business operations
In those situations, lighter-weight pipelines may be entirely sensible. Architecture should solve the problem you have, not the conference talk you admired.
But once AI outputs affect customer decisions, financial controls, regulated processes, or operational workflows, casual data architecture stops being charming. It becomes negligence.
Related Patterns
Several related patterns complement this approach.
- Bounded Contexts and Context Mapping from domain-driven design for semantic clarity
- CQRS and event-driven projections for separating write models from read and AI-serving models
- Data products for ownership and discoverability
- CDC-based migration for low-disruption capture from legacy systems
- Strangler Fig migration for incremental replacement of legacy data flows
- Medallion-style layering, if used carefully, though I prefer speaking in terms of semantic transitions rather than metallic metaphors
- Feature stores and retrieval pipelines for online/offline consistency
- Data contracts for controlled evolution of source products
- Master/reference data services where shared identifiers and codes must be governed centrally
The caution with related patterns is always the same: do not adopt them as badges. Use them to enforce semantic discipline and operational resilience.
Summary
If you remember one thing, remember this: AI platform architecture is data architecture first because AI scales semantics, not just computation.
The platform must begin where business meaning is created, not where data becomes easy to ingest. It must distinguish operational truth from reconciled truth and both from AI-serving context. It must treat domain contracts, reconciliation, lineage, and policy as core design elements, not optional governance decorations. It must migrate progressively, with strangler patterns and dual-run comparisons, because enterprises do not get to reboot themselves.
Kafka helps. Microservices help. Lakehouses help. Feature stores help. Vector databases help. None of them rescue a platform that does not know what its business entities mean or how corrections flow through time.
Good AI architecture is not a tower built above the enterprise. It is a set of well-governed pipelines rooted in the reality of the business.
That reality is messy. Build for it anyway.