Most enterprise AI programs begin with a flattering lie.
The lie is that the hard part is the model.
So teams spend months debating model providers, GPU capacity, prompt frameworks, vector databases, agent runtimes, and benchmark scores. They hold architecture reviews full of arrows pointing at LLMs as if intelligence itself were the center of gravity. But once these systems reach production, the truth arrives with very little ceremony: the model is often the easiest component to replace. The hard part is deciding what data reaches it, under what semantics, through which controls, with what timing guarantees, and how the result gets reconciled back into the business.
That is why an enterprise AI platform is, in practice, a data routing layer.
Not a chatbot shell. Not a prompt catalog. Not a magical “AI fabric.” A routing layer.
It sits between domains, policies, systems of record, event streams, inference endpoints, and user-facing workflows. Its job is not merely to call a model. Its job is to shape inference topology: where inference happens, what context is attached, which policies apply, how partial results are handled, and how decisions are observed, corrected, and audited.
This distinction matters because enterprises do not suffer from a shortage of models. They suffer from semantic drift, duplicated orchestration, uncontrolled data movement, and brittle coupling between business workflows and probabilistic services. In other words, they suffer from architecture problems.
And architecture problems don’t yield to better prompts.
What follows is an opinionated view: if you are building AI in a serious enterprise, think less about “AI apps” and more about inference topology. Think in domains. Think in flows. Think in event boundaries and policy enforcement points. Think in reconciliation paths for when the model is wrong, late, or unavailable. If you do that, your platform starts looking less like a collection of AI tools and more like what it really is: a controlled routing system for business meaning.
Context
The last decade trained enterprise architects to think in APIs, events, and product-aligned microservices. Systems were decomposed around business capabilities. Kafka became the backbone for streaming facts across domains. Teams learned, sometimes painfully, that a service boundary is not a deployment trick but a semantic commitment.
AI reopens that lesson.
The naive approach inserts model calls directly into channels: the web app calls the LLM, the claims app calls the classifier, the support bot calls retrieval and generation, the fraud engine calls a feature store and a model endpoint. Every product team builds some variation of context assembly, prompt management, fallback logic, content filtering, provider switching, and response logging. The result looks agile for about six months. Then it calcifies into a distributed mess of duplicated policy and inconsistent semantics.
One team defines “customer” as CRM profile plus billing history. Another includes recent support incidents. A third redacts fields differently. A fourth sends raw notes to a public model because “it was just a prototype.” You can guess what happens next: compliance panics, security inserts gateways after the fact, platform teams scramble to centralize controls, and the business wonders why every AI use case feels custom.
This is not new. We have seen the pattern before with integration platforms, service buses, API gateways, and event backbones. The names change; the gravity does not. Whenever a capability depends on moving information between domains under policy, topology becomes architecture.
AI simply raises the stakes because inference is probabilistic, context-hungry, and operationally expensive. A bad API call usually fails loudly. A bad AI route often succeeds quietly.
That is worse.
Problem
Most enterprise AI architectures are built around inference endpoints, but production reality is driven by routing decisions.
A customer service answer may need:
- customer identity from CRM
- policy coverage from an insurance system
- recent claim events from Kafka
- sensitive note redaction from a privacy service
- retrieval from a knowledge corpus
- a model selected by jurisdiction and cost tier
- post-processing by a rules engine
- human review if confidence is low
- result reconciliation back into a case management system
The business requirement is not “call model X.” The requirement is “produce an answer in a workflow, using governed context, under domain rules, with a recoverable audit trail.”
That is routing.
The same applies to document extraction, fraud triage, next-best-action, underwriting summarization, coding assistance, pricing recommendations, and internal knowledge assistants. In every case, the critical decisions are upstream and downstream of the model:
- what facts are assembled
- whether those facts are fresh enough
- how domain language is translated
- whether the use case requires synchronous or asynchronous inference
- who owns correction
- how outputs become business state
When teams ignore that, they create a familiar anti-pattern: AI orchestration embedded inside every application. The short-term benefit is speed. The long-term cost is semantic entropy.
This entropy shows up in several ugly ways:
- Context fragmentation
Different teams build different views of the same business entity.
- Policy inconsistency
PII redaction, retention rules, and provider restrictions vary by implementation.
- Coupling to model vendors
Provider-specific prompts, schemas, and failure handling leak into product code.
- No reconciliation path
AI outputs are accepted optimistically with no durable correction model.
- Observability without meaning
Metrics report latency and token counts, but not business correctness by domain outcome.
- Migration paralysis
Once model calls are woven into dozens of services, changing topology becomes a change program.
The platform problem is not “How do we expose AI to teams?” It is “How do we route data and decisions through inference in a way that preserves domain integrity?”
Forces
Enterprise architecture is the art of surviving conflicting truths. AI adds more of them.
1. Domain semantics versus centralized reuse
A platform team wants consistency. Domain teams want control. Both are right.
A claims domain understands the difference between first notice of loss, adjudication, reserve adjustment, and settlement exception. A central AI team does not. But domain teams should not each reinvent provider routing, redaction, caching, and fallback controls. The right split is neither centralizing everything nor federating everything. It is centralizing routing capabilities while keeping domain semantics local.
That is textbook domain-driven design, though people often forget the “domain” part when the topic becomes AI.
2. Synchronous user expectations versus asynchronous enterprise reality
Users expect instant answers. Enterprises run on eventually consistent workflows.
Some inference belongs in the request path: search augmentation, chat response drafting, agent assistance. Other inference should be event-driven: claim document extraction, policy anomaly detection, case summarization, downstream enrichment. Treating all AI as synchronous creates latency, cost, and resilience problems. Treating all AI as async degrades user experience.
Topology matters because different workloads need different routes.
3. Cost versus quality
The best model is rarely the one you should call every time.
You may want a small model for triage, a larger model for escalation, deterministic rules for known paths, and human review for edge cases. This is not a model decision. It is a routing policy decision informed by business value.
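As a sketch, a routing policy like this fits in a few lines. Everything here — the tier names, thresholds, and the `RouteDecision` shape — is illustrative, not a real platform API:

```python
from dataclasses import dataclass

@dataclass
class RouteDecision:
    target: str   # which inference route to take
    reason: str

def select_route(intent: str, risk_score: float, confidence_needed: float) -> RouteDecision:
    """Pick a route by business value, not by 'best model'. Thresholds are placeholders."""
    if risk_score < 0.2:
        # Known, low-risk path: deterministic rules, no model call at all.
        return RouteDecision("rules-engine", "low risk, known path")
    if risk_score < 0.7:
        # Common case: small, cheap model for triage.
        return RouteDecision("small-model", "routine triage")
    if confidence_needed > 0.9:
        # High stakes and high required confidence: escalate to a human.
        return RouteDecision("human-review", "high risk, high required confidence")
    # Ambiguous but automatable: larger model for escalation.
    return RouteDecision("large-model", "escalation tier")
```

The point is that the decision tree is owned as policy, not buried inside each application's model call.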
4. Governance versus delivery speed
Security teams want approved providers, jurisdiction controls, data minimization, retention, and auditability. Delivery teams want velocity. If governance is bolted on later, it becomes a tax. If governance is built into routing, it becomes a path selection rule.
5. Freshness versus stability
Some inferences need live operational data. Others need curated snapshots to avoid nondeterministic prompts and replay issues. Route live data everywhere and you lose reproducibility. Route only snapshots and you lose relevance.
6. Enterprise integration gravity
Most serious AI use cases touch Kafka, service APIs, identity systems, MDM, content stores, and systems of record. The architecture must respect the existing integration estate, not fantasize that an AI platform will replace it.
It won’t.
Solution
Treat the AI platform as an inference routing layer with explicit domain contracts.
That means the platform owns the mechanics of inference routing:
- policy enforcement
- provider and model selection
- prompt and tool execution infrastructure
- context assembly framework
- retrieval plumbing
- observability
- fallback and retry behavior
- asynchronous job handling
- response normalization
- audit and lineage
But it does not own business meaning. Domains own:
- entity definitions
- event semantics
- workflow boundaries
- acceptance criteria
- human review rules
- correction and reconciliation processes
- business KPI instrumentation
This separation is the difference between useful centralization and another platform that everyone bypasses.
A good inference routing layer behaves like a logistics network for meaning. It does not generate value by itself. It moves the right material to the right place under the right conditions. It knows which roads are allowed, which cargo needs special handling, and where customs checks occur. It does not pretend all packages are the same.
Core principles
1. Route by intent, not by model
Applications should ask for a business capability, not a provider-specific invocation.
Bad:
call-gpt4-with-this-prompt
Better:
- generate_claim_summary
- classify_fraud_signal
- draft_customer_response
Intent-based routing preserves the option to change models, prompts, tools, and policies without rewriting every caller.
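A minimal sketch of what that looks like at the call site. The `IntentRouter` class and the registered handler are hypothetical; the point is that callers name a business capability and never see a provider:

```python
from typing import Any, Callable

class IntentRouter:
    """Callers ask for a business capability; the platform binds it to a route."""
    def __init__(self) -> None:
        self._routes: dict[str, Callable[[dict], Any]] = {}

    def register(self, intent: str, handler: Callable[[dict], Any]) -> None:
        self._routes[intent] = handler

    def invoke(self, intent: str, context: dict) -> Any:
        if intent not in self._routes:
            raise KeyError(f"no route registered for intent '{intent}'")
        return self._routes[intent](context)

router = IntentRouter()
# The handler hides model choice, prompt, and provider behind the intent name.
router.register("generate_claim_summary", lambda ctx: f"summary of claim {ctx['claim_id']}")

result = router.invoke("generate_claim_summary", {"claim_id": "C-1042"})
```

Swapping the model behind `generate_claim_summary` now touches one registration, not every caller.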
2. Keep domain context products separate from platform mechanics
The claims domain should publish a context contract for “claim summary context.” The platform should know how to assemble, redact, cache, and route it. The platform should not invent claim semantics.
3. Make reconciliation a first-class path
Inference outputs must be correctable. If the model extracts a diagnosis code incorrectly or drafts a wrong explanation of benefits, there must be a workflow for review, correction, replay, and lineage.
4. Support multiple inference topologies
You need request/response, event-driven enrichment, batch scoring, streaming inference, human-in-the-loop escalation, and hybrid patterns. One topology will not fit every use case.
5. Measure business outcomes, not just technical throughput
Latency matters. Token cost matters. But business accuracy by domain scenario matters more.
Architecture
A practical architecture usually has five layers.
- Experience and process layer
Portals, agent desktops, digital channels, BPM/workflow tools, case management.
- Domain services and event backbone
Microservices, Kafka topics, domain APIs, systems of record.
- Inference routing layer
The AI platform proper: policy, orchestration, context assembly, provider abstraction, prompt/tool runtime, guardrails, observability.
- Knowledge and context sources
Document stores, search indexes, vector retrieval, feature stores, master data, policy repositories.
- Inference execution endpoints
LLMs, classifiers, embeddings, OCR, speech, custom ML services.
This is not an ESB revival with better marketing. The difference is in the contracts. An old-school integration bus often centralized business transformation logic until it became a bottleneck. A sound inference routing layer centralizes generic inference concerns while domain transformations remain close to the domain.
That boundary is everything.
Domain semantics discussion
If you ignore semantics, your AI platform becomes a very expensive string processor.
Take “customer.” In a bank, the retail customer domain, fraud domain, collections domain, and onboarding domain each have legitimate but different views. An onboarding assistant may need KYC status and document deficiencies. A collections assistant may need delinquency stage and hardship indicators. A fraud triage service may need device risk and linked-party analysis. These are not technical variations; they are bounded contexts.
So the platform should not expose one giant “customer context” endpoint. That way lies leakage, over-fetching, and accidental data exposure.
Instead, each domain should define context products with explicit semantics:
- OnboardingApplicantContext
- CollectionsAccountContext
- FraudCaseContext
The routing layer can then apply common mechanics to all of them:
- fetch and merge
- redact
- enrich
- cache
- route to suitable model
- normalize output
- emit audit event
This is domain-driven design with operational teeth.
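Here is a minimal sketch of those mechanics, with assumed field names and a deliberately crude redaction rule. The domain supplies the contract (which sources, which fields to redact); the platform executes it generically:

```python
def assemble_context(raw_sources: list[dict], redacted_fields: set[str]) -> dict:
    """Generic platform mechanics: merge domain sources, then redact by policy."""
    merged: dict = {}
    for source in raw_sources:      # fetch and merge
        merged.update(source)
    for field in redacted_fields:   # redact per the domain's contract
        if field in merged:
            merged[field] = "***REDACTED***"
    return merged

# The fraud domain owns this contract; the platform only executes it.
fraud_case_context = assemble_context(
    [{"case_id": "F-77", "device_risk": 0.83},
     {"linked_parties": 2, "ssn": "123-45-6789"}],
    redacted_fields={"ssn"},
)
```

Enrichment, caching, model routing, and audit emission would slot into the same pipeline as further generic steps.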
Inference topology patterns
At least four topologies appear repeatedly.
1. Inline assist
The user is waiting. Latency budgets are strict. The route must be fast, bounded, and often partial.
2. Event-driven enrichment
A domain event lands on Kafka. The routing layer enriches a case, document, or entity asynchronously.
3. Batch or backlog processing
Large document sets, historical backfills, nightly prioritization. Cheap models and resilient queues matter more than chat-like responsiveness.
4. Escalation topology
A cheap route handles the common case. A richer route or human review handles ambiguity.
A mature platform supports all four without making every application team solve them independently.
Kafka and microservices
Kafka is particularly useful when AI should follow business events rather than sit awkwardly in front of them.
Examples:
- ClaimSubmitted triggers document classification and summary generation.
- PaymentExceptionRaised triggers anomaly explanation.
- CustomerInteractionClosed triggers auto-summarization and next-best-action recommendation.
- ProductSpecUpdated triggers embedding refresh and retrieval index update.
The pattern is simple: domain services publish facts. The routing layer subscribes where AI enrichment is appropriate. It does not become the source of truth. It emits derived facts or recommendations back into the ecosystem, ideally on separate topics with clear semantics such as ClaimSummaryGenerated or FraudCaseTriageSuggested.
That separation matters. Generated output is not the same as approved business state.
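To make the topology concrete without standing up a broker, here is an in-memory stand-in for the Kafka pattern. Topic and field names are illustrative; a real implementation would use a Kafka client, but the shape is the same: consume a domain fact, enrich, and emit a derived fact on its own clearly named topic, marked as a suggestion rather than approved state:

```python
from collections import defaultdict

# In-memory stand-in for Kafka topics.
topics: dict[str, list[dict]] = defaultdict(list)

def publish(topic: str, event: dict) -> None:
    topics[topic].append(event)

def on_claim_submitted(event: dict) -> None:
    # The routing layer enriches asynchronously, then emits a *derived* fact.
    summary = f"auto-summary for claim {event['claim_id']}"  # model call elided
    publish("ClaimSummaryGenerated", {
        "claim_id": event["claim_id"],
        "summary": summary,
        "status": "suggested",   # generated output, not approved business state
    })

publish("ClaimSubmitted", {"claim_id": "C-9"})
for event in topics["ClaimSubmitted"]:
    on_claim_submitted(event)
```

Note that the routing layer never mutates the source topic; it only adds derived facts downstream.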
Migration Strategy
Most firms already have AI logic scattered across applications. So the migration strategy should be progressive strangler, not big bang.
Big bang rewrites are how architecture becomes theatre.
A better sequence looks like this:
Step 1: Identify duplicated inference mechanics
Find the repeated capabilities:
- provider SDK wrappers
- prompt templates
- PII redaction
- retry logic
- output parsing
- usage logging
- fallback between models
These are your first candidates for centralization.
Step 2: Introduce intent-based APIs
Wrap existing direct model calls with business-intent interfaces. Do not change domain behavior yet; simply hide provider details.
Step 3: Externalize policy and routing
Move model selection, redaction, provider restrictions, and prompt/tool configuration into the platform. Applications still call the same business intents, but mechanics become centralized.
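One hedged way to picture externalized policy: a declarative document the platform resolves per intent and jurisdiction. All keys and values below are invented for illustration:

```python
# Hypothetical policy document; in practice this would live in config, not code.
routing_policy = {
    "generate_claim_summary": {
        "allowed_providers": ["internal-llm", "approved-cloud"],
        "jurisdiction_overrides": {"EU": {"allowed_providers": ["internal-llm"]}},
        "redact": ["ssn", "medical_notes"],
        "fallback": "rules-engine",
    },
}

def resolve_policy(intent: str, jurisdiction: str) -> dict:
    """Applications keep calling the same intent; policy is resolved centrally."""
    base = dict(routing_policy[intent])
    override = base.get("jurisdiction_overrides", {}).get(jurisdiction, {})
    base.update(override)
    return base

eu_policy = resolve_policy("generate_claim_summary", "EU")
```

Changing a jurisdiction rule is now a policy edit, not a code change in every application.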
Step 4: Shift asynchronous use cases onto Kafka-driven flows
Document extraction, summarization after case closure, enrichment after event arrival—move these out of request threads and into durable event processing.
Step 5: Establish reconciliation
Create review queues, correction UIs, lineage IDs, replay support, and event models for accepted versus suggested outputs.
Step 6: Refactor domain context products
As teams gain confidence, replace ad hoc data gathering with explicit context contracts owned by domains.
Step 7: Retire direct model integrations
Only after policy, observability, and reconciliation are in place should teams remove direct provider dependencies from applications.
Reconciliation discussion
Reconciliation is where many AI architectures quietly fail.
A generated summary may be edited by a human. An extracted field may be corrected. A recommendation may be accepted, overridden, or ignored. If you overwrite the original state without lineage, you lose the ability to learn, audit, or replay. If you treat suggestions as final truth, you invite silent corruption.
A robust model uses at least three states:
- generated suggestion
- human or rule-validated acceptance
- committed business fact
The routing layer should assign correlation IDs and version context inputs. Domain workflows should record whether outputs were accepted or corrected. Kafka topics can then carry both generated and reconciled events.
This is especially important during migration. For a while, old and new paths may coexist. Reconciliation becomes the mechanism that lets you compare outcomes safely.
Enterprise Example
Consider a global insurer modernizing claims operations across property, auto, and travel lines.
They began, as many do, with local experiments. The call center had a summarization bot. Property claims had document extraction using one cloud provider. Auto had a fraud note classifier built by a data science team. Travel claims used another vendor entirely for multilingual response drafting. Every team was productive. Every team was also creating a future problem.
The same claimant data was routed differently by line of business. Some flows redacted medical details; others did not. Several use cases embedded provider-specific prompts inside microservices. There was no common lineage model. Case workers corrected AI outputs every day, but those corrections disappeared into screens rather than feeding improvement loops. Security found outbound data patterns they could not explain with confidence.
The insurer did not need “one model.” It needed one routing strategy.
What they built
They introduced an inference routing layer between claims applications and model providers.
The claims domain defined context products:
- FNOLContext
- ClaimDocumentContext
- AdjusterCaseContext
- FraudReferralContext
A Kafka backbone already existed, so they used events aggressively:
- ClaimOpened
- DocumentReceived
- CaseAssigned
- InvestigationRequested
- ClaimClosed
The routing layer subscribed to these events and triggered different inference topologies:
- OCR and extraction on DocumentReceived
- summarization on CaseAssigned
- triage recommendation on InvestigationRequested
- closure summary on ClaimClosed
For real-time agent assist, the call center UI invoked an intent like draft_claimant_explanation. The routing layer assembled only the context allowed for that jurisdiction, selected a lower-latency model for common interactions, and escalated to a stronger model only when confidence was low or the conversation involved policy interpretation.
Why it worked
Because domain semantics stayed with the claims teams.
The central platform team did not define what a reserve adjustment meant or which investigation reasons mattered. It handled redaction, provider switching, audit logs, prompt runtime, and normalization. Claims teams owned acceptance logic and review workflows.
The key operational improvement
They created a reconciliation model in the case management system:
- AI suggestion stored as suggestion
- adjuster edits captured separately
- approved case summary committed as business record
- acceptance/correction event emitted for analytics
Within six months, they could answer questions that had previously been impossible:
- Which use cases save handling time by claim type?
- Which jurisdictions require a different routing policy?
- Where are human corrections concentrated?
- Which providers create cost spikes without corresponding business value?
- Which context fields correlate with bad recommendations?
That is the kind of visibility enterprises actually need. Not “our prompt score improved by 12%.”
Operational Considerations
Operational excellence here is not just uptime. It is controlled meaning under production stress.
Observability
Track technical and business metrics together:
- latency by route and use case
- token and provider cost
- retry rates
- fallback frequency
- policy denials
- confidence distribution
- human correction rates
- business outcome deltas
A route that is cheap and fast but repeatedly corrected is not a good route.
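As a toy illustration of that point — with an entirely made-up weighting — a route score that counts human corrections against a route will rank a cheap-but-wrong route below a pricier-but-right one:

```python
def route_score(cost_per_call: float, latency_ms: float, correction_rate: float) -> float:
    """Lower is better. The weights are placeholders; the shape is what matters:
    a high human-correction rate must dominate cheapness."""
    return cost_per_call + latency_ms / 1000 + correction_rate * 10

cheap_but_wrong = route_score(cost_per_call=0.001, latency_ms=300, correction_rate=0.4)
pricier_but_right = route_score(cost_per_call=0.02, latency_ms=900, correction_rate=0.05)
```

Any real scoring function would be calibrated per use case, but it must include the business-correction signal, not just latency and tokens.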
Caching and replay
Context caching can reduce latency and cost, but stale context can poison outputs. Cache immutable or slow-changing enrichment aggressively. Be more careful with operational state.
Replay matters for incidents, audits, and migrations. Version prompts, tools, context schemas, and policy sets so you can explain why a result occurred.
Security and privacy
Put policy enforcement before provider invocation, not after. Redact or tokenize sensitive fields based on domain and jurisdiction. Keep provider-specific legal constraints outside application code.
Resilience
Models time out. Providers throttle. Retrieval stores go stale. Tool calls fail halfway through. Your routing layer needs:
- circuit breakers
- provider failover
- degraded modes
- asynchronous fallback
- dead-letter queues for event-driven flows
Graceful degradation is underrated. Sometimes the right answer is a rules-only response plus a message that richer analysis is pending.
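A minimal circuit-breaker sketch with a degraded fallback; the threshold and the rules-only response are placeholders, not tuned values:

```python
class CircuitBreaker:
    """Open the circuit after repeated failures; serve a degraded route instead."""
    def __init__(self, failure_threshold: int = 3) -> None:
        self.failures = 0
        self.failure_threshold = failure_threshold

    def call(self, primary, degraded):
        if self.failures >= self.failure_threshold:
            return degraded()          # circuit open: skip the provider entirely
        try:
            result = primary()
            self.failures = 0          # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            return degraded()

breaker = CircuitBreaker(failure_threshold=2)

def flaky_model():
    raise TimeoutError("provider throttled")

def rules_only():
    return "rules-only answer; richer analysis pending"

answers = [breaker.call(flaky_model, rules_only) for _ in range(3)]
```

After the second failure the third call never touches the provider; the caller still gets a bounded, honest response.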
Product operating model
Treat major intents as products with owners, SLAs, and domain stewards. “Case summarization” is not a prompt; it is an operational capability.
Tradeoffs
There is no free lunch here. Just better bills.
Central routing layer adds governance and consistency
Good.
It also adds another platform dependency
Also true.
If badly designed, it becomes a queue of requests waiting on a central team, or a pseudo-ESB that swallows domain logic. If over-governed, teams will route around it. If under-governed, it solves nothing.
Intent abstraction protects you from model churn
Good.
It may hide useful provider-specific features
Also true.
Sometimes a domain genuinely benefits from a specific tool-calling behavior or model capability. The answer is not to expose raw providers everywhere; it is to allow controlled escape hatches with explicit ownership.
Event-driven enrichment improves resilience
Good.
It introduces eventual consistency
Also true.
Some business users hate waiting for enrichment to arrive. You need clear workflow design and service expectations.
Reconciliation improves trust and learning
Good.
It adds process overhead
Also true.
Every correction loop requires UI, data design, and ownership. But skipping it simply moves the cost into operational confusion.
Failure Modes
Most platforms fail in predictable ways.
1. The platform team captures domain logic
Now every change to claims or onboarding semantics requires a central backlog ticket. Delivery slows. Domain teams bypass the platform.
2. The platform is only a thin provider proxy
You centralize SDK calls but not policy, lineage, context assembly, or reconciliation. You get little benefit and plenty of ceremony.
3. One giant context object
It seems convenient. It becomes a privacy risk and a semantic junk drawer.
4. No distinction between suggestion and fact
Generated output is written directly into systems of record. Corrections happen later, if ever. Trust collapses after the first incident.
5. Metrics without business truth
The dashboard celebrates low latency while operations teams quietly ignore the outputs.
6. Migration stalls at adapters
Teams wrap direct model calls but never move routing policy or reconciliation into the platform. You end up with a prettier version of the old mess.
7. Kafka topics become AI exhaust pipes
The routing layer emits poorly defined “AIResult” events with unclear ownership or semantics. Downstream consumers guess what they mean. That never ends well.
When Not To Use
This pattern is not universal.
Do not build a full inference routing layer when:
- you have a single low-risk use case with limited data sensitivity
- the application is isolated and unlikely to spread
- domain semantics are trivial
- there is no meaningful need for provider switching or audit
- a small team can own the end-to-end workflow without cross-enterprise reuse
A departmental prototype, an internal coding assistant for a single engineering group, or a one-off content generation tool may not deserve this machinery.
Also avoid it if your organization lacks even basic domain boundaries. If “customer data” is still a political argument rather than a managed concept, an AI routing platform will not fix your operating model. It will merely expose its weaknesses.
Architecture cannot compensate indefinitely for organizational ambiguity.
Related Patterns
A few patterns sit close to this one.
API Gateway
Useful at channel ingress, but too shallow for full inference orchestration. Good for authentication and routing, insufficient for context semantics and reconciliation.
Event-Driven Architecture
Essential for asynchronous inference and enrichment. Kafka is often the right backbone when AI should react to business events rather than hijack user requests.
Backend for Frontend
Helpful when AI experiences differ by channel, but it should call intent-based inference capabilities rather than own model orchestration itself.
Strangler Fig Migration
The right migration style for scattered AI integrations. Replace mechanics incrementally while preserving domain workflows.
Domain-Driven Design
Absolutely central. Bounded contexts should define inference inputs and acceptance semantics. Without DDD thinking, AI platforms become integration mud.
Human-in-the-Loop Workflow
Not a concession. A serious architectural component for high-risk or low-confidence routes.
Data Products
Relevant, but context products for inference need stronger attention to timeliness, redaction, and workflow semantics than many generic analytical data products provide.
Summary
Enterprise AI is not primarily a model problem. It is a routing problem shaped by semantics, policy, timing, and correction.
The winning architecture is not one where every application talks directly to a model, nor one where a central AI team hoards domain logic. It is one where domains define meaning and workflows, while a shared inference routing layer handles the mechanics of getting governed context to the right inference path and bringing outputs back safely.
Think in intents, not model calls.
Think in context products, not giant payloads.
Think in Kafka events where enrichment belongs off the request path.
Think in progressive strangler migration, not heroic rewrites.
Think in reconciliation, because generated output is not business truth.
If you do that, the platform becomes something useful and durable. Not an AI fashion accessory, but an operational backbone for probabilistic computing inside a real enterprise.
And that, in the end, is the point.
The model may be clever. The architecture must be wiser.