Most teams begin AI deployment in the wrong room.
They walk into the model room first. They debate GPT versus a fine-tuned classifier, CPU versus GPU, hosted API versus self-managed inference, latency benchmarks, quantization tricks, prompt templates, vector databases, guardrails, and budget curves. All of that matters. But in enterprise systems, it is rarely the first-order problem.
The first-order problem is routing.
Who gets sent to which model? With what context? Under what policy? With which SLA, lineage, and cost boundary? What happens when the fast model disagrees with the slow one? When a request crosses jurisdictions? When customer support asks why two identical-looking cases got different answers? When the model endpoint is healthy but the business outcome is wrong because the wrong data arrived at the wrong stage?
That is the real architecture problem. AI inference, in the enterprise, is less like calling a clever library and more like managing traffic in a city that keeps changing shape under your feet. Roads are added. Some neighborhoods require inspection. Others forbid heavy vehicles. Ambulances get priority. School zones slow everything down. The question is not “which car is best?” The question is “how does traffic move safely through the city?”
That is inference topology.
An inference topology is the deliberate design of how requests, data, context, policies, models, and decisions flow through a system. It treats model invocation as a routing concern grounded in domain semantics, operational constraints, and business accountability. This is where domain-driven design turns from nice whiteboard language into practical engineering. The route a request takes should reflect the domain’s meaning, not the convenience of the model API.
If you miss that, you do not merely get technical debt. You get semantic debt: decisions that are technically executed but business-wise incoherent.
Context
The first generation of enterprise AI deployments usually followed a simple pattern: bolt an LLM onto an application, pass in some text, and return an answer. It worked well enough for demos, internal copilots, and narrowly scoped use cases. Then reality arrived.
Different products needed different quality levels. Some requests were cheap enough for a small model; others needed a frontier model. Certain customers required data residency. Some interactions needed retrieval-augmented generation. Others needed deterministic rules before any model could be called. Customer-facing responses demanded moderation and audit trails. High-volume back-office workflows needed throughput and cost predictability more than eloquence.
Soon the architecture became a patchwork of direct model calls spread across services. Teams hard-coded provider choices in business logic. Prompt construction leaked into application layers. Retry logic was duplicated. Cost controls were inconsistent. Shadow traffic for experimentation was painful. Governance was retrofitted after production incidents.
This is the predictable result of treating inference as an implementation detail instead of a routing system.
A mature enterprise architecture sees something else. It sees that AI requests are just another category of business work moving across a topology: entering with intent, enriched with context, classified by policy, dispatched to capability, reconciled against outcomes, and observed for drift and failure. In other words, inference belongs in the same architectural family as order routing, payment orchestration, case assignment, and event processing.
That framing changes design choices.
Problem
The core problem is not “how do I deploy a model?” It is “how do I direct a stream of heterogeneous business requests to the right inference path while preserving domain meaning, operational control, and economic sanity?”
That sounds abstract until you see what gets mixed together in real systems:
- multiple model providers
- multiple model sizes and capabilities
- retrieval versus no retrieval
- synchronous versus asynchronous processing
- deterministic policy checks before or after inference
- cost tiers by customer segment
- regional routing for compliance
- confidence thresholds and fallback chains
- human review queues
- reconciliation against downstream truth
- experimentation, shadowing, and gradual rollout
If these concerns are scattered across microservices, product teams end up with brittle coupling. The service that “just needed summarization” now also knows about provider failover, PII redaction, legal hold policy, premium customer SLAs, token budgets, and retry behavior. That service is no longer doing customer support or claims intake or fraud triage. It is accidentally becoming an inference router.
And accidental routers are dangerous.
They hide policy in code, multiply failure modes, and make migration expensive because every service embeds its own assumptions. Worse, they fracture the ubiquitous language. One team says “classification,” another says “triage,” another says “priority prediction,” yet all are routing the same business concept under different names and thresholds. The software still runs. The enterprise stops agreeing with itself.
Forces
Several forces shape inference topology, and they pull in opposite directions.
Domain semantics versus technical convenience
Domain-driven design matters here because not every request is just “a prompt.” A claim adjudication suggestion is not the same thing as a customer email draft. A suspicious transaction explanation is not equivalent to a generated product description. Each carries different meaning, risk, and downstream consequences.
If you flatten everything into generic inference calls, you erase business semantics. Once erased, policy becomes hard to enforce. You cannot cleanly distinguish “recommendation,” “decision support,” and “automated action.” The route should depend on those semantics.
Latency versus quality
The best model is often too slow or too expensive to use everywhere. The fastest model may not be accurate enough for high-risk flows. Enterprises therefore need tiered inference paths: quick triage on a smaller model, escalation to a stronger model for ambiguous cases, perhaps human review for the residue.
This is not unlike fraud systems: most transactions are cleared cheaply; only a minority deserve expensive scrutiny.
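That escalation ladder can be sketched as a small routing function. The model callables, thresholds, and field names below are hypothetical stand-ins, not a prescribed interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class InferenceResult:
    answer: str
    confidence: float
    route: str  # which tier produced the answer

def tiered_infer(
    request: str,
    fast_model: Callable[[str], InferenceResult],
    strong_model: Callable[[str], InferenceResult],
    escalate_below: float = 0.80,   # illustrative threshold
    human_below: float = 0.60,      # illustrative threshold
) -> InferenceResult:
    """Cheap triage first; escalate only the ambiguous residue."""
    result = fast_model(request)
    if result.confidence >= escalate_below:
        return result  # cleared cheaply, like most transactions
    result = strong_model(request)
    if result.confidence >= human_below:
        return result  # expensive scrutiny for the minority
    # The residue goes to people, and the route says so explicitly.
    return InferenceResult(result.answer, result.confidence, "human_review")
```

The key design choice is that the route taken is part of the result, so downstream audit and reconciliation can see which tier answered.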
Cost versus consistency
One global endpoint is simple. It is also financially reckless. Routing low-value tasks to premium models is an invisible tax that becomes very visible on the cloud bill. Conversely, aggressive cost optimization can produce inconsistent outcomes if routing thresholds are unstable or context is poor.
Central governance versus team autonomy
A central inference gateway can standardize policy, tracing, security, and routing. It can also become a bureaucratic choke point or a monolith in disguise. Give every team total freedom, and you get innovation along with chaos. The right answer is usually a federated model: central policy and shared routing primitives, with orchestration at the edge owned by each bounded context.
Synchronous flows versus event-driven resilience
User-facing interactions often need synchronous responses. Batch enrichment, document processing, and decision support pipelines usually benefit from asynchronous event-driven patterns. Kafka becomes relevant because many AI workflows are not single calls; they are streams of inference work with retries, dead-lettering, versioned schemas, and reconciliation against later truth.
Determinism versus probabilistic behavior
Enterprises are built on auditability. AI is probabilistic. That tension never disappears; you manage it. Good architecture wraps probabilistic outputs inside deterministic process boundaries: policy checks, thresholds, idempotent command handling, immutable event logs, and explicit status transitions.
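A minimal sketch of that wrapping, using a hypothetical claims suggestion with illustrative thresholds: the model only ever proposes; a deterministic gate owns the status transition.

```python
from dataclasses import dataclass, replace
from enum import Enum

class Status(Enum):
    PROPOSED = "proposed"    # raw model output is never final
    APPROVED = "approved"    # passed deterministic policy
    ESCALATED = "escalated"  # explicit handoff, not a silent failure

@dataclass(frozen=True)
class Suggestion:
    claim_amount: float
    confidence: float
    status: Status = Status.PROPOSED

def apply_policy(
    s: Suggestion,
    min_confidence: float = 0.85,   # illustrative policy values
    max_auto_amount: float = 1000.0,
) -> Suggestion:
    """Deterministic boundary: the probabilistic output cannot transition itself."""
    if s.confidence >= min_confidence and s.claim_amount <= max_auto_amount:
        return replace(s, status=Status.APPROVED)
    return replace(s, status=Status.ESCALATED)
```

Because the suggestion is immutable and transitions go through one function, every status change is a reviewable, auditable event rather than an in-place mutation.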
Solution
The solution is to model inference as a routing layer with explicit topology, not as ad hoc model calls buried in services.
That means creating a first-class architectural capability responsible for:
- understanding request intent in domain terms
- enriching requests with the right business context
- applying policy and compliance filters
- selecting an inference path
- executing model calls and retrieval steps
- handling fallback, retries, and escalation
- reconciling outputs with downstream truth
- emitting events, telemetry, and lineage
Call it an inference router, AI orchestration layer, model gateway, or decision fabric. The name matters less than the design principle: separate domain intent from model execution while keeping the domain semantics visible.
The best implementations follow a few rules.
First, route by business meaning, not by provider feature. “Claims severity estimation” should be a distinct capability with its own contract, thresholds, and audit path. It should not leak “use provider X with temperature Y.”
Second, make topology explicit. Requests should move across named stages: intake, enrichment, policy, dispatch, post-processing, reconciliation. Hidden hops create hidden accountability gaps.
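The named stages can be made literal in code. This sketch uses toy stage functions invented for illustration; the point is that every hop is recorded, so none are hidden:

```python
from typing import Callable

Stage = Callable[[dict], dict]

def run_pipeline(request: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Move a request across named stages, recording every hop for lineage."""
    request = dict(request, hops=[])
    for name, stage in stages:
        request = stage(request)
        request["hops"].append(name)  # no hidden hops, no hidden accountability
    return request

# Hypothetical stage implementations, named after the topology above.
def intake(r): return dict(r, intent=r["raw"].strip().lower())
def enrichment(r): return dict(r, context=["claim metadata"])
def policy(r): return dict(r, allowed=True)
def dispatch(r): return dict(r, answer="model output")

result = run_pipeline(
    {"raw": " Summarize Complaint "},
    [("intake", intake), ("enrichment", enrichment),
     ("policy", policy), ("dispatch", dispatch)],
)
```

A real system would make each stage a service or a topic, but the discipline is the same: stages have names, and the trace is the list of names traversed.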
Third, preserve event history. If the system suggested an action, escalated to human review, or switched models due to latency, that story must be reconstructable. Enterprises do not only need outputs. They need narratives they can defend.
Fourth, build for migration. You will not get the topology right in one shot. Providers change. Models decay. New regulations appear. Teams discover better domain boundaries. The architecture should allow routes to be strangled and replaced progressively.
Architecture
At a high level, the architecture separates business services from inference execution through an explicit routing plane.
That separation hides an important point: the inference router is not the domain. The domain service still owns the business workflow. The router owns model path selection and execution mechanics. That boundary is crucial.
In domain-driven design terms, the business capability lives in bounded contexts such as Claims, Customer Support, Fraud, or Supply Chain. Each context defines its own ubiquitous language: what a “recommendation” means, what confidence threshold is acceptable, what data can be used, and whether the output is advisory or operational. The inference router is a supporting subdomain or platform capability. It should not absorb business rules that belong in the domain.
A practical contract might look like this in spirit:
- Intent: classify_document_for_claims
- RiskLevel: medium
- CustomerTier: enterprise
- Jurisdiction: EU
- SLA: 1200ms
- InputArtifactRef: document ID
- AllowedDataScopes: claim metadata, policy text
- OutcomeType: recommendation
That contract says more than a prompt ever will. It gives the router enough semantic structure to choose a route.
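As a sketch, the same contract can be expressed as a typed structure the router dispatches on. The field names, route string format, and artifact ID below are assumptions for illustration, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceContract:
    intent: str
    risk_level: str
    customer_tier: str
    jurisdiction: str
    sla_ms: int
    input_artifact_ref: str
    allowed_data_scopes: tuple[str, ...]
    outcome_type: str

def select_route(c: InferenceContract) -> str:
    """Route on the semantics the contract carries, not on provider features."""
    region = "eu" if c.jurisdiction == "EU" else "global"
    tier = "strong" if c.risk_level in ("high", "medium") else "fast"
    return f"{region}/{tier}/{c.intent}"

contract = InferenceContract(
    intent="classify_document_for_claims",
    risk_level="medium",
    customer_tier="enterprise",
    jurisdiction="EU",
    sla_ms=1200,
    input_artifact_ref="doc-123",  # hypothetical artifact ID
    allowed_data_scopes=("claim metadata", "policy text"),
    outcome_type="recommendation",
)
```

Notice that nothing in the contract names a provider, a model, or a temperature; those are the router's private concerns.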
Domain semantics discussion
This is where many teams either become serious architects or remain prompt plumbers.
A request should carry domain semantics that influence routing. Consider three superficially similar tasks:
- summarize customer complaint
- determine whether complaint meets regulatory escalation criteria
- draft a response to the complaint
All involve text. All may use language models. But they are not interchangeable.
The first is an informational condensation task. The second is a compliance-sensitive classification task with legal consequences. The third is customer communication generation with tone and brand implications. They deserve different models, different controls, and possibly different operational paths. If you collapse them into “LLM call,” you have already lost the architecture.
This is why ubiquitous language matters. Name the capabilities according to business meaning, then map them to routing policies. The architecture should reflect the business taxonomy of inference.
Event-driven inference topology
For many enterprise use cases, Kafka or an equivalent event backbone is the right spine. Not because event-driven systems are fashionable, but because inference work often benefits from durable queues, replay, backpressure handling, and asynchronous reconciliation.
This pattern earns its keep when the truth arrives later. A fraud suspicion score, a support intent classifier, or a claims severity estimate can be compared against actual downstream outcomes. That later truth feeds reconciliation. Reconciliation is the adult supervision of AI systems. Without it, the routing layer slowly drifts into superstition.
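A rough sketch of the consumption side, with an in-memory `queue.Queue` standing in for a Kafka topic. The retry, idempotency, and dead-letter mechanics are the point here, not the transport:

```python
import queue

def process_stream(work: "queue.Queue", handler, max_retries: int = 2):
    """Drain inference work with retries, dedup, and a dead-letter list."""
    results, dead_letter = [], []
    seen = set()  # idempotency by message key: duplicates are skipped
    while not work.empty():
        msg = work.get()
        if msg["key"] in seen:
            continue  # duplicate delivery is normal; handling it twice is not
        seen.add(msg["key"])
        for attempt in range(max_retries + 1):
            try:
                results.append((msg["key"], handler(msg["payload"])))
                break
            except Exception:
                if attempt == max_retries:
                    dead_letter.append(msg)  # park it durably, never lose it
    return results, dead_letter
```

With a real broker the `seen` set becomes keyed state and the dead-letter list becomes its own topic, but the contract is identical: duplicates are absorbed, failures are parked, and nothing silently vanishes.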
Reconciliation discussion
Reconciliation is not an optional reporting feature. It is a control loop.
The router chose a path. A small model handled most requests. A larger model saw the ambiguous ones. Human reviewers handled exceptions. Fine. But did the route produce the desired business result? Did “low-risk auto-approval recommendations” correlate with later chargebacks? Did support triage reduce handling time without increasing complaint reopen rates? Did the summary model omit facts that later mattered in legal review?
Enterprises should record three things:
- the route taken
- the inference output and confidence
- the eventual business truth or proxy outcome
Only then can routing policies be tuned intelligently. Reconciliation also reveals failure concentration. Perhaps the model is fine; the context enrichment is poor for one product line. Perhaps one region’s documents trigger OCR degradation. Perhaps a new business process changed the meaning of “priority” without the router being updated. The topology must be measurable against the business, not just against the model.
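Those three records are enough to start measuring routes against the business. A minimal sketch, assuming reconciliation records arrive as simple (route, prediction, confidence, truth) tuples:

```python
from collections import defaultdict

def reconcile(records):
    """Per-route agreement rate between inference output and later truth.

    records: iterable of (route, predicted, confidence, truth) tuples.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for route, predicted, _confidence, truth in records:
        totals[route] += 1
        if predicted == truth:
            hits[route] += 1
    # A route with decaying agreement is a routing-policy problem,
    # not necessarily a model problem.
    return {route: hits[route] / totals[route] for route in totals}
```

In practice the "truth" column arrives days or weeks later via a linking key, which is exactly why the event backbone and lineage matter: without them there is nothing to join on.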
Migration Strategy
No sensible enterprise should attempt a big-bang rewrite of all AI integrations into a grand inference platform. That is how architecture becomes theater.
Use a progressive strangler migration.
Start by identifying direct model calls scattered across services. Group them by business intent, not by technology stack. You may find ten “classification” implementations that are really three domain capabilities and seven copies of confusion.
Then introduce a thin routing facade in front of one or two high-value use cases. Do not build a universal control tower on day one. Build a small, opinionated slice that centralizes policy, observability, and route selection for a bounded context where inconsistency already hurts.
A typical sequence moves from shadowing, to partial cutover, to full migration.
Shadow traffic is your friend. Let legacy and new routes run in parallel for selected flows. Compare outputs, latency, cost, and downstream outcomes. Then cut over by tenant, geography, product line, or request type. This is classic strangler fig thinking applied to inference topology.
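The shadow phase can be as simple as running both routes on the same requests while serving only the legacy answer. The 95% agreement threshold below is an illustrative choice, not a recommendation:

```python
def shadow_compare(requests, legacy_route, new_route) -> float:
    """Run both routes on the same traffic; serve legacy, measure the new one."""
    agree = 0
    for r in requests:
        served = legacy_route(r)     # the user still sees the legacy answer
        candidate = new_route(r)     # the shadow result is only recorded
        if served == candidate:
            agree += 1
    return agree / len(requests)

def ready_to_cut_over(agreement: float, threshold: float = 0.95) -> bool:
    """Cut over a tenant, region, or request type only past the threshold."""
    return agreement >= threshold
```

A production version would also compare latency, cost, and downstream outcomes, and would segment agreement by tenant or geography so cutover can proceed slice by slice.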
A few migration rules are worth stating bluntly:
- Do not migrate by provider first. Migrate by capability.
- Do not expose raw prompts as stable contracts.
- Do not centralize all business logic in the router.
- Do not skip lineage and tracing because “we’ll add it later.”
The strangler pattern also applies inside the router itself. Start with policy and provider abstraction. Later add adaptive routing, confidence gating, and reconciliation-driven tuning. You are not building a cathedral. You are creating a road network while traffic is still moving.
Enterprise Example
Consider a large insurer handling claims across auto, property, and travel products in North America and Europe.
The insurer began with separate AI initiatives. Customer support used a hosted LLM for summarizing inbound emails. The claims intake team used a fine-tuned classifier to categorize uploaded documents. Fraud operations used another vendor for anomaly explanations. Each team moved quickly. Each embedded its own model calls inside microservices. Each had different redaction logic, different retry policy, different observability, and no common event lineage.
Then two things happened.
First, costs spiked. A premium model intended for complex adjuster assistance was inadvertently being used for routine inbound triage because one service had copied another’s integration. Second, compliance found that some EU-origin documents were occasionally processed through a non-EU endpoint during failover. Not often. Just enough to become a board-level conversation.
The insurer responded by introducing an inference topology around the Claims and Customer Communications bounded contexts.
Claims Intake defined explicit capabilities:
- document triage
- severity recommendation
- missing information extraction
- fraud signal explanation
Customer Communications defined separate capabilities:
- complaint summarization
- regulatory escalation screening
- response drafting
These were not model endpoints. They were business capabilities with contracts, data scope rules, and outcome types.
A central routing layer enforced geography, customer segment cost policies, redaction, and telemetry. Kafka topics carried asynchronous work for document-heavy flows. Synchronous APIs remained for customer-facing interactions with strict latency budgets. Smaller local models handled routine triage. Ambiguous or high-risk cases escalated to larger models. Regulatory escalation screening always passed through deterministic rules after model output, because a probabilistic answer was not allowed to be the final word.
Most importantly, they built reconciliation. Severity recommendations were compared later with actual handling effort and settlement patterns. Complaint escalation predictions were checked against compliance review outcomes. Routing policies changed based on those results.
The result was not magic. Latency improved for common flows. Costs became governable. Auditability became possible. And migration continued incrementally: one capability at a time, one region at a time. That is what good enterprise architecture looks like. Not purity. Control with motion.
Operational Considerations
An inference topology lives or dies by operations.
Observability
You need tracing across route stages, not just model metrics. A useful trace includes:
- domain intent
- route selected
- enrichment sources used
- model and version
- token usage
- latency by stage
- fallback invocations
- post-processing decisions
- reconciliation outcome later
If all you can see is endpoint p95, you are driving at night with the headlights off.
Versioning
Version models, prompts, routing policies, and contracts separately. They change at different rates. Conflating them makes rollback messy. A routing policy change may be riskier than a prompt tweak. Treat configuration as deployable, reviewable architecture.
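One way to keep those rates of change separate is to version each concern independently in a single deployable, reviewable config. Every name and version string here is hypothetical:

```python
# Each concern versions independently; rolling one back touches nothing else.
route_config = {
    "contract": {"name": "claims_severity", "version": "2.1"},
    "model":    {"id": "small-triage",      "version": "2024-11"},
    "prompt":   {"template_id": "sev-v7",   "version": "7"},
    "policy":   {"escalate_below": 0.8,     "version": "3"},
}

def rollback(config: dict, concern: str, previous: dict) -> dict:
    """Roll back exactly one concern; everything else stays pinned."""
    new = dict(config)
    new[concern] = previous
    return new
```

Because the config is data, it can go through the same review, diff, and rollback machinery as code, which is what "configuration as deployable architecture" means in practice.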
Data governance
Inference is a data movement problem as much as a computation problem. Track data residency, retention, masking, and access scopes. The router should know what context is permissible for each intent. This is where many “generic AI platforms” fail: they optimize invocation but neglect data semantics.
Resilience
Design for partial failure:
- provider outage
- vector store latency
- policy engine timeout
- malformed context payloads
- duplicate events in Kafka
- stale feature data
- dead-letter accumulation
Idempotency is essential for event-driven flows. So is clear degradation policy. Sometimes the right fallback is a smaller model. Sometimes it is a deterministic rules engine. Sometimes it is “defer and queue for human review.” Graceful degradation should be intentional, not improvised during an incident.
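An intentional degradation chain can be declared as ordered fallbacks, with every attempt recorded so the incident narrative survives. The chain entries below are hypothetical:

```python
def degrade(request, chain):
    """Try each fallback in declared order; every attempt is recorded.

    chain: ordered list of (name, handler) pairs, from preferred to last resort.
    """
    attempts = []
    for name, handler in chain:
        attempts.append(name)  # visible, auditable fallback trail
        try:
            return handler(request), attempts
        except Exception:
            continue  # degradation is declared here, not improvised mid-incident
    # The final fallback is explicit rather than an unhandled error.
    return "queued_for_human_review", attempts
```

The trail of attempts is what lets support explain why two similar requests got different answers: one cleared the primary route, the other rode the chain down.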
Security and abuse prevention
Prompt injection is only one threat. Also consider cross-tenant leakage, over-broad retrieval, poisoned context, quota abuse, and route manipulation through crafted inputs. The router is a control point. Use it.
Tradeoffs
This style of architecture is not free.
A dedicated routing layer adds another moving part. If done badly, it becomes a bottleneck and a political battleground. Teams may complain that central architecture is slowing delivery. Sometimes they will be right. There is a real tradeoff between standardization and flow.
It also introduces abstraction risk. A too-generic router can erase useful domain distinctions. The answer is not to avoid the router; it is to keep domain semantics at the contract boundary and avoid inventing a universal “inference request” blob that means everything and nothing.
Event-driven topologies improve resilience and auditability but can complicate user-facing flows and increase cognitive load. Synchronous orchestration is simpler for some tasks but brittle at scale. Hybrid systems are usually necessary, which means architects must tolerate some mess.
Reconciliation is powerful but expensive. It requires later truth data, linking keys, and discipline across systems. Some organizations want adaptive routing without paying the measurement tax. They are asking for astrology.
Failure Modes
A few failure modes appear repeatedly.
The generic AI gateway trap
A central team builds a gateway that standardizes provider access but ignores domain meaning. Soon every business service sends vaguely structured prompts. Governance exists, but semantics do not. This creates a neat technical facade over a conceptual swamp.
Policy leakage into product services
Teams bypass the router “temporarily” for one urgent use case. Soon policy checks, routing thresholds, and redaction logic are copied into product code. You now have dual control planes, and incidents will exploit the gap between them.
Over-centralization
The router begins to own business decisions that belong in bounded contexts. It becomes a giant coordination hub requiring every team to wait on a platform backlog. Inference routing should be centralized enough to govern, decentralized enough to preserve domain ownership.
No reconciliation loop
The system routes requests based on confidence scores and cost rules, but no one checks downstream truth. Routing quality decays quietly. Teams tweak prompts to chase symptoms.
Hidden fallback chains
A request times out on provider A, falls back to provider B, loses retrieval context, and returns a plausible but wrong answer. Because the fallbacks are hidden, support cannot explain the discrepancy. Every fallback should be visible and auditable.
When Not To Use
Do not build a full inference topology if your use case is a contained, low-risk internal productivity tool with a single model path and limited data sensitivity. A simple integration may be enough.
Do not introduce Kafka and asynchronous orchestration for a narrow interactive feature that only needs sub-second generation and has no meaningful reconciliation loop. You will buy complexity without payoff.
Do not centralize routing if your organization lacks even a basic domain model for AI use cases. If teams cannot agree on the difference between recommendation, automation, and analysis, the topology will mirror the confusion. First fix the language.
And do not over-engineer for provider portability if your real constraint is weak product fit. Abstraction cannot rescue a use case that does not produce business value.
Related Patterns
Several architecture patterns sit close to inference topology.
API Gateway and Backend for Frontend
Useful for channel concerns, but insufficient for domain-aware inference routing. They handle access and shaping, not semantic dispatch.
Service Mesh
Good for network-level traffic management, security, and observability. Not enough for business-level routing decisions about models, context, and policy.
Rules Engine
Often complementary. Deterministic rules are excellent for guardrails, eligibility, and post-inference validation. They should not be confused with the whole routing problem.
Saga and Process Manager
Relevant for long-running AI-assisted workflows, especially where asynchronous steps and human review are involved.
Event Sourcing
Helpful where reconstructability and audit trails matter, though not required everywhere. At minimum, immutable inference audit events are valuable.
Strangler Fig Pattern
Essential for migration. Replace direct model calls capability by capability, not all at once.
Anti-Corruption Layer
Very useful when wrapping external model providers so provider-specific semantics do not pollute domain contracts.
Summary
AI deployment in the enterprise is not primarily a model hosting challenge. It is a data routing challenge shaped by domain semantics, policy, economics, and operational reality.
That is why inference topology matters.
Treat model invocation as a first-class routing concern. Define capabilities in business language. Keep bounded contexts in charge of business meaning. Use a routing layer to enforce policy, select model paths, manage fallbacks, and emit lineage. Prefer progressive strangler migration over platform big-bang rewrites. Use Kafka and event-driven patterns where durability, replay, and reconciliation matter. Build the reconciliation loop early, because systems that cannot compare inference routes to business truth are systems that slowly stop learning.
The memorable line here is simple: the hard part of AI in the enterprise is not making models smart. It is making traffic sensible.
Do that well, and model choice becomes an optimization problem.
Do it badly, and every clever model you deploy will simply help the organization make inconsistent decisions faster.