Validation is where architectural idealism goes to die.
On the whiteboard, microservices are clean: each service owns its data, each bounded context is crisp, each team moves independently. Then a real business rule arrives. _A claim cannot be approved unless the member is active, the provider is credentialed, the policy has remaining coverage, and the fraud score is below threshold._ Suddenly the boundaries that looked so elegant now feel like customs checkpoints on a crowded border.
This is not a minor inconvenience. Cross-service validation sits directly on the fault line between autonomy and consistency. Push too much validation into one place and you build a distributed monolith with HTTP instead of method calls. Push too little and you let invalid business actions leak into the system, leaving reconciliation teams to clean up the mess later. Every enterprise that adopts microservices eventually finds itself here.
The mistake is to treat validation as a technical plumbing concern. It is not. Validation is a domain concern wearing technical clothes. It expresses policy, authority, timing, ownership, and trust. In domain-driven design terms, the central question is not “which API should I call?” but “which bounded context is allowed to say this action is valid?” If you do not answer that clearly, your architecture will answer for you, usually with timeouts, duplicated logic, and a monthly incident review.
This article lays out a practical architecture for cross-service validation in microservices: when to validate synchronously, when to use local projections and asynchronous checks, how Kafka changes the shape of the solution, how to migrate from a monolith without detonating your operating model, and where this pattern fails. The goal is not theoretical purity. The goal is a system that works on a Tuesday afternoon when three downstream services are slow and the business still expects orders, claims, payments, and approvals to keep moving.
Context
In a monolith, validation tends to hide in plain sight. It lives in service classes, database constraints, workflow engines, or ugly-but-effective stored procedures. The transaction boundary is local. Reading five tables to make a decision is boring, and boring is good.
Microservices remove that comfort. Data is no longer local by default. Ownership matters. One service cannot simply peek into another service’s database without breaking the social and technical contract. And yet the business process still needs a single decision: can this operation proceed?
That creates a classic enterprise tension. The business thinks in end-to-end outcomes. The architecture thinks in bounded contexts. The validation rule straddles both.
Consider a retail order platform:
- Order Service owns the order lifecycle.
- Customer Service owns customer status and credit profile.
- Inventory Service owns stock availability.
- Pricing Service owns promotions and final price eligibility.
- Fraud Service owns risk decisions.
The business rule “accept order” depends on all of them. But it does not mean all of them jointly own the decision. That distinction matters more than most teams realize.
A good architecture starts by separating domain validation semantics into categories:
- Invariant validation
Rules that must hold within a single bounded context.
Example: an order cannot have a negative quantity.
- Reference validation
Checks that another entity exists or is in an acceptable state.
Example: customer exists and is active.
- Policy validation
Cross-context rules where one context depends on another context’s published policy or decision.
Example: fraud score must be below threshold.
- Temporal validation
Rules that are true only at a point in time and may later change.
Example: inventory was available at reservation time.
These are not the same problem, and they should not be solved the same way. Yet many implementations treat all validation as synchronous API orchestration. That is a tax on performance, resilience, and team autonomy.
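To make the distinction concrete, the four categories can be written down as data — often a useful first step before assigning each rule a validation mode. A minimal sketch; the rule names are illustrative, not taken from any real system:

```python
from dataclasses import dataclass
from enum import Enum, auto

class ValidationKind(Enum):
    INVARIANT = auto()  # must hold within a single bounded context
    REFERENCE = auto()  # another entity exists / is in an acceptable state
    POLICY = auto()     # depends on another context's published decision
    TEMPORAL = auto()   # true only at a point in time; may later change

@dataclass(frozen=True)
class Rule:
    name: str
    kind: ValidationKind

RULES = [
    Rule("quantity_non_negative", ValidationKind.INVARIANT),
    Rule("customer_exists_and_active", ValidationKind.REFERENCE),
    Rule("fraud_score_below_threshold", ValidationKind.POLICY),
    Rule("inventory_available_at_reservation", ValidationKind.TEMPORAL),
]
```

Classifying rules up front like this makes it harder to accidentally treat them all as synchronous API orchestration.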
Problem
Cross-service validation becomes hard because the enterprise wants three things at once:
- independent deployability
- strong domain ownership
- immediate correctness across distributed state
You can get some combination of these. You do not get all of them for free.
The naive design is familiar: the initiating service receives a command and makes a chain of synchronous REST calls to validate every dependent rule before committing. It feels straightforward. It also creates tight runtime coupling, poor latency, cascading failure risk, and hidden business fragility. The Order Service may “own” the order, but operationally it now depends on the health, response time, version compatibility, and semantics of four other services.
That is not service autonomy. That is a distributed committee meeting.
Worse, teams often duplicate validation logic in multiple services “for safety,” creating semantic drift. One service checks customer status as ACTIVE; another also allows PENDING_REVIEW; a third caches a stale status for 15 minutes. The result is not a robust system. It is a system that disagrees with itself.
The deeper problem is usually one of misplaced authority. The architecture does not know the difference between:
- a service asking another service for facts it owns,
- a service applying a policy locally based on published facts,
- and a service requiring an authoritative decision from another domain.
Those three interaction styles look similar in sequence diagrams. In production they behave very differently.
Forces
There are several forces in tension here, and good architecture is mostly the art of deciding which force gets to win in which context.
1. Domain authority vs process convenience
If Customer Service owns customer status, it should be the source of truth. But that does not mean every workflow should call Customer Service live. Ownership of truth is not the same as ownership of every read path.
DDD helps here. The bounded context that owns the concept defines its semantics. Other contexts may consume those semantics through events, replicated views, or explicit policy decisions.
2. Consistency vs availability
Synchronous validation offers stronger point-in-time correctness. It also reduces availability because each dependency becomes part of the request path.
Asynchronous validation with Kafka and local projections improves availability and throughput, but introduces staleness and reconciliation work. There is no escaping this tradeoff. Anyone promising “real-time loosely coupled strongly consistent microservices” is selling a fairy tale.
3. Latency vs correctness depth
A checkout flow can tolerate perhaps a few hundred milliseconds. A commercial lending approval might tolerate seconds. A batch settlement process can tolerate minutes. Validation design must respect the business tempo.
4. Team autonomy vs central governance
A central validation engine can standardize rules. It can also become a platform bottleneck and a disguised monolith. Decentralized validation preserves autonomy but risks inconsistency unless domain semantics are explicit and shared carefully.
5. Regulatory traceability
In finance, insurance, healthcare, and telecom, “why was this accepted?” matters as much as “was it accepted?” Validation architecture must support auditability, evidence capture, and replay of decision context.
6. Migration reality
Most enterprises are not starting greenfield. They have a monolith, some packaged systems, a few APIs, and a lot of tribal knowledge. Cross-service validation patterns must support progressive strangler migration, not demand an overnight rewrite.
Solution
The most effective pattern is not one mechanism but a validation strategy stack. Use different validation approaches depending on the semantics of the rule and the cost of being wrong.
The stack usually looks like this:
- Validate local invariants inside the owning service
- Consume authoritative domain events into local read models for fast reference checks
- Call external services synchronously only for truly authoritative, time-sensitive decisions
- Use asynchronous reconciliation for rules that cannot be guaranteed at request time
- Record validation evidence so decisions can be explained and replayed
That sounds obvious. It rarely gets implemented cleanly.
A useful heuristic is this:
> If the data is stable enough to be published as a fact, copy it.
> If the decision is contextual and time-sensitive, ask for it.
> If the world can still change after acceptance, reconcile it.
That is the heart of cross-service validation architecture.
Validation modes
Mode 1: Local validation
The initiating service validates what it owns. No debate here.
Mode 2: Projection-based validation
The service maintains a local materialized view of facts from other contexts via Kafka or another event backbone. It uses these for eligibility checks without synchronous calls.
Examples:
- customer active status
- provider credential state
- product sellability flags
- account closure status
This is often the highest-leverage move in a microservices estate. It removes runtime coupling while preserving semantic ownership.
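A minimal sketch of Mode 2, assuming a plain in-memory dict as a stand-in for a Kafka-fed materialized view; class, field, and status names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CustomerFacts:
    status: str  # already translated into this context's vocabulary
    offset: int  # event offset the projection reflects, kept for evidence

class CustomerProjection:
    """Local materialized view of customer facts owned by another context.

    In production this would be fed by a Kafka consumer; a dict stands in
    for the store here.
    """

    def __init__(self) -> None:
        self._facts = {}

    def apply(self, event: dict) -> None:
        # e.g. a customer_status_changed business event
        self._facts[event["customer_id"]] = CustomerFacts(
            status=event["status"], offset=event["offset"]
        )

    def is_eligible(self, customer_id: str) -> bool:
        facts = self._facts.get(customer_id)
        # Unknown customers fail closed: absence of a fact is not eligibility.
        return facts is not None and facts.status == "ELIGIBLE"

projection = CustomerProjection()
projection.apply({"customer_id": "c-1", "status": "ELIGIBLE", "offset": 42})
```

The eligibility check is now a local read: no network call, no dependency on the source service being up at request time.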
Mode 3: Authoritative decision call
For certain checks, facts are not enough. The other service must make the decision because it owns live models, proprietary logic, or regulated scoring.
Examples:
- fraud authorization
- real-time credit exposure
- payment authorization
- sanctions screening
These belong on synchronous request paths, but sparingly.
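A sketch of how an authoritative call might sit on the request path, with an explicit timeout and an explicit fallback decision. The `fraud_client` interface and the stub classes are hypothetical, not a real library:

```python
from dataclasses import dataclass

class FraudServiceUnavailable(Exception):
    pass

@dataclass
class FraudDecision:
    score: float
    threshold: float

def check_fraud(fraud_client, order: dict, timeout_s: float = 0.5) -> str:
    try:
        decision = fraud_client.score(order, timeout=timeout_s)
    except TimeoutError:
        # Fallback behavior is a business decision, not an afterthought:
        # here, high-value orders fail closed and low-value ones pend.
        if order["amount"] > 1000:
            raise FraudServiceUnavailable("high-value order needs a live decision")
        return "PENDING_REVIEW"
    return "APPROVED" if decision.score < decision.threshold else "REJECTED"

# Stubs for illustration only.
class StubFraudClient:
    def score(self, order, timeout):
        return FraudDecision(score=0.2, threshold=0.8)

class TimingOutFraudClient:
    def score(self, order, timeout):
        raise TimeoutError
```

The point is that the degraded path is designed, named, and testable, rather than discovered during an incident.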
Mode 4: Post-acceptance reconciliation
Sometimes the system accepts a request based on best available information, then confirms or compensates later.
Examples:
- inventory backorder after eventual stock reconciliation
- healthcare claims pended for manual review
- quote accepted subject to underwriting confirmation
This is not hand-wavy “eventual consistency.” It is an explicit business process with statuses, deadlines, compensations, and operational ownership.
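That explicitness can be as simple as a state table: every pending status has a named, finite set of outcomes, and a transition that is not listed is a bug rather than an implicit business decision. A sketch with illustrative states and events:

```python
# Post-acceptance reconciliation as an explicit state machine.
TRANSITIONS = {
    ("ACCEPTED_PENDING_STOCK", "stock_confirmed"): "CONFIRMED",
    ("ACCEPTED_PENDING_STOCK", "stock_missing"): "BACKORDERED",
    ("BACKORDERED", "stock_confirmed"): "CONFIRMED",
    ("BACKORDERED", "deadline_expired"): "CANCELLED_WITH_REFUND",
}

def transition(state: str, event: str) -> str:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no defined outcome for {event!r} in state {state!r}")
```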
Architecture
A robust architecture separates validation orchestration from domain authority.
The initiating service may coordinate the flow, but each rule should map clearly to a source of truth and a validation mode.
A few things matter here.
First, the local read models are not a cache in the lazy sense. They are purpose-built projections shaped for the consuming bounded context. They contain only the externally owned facts this context is allowed to rely on. That subtlety matters. A projection is a domain integration artifact; a cache is usually an implementation convenience.
Second, the Validation Orchestrator can live inside the initiating service or as a separate component. My bias is to keep it inside the domain service unless the process spans multiple aggregate roots or long-running workflows. Separate orchestration services are easy to invent and hard to retire.
Third, every validation rule should declare:
- source of authority
- freshness requirement
- fallback behavior
- evidence to store
- compensation path if later contradicted
Without that, operations will end up discovering the real rules during incidents.
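One lightweight way to make those declarations real is a rule contract object checked into the owning service's codebase. A hedged sketch with illustrative field values:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class RuleContract:
    name: str
    source_of_authority: str          # bounded context that owns the semantics
    max_staleness_s: Optional[float]  # None means a live call is required
    fallback: str                     # e.g. "fail_closed", "pend_for_review"
    evidence: Tuple[str, ...]         # facts captured at decision time
    compensation: Optional[str]       # process if later contradicted

CUSTOMER_ACTIVE = RuleContract(
    name="customer_active",
    source_of_authority="customer-service",
    max_staleness_s=60.0,
    fallback="pend_for_review",
    evidence=("customer_id", "status", "projection_offset"),
    compensation="cancel_order_and_notify_finance",
)
```

Even if the contract is only ever read by humans, writing it down forces the questions operations would otherwise answer mid-incident.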
Domain semantics and anti-corruption
Cross-service validation often fails because one service imports another service’s internal model wholesale. Customer “status” is a notorious offender. The source service may have ten internal states. The consuming domain may only care about three categories: eligible, ineligible, unknown.
This is where DDD earns its keep. Use an anti-corruption layer or translation model so external facts are mapped into the local domain language. Do not spray foreign enums through your codebase and call it integration.
For example:

- Customer Service publishes ACTIVE, SUSPENDED, DECEASED, MERGED, PENDING_KYC
- Order Service translates these to ELIGIBLE, BLOCKED, REVIEW_REQUIRED
That protects the domain from semantic churn and clarifies what validation actually means in this context.
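The translation itself can be a single, boring mapping at the boundary. A sketch assuming the statuses above, with unknown values deliberately routed to review rather than failing open:

```python
# Anti-corruption mapping: the source system's internal statuses never
# leak past this function into the consuming domain.
_STATUS_TRANSLATION = {
    "ACTIVE": "ELIGIBLE",
    "SUSPENDED": "BLOCKED",
    "DECEASED": "BLOCKED",
    "MERGED": "REVIEW_REQUIRED",
    "PENDING_KYC": "REVIEW_REQUIRED",
}

def to_local_status(source_status: str) -> str:
    # Unknown upstream values route to review rather than failing open,
    # so semantic churn upstream cannot silently approve orders.
    return _STATUS_TRANSLATION.get(source_status, "REVIEW_REQUIRED")
```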
Event-driven validation with Kafka
Kafka is particularly useful here because it supports durable event streams, replay, consumer isolation, and local projection building. It is not magic, but it is a good fit for validation facts that change over time and must be consumed by multiple services.
Two cautions.
One, do not put raw database change events on Kafka and call it domain integration. Validation semantics need meaningful business events. customer_status_changed is useful. row_updated is not.
Two, local projections need versioning, replay support, and monitoring for lag. A stale projection is a hidden failure mode. If your architecture depends on local validation reads, then consumer lag is not just an observability metric. It is a business risk indicator.
Validation evidence
For regulated or high-value workflows, store the validation evidence used at decision time:
- event version or projection timestamp
- external decision IDs
- rules evaluated
- key facts used
- result and reason codes
This is invaluable for audits, disputes, and replay. It also allows controlled revalidation when upstream semantics change.
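A minimal sketch of an evidence record captured at decision time; field names are illustrative, and a real system would persist this transactionally alongside the decision:

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class ValidationEvidence:
    rule: str
    facts: dict                          # key facts used in the decision
    projection_offset: Optional[int]     # event version / projection position
    external_decision_id: Optional[str]  # e.g. a fraud decision reference
    result: str
    reason_code: str
    recorded_at: float = field(default_factory=time.time)

evidence = ValidationEvidence(
    rule="customer_active",
    facts={"customer_id": "c-1", "status": "ELIGIBLE"},
    projection_offset=42,
    external_decision_id=None,
    result="PASS",
    reason_code="OK",
)
# Serialized and stored so audits replay facts, not guesses.
record = json.dumps(asdict(evidence))
```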
Migration Strategy
Most enterprises arrive here from a monolith that already contains cross-domain validation logic buried in a transaction script, rules engine, or god service. You do not fix that by extracting six services and wiring synchronous calls between them. That merely preserves coupling while adding network failure.
A better path is progressive strangler migration.
Start by identifying the validation rules in the monolith and classifying them:
- local invariant
- reference fact
- policy decision
- temporal check
- reconciliation candidate
This inventory is usually eye-opening. Many “hard” cross-service validations turn out to be reference facts suitable for publication and local projection.
Then migrate in stages.
Stage 1: Make validation explicit in the monolith
Before splitting services, expose the validation rules, sources, and outcomes as explicit components and logs. If you cannot describe the rule cleanly in the monolith, you will not distribute it safely.
Stage 2: Publish domain events from the monolith
Use the strangler approach to emit authoritative business events from the existing system. Build downstream projections in new services. This allows consumers to validate using local views before the source system is fully decomposed.
Stage 3: Extract read-driven bounded contexts
Move services whose primary need is consuming facts rather than making authoritative decisions. They can use Kafka-fed projections and remain decoupled from the monolith.
Stage 4: Extract decision-owning services
Move contexts that must make live authoritative decisions, such as fraud, underwriting, or payment authorization. Introduce synchronous APIs only where the domain really demands them.
Stage 5: Introduce reconciliation workflows
For validations that cannot be made fully synchronous without harming resilience, add pending states, compensations, and operational dashboards.
This migration sequence matters because it preserves business continuity while steadily reducing coupling.
A common failure is extracting the API before extracting the semantics. Teams carve out a “Customer Service,” but consumers still do live lookups for everything because no one published domain events or clarified what facts could be copied safely. The new architecture looks modern in PowerPoint and behaves like remote procedure calls over a slow network.
Enterprise Example
Take a healthcare payer processing claims.
A claim adjudication request may depend on:
- member eligibility
- provider network participation
- policy coverage limits
- prior authorization status
- fraud indicators
- benefit accumulators
- clinical coding rules
No serious payer can adjudicate all this through a chain of live calls on every claim line. The volume is too high, the latency budget too tight, and the failure blast radius too large.
A practical architecture works like this:
- Member Service publishes eligibility and plan enrollment events.
- Provider Service publishes credential and network participation changes.
- Benefits Service publishes coverage policy and accumulator snapshots or deltas.
- Claim Service consumes these events into adjudication projections.
- Fraud Service remains an authoritative real-time decision service for suspicious or high-value claims.
- Prior Authorization Service may be queried live only for specific procedure types where current approval state is essential.
The Claim Service validates most claims against local projections. It sends only a subset to real-time services. If a downstream contradiction emerges later—for example, a retroactive eligibility cancellation—the claim enters reconciliation: adjust payment, pend future claims, notify finance, and create an audit trail.
That is what enterprise architecture looks like in the real world: not perfect consistency, but controlled inconsistency with explicit business handling.
This pattern is common beyond healthcare.
In banking:
- account status and product entitlements can be projected locally,
- sanctions and fraud often remain live decision calls,
- settlement exceptions are reconciled later.
In retail:
- catalog and customer state can be projected,
- payment auth is synchronous,
- inventory truth may require reservation and later reconciliation for edge cases.
In telecom:
- subscriber status and product eligibility can be projected,
- credit and provisioning checks may combine local facts with live authority calls,
- failed activations require compensating workflows.
Operational Considerations
Cross-service validation is an operational architecture as much as an application architecture.
Observability
You need visibility into:
- validation latency by rule and dependency
- Kafka consumer lag for critical projections
- stale-read age of local projections
- rejection, pending, and override rates
- reconciliation backlog and aging
- dependency timeouts and circuit breaker open states
A service can appear healthy while making bad validation decisions from stale data. Traditional uptime metrics will not catch that.
Freshness policies
Not every validation fact needs the same freshness.
Define explicit SLAs:
- customer status projection max age: 60 seconds
- pricing eligibility max age: 5 minutes
- fraud decision: live only
- coverage accumulator: same-day acceptable for pre-check, live required for final adjudication
This is architecture turning business tolerances into runtime contracts.
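Those SLAs can be enforced as a runtime check before trusting any projection read. A sketch using the illustrative ages above, where the absence of an SLA means the fact requires a live call:

```python
import time
from typing import Optional

# Freshness SLAs as runtime contracts (values illustrative). None means
# no projection read is acceptable: the check must be a live call.
MAX_AGE_S = {
    "customer_status": 60,
    "pricing_eligibility": 300,
    "fraud_decision": None,
}

def may_use_projection(fact: str, projected_at: float,
                       now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    max_age = MAX_AGE_S.get(fact)
    if max_age is None:
        return False  # undefined or live-only: do not trust the projection
    return (now - projected_at) <= max_age
```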
Idempotency and replay
Kafka consumers building projections must be idempotent. Duplicate events happen. Replays happen. Out-of-order events happen if you designed partitions poorly or modeled versioning carelessly.
Validation systems must survive replay because replay is how you rebuild projections, recover from bugs, and answer audit questions.
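A common way to get that idempotency is version-gated application: an event is applied only if it is newer than what the projection already holds, so duplicates and replays become safe no-ops. A sketch with an in-memory store standing in for the real one:

```python
class VersionedProjection:
    """Projection store that applies an event only if its version is newer
    than the version already held for that key."""

    def __init__(self) -> None:
        self._rows = {}  # key -> (version, facts)

    def apply(self, event: dict) -> bool:
        key, version = event["key"], event["version"]
        current = self._rows.get(key)
        if current is not None and current[0] >= version:
            return False  # duplicate or stale event: ignore, no side effects
        self._rows[key] = (version, event["facts"])
        return True

    def get(self, key: str):
        row = self._rows.get(key)
        return None if row is None else row[1]
```

This assumes events carry a per-key monotonic version, which is exactly the kind of modeling decision the surrounding paragraph warns about getting wrong.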
Schema and semantic versioning
Schema evolution is manageable. Semantic evolution is harder.
If Customer Service changes what “ACTIVE” means, your contract has changed even if the JSON schema did not. This is why cross-service validation needs product thinking around shared domain events, not just API governance.
Human override and case management
In enterprise workflows, some validations fail into a manual path. That path needs first-class design:
- reason codes
- evidence captured
- override authority
- expiry and revalidation
- audit logging
If the architecture ignores manual operation, the business will rebuild it in spreadsheets and inboxes.
Tradeoffs
There is no universal answer, only conscious compromises.
Projection-based validation
Pros
- low latency
- reduced runtime coupling
- better resilience
- scalable for high-volume workflows
Cons
- eventual consistency
- stale data risk
- projection maintenance overhead
- need for reconciliation
Synchronous authoritative validation
Pros
- stronger point-in-time correctness
- clear ownership of decision
- simpler semantics for regulated decisions
Cons
- latency and availability dependency
- cascading failures
- tighter runtime coupling
- hard to scale across many checks
Central validation service
Pros
- consistent rule execution
- single integration point
- easier audit in some cases
Cons
- can become a bottleneck
- often erodes bounded contexts
- teams queue behind a central platform
- hidden monolith risk
My bias is simple: centralize governance of validation contracts, not all validation execution. Keep rule authority where the domain lives.
Failure Modes
This area has very predictable ways to fail.
1. Distributed monolith validation
Every request fans out to five services. One slows down, all slow down. Circuit breakers start tripping. Business transactions fail for reasons unrelated to actual domain validity.
2. Semantic duplication
Multiple services implement “customer eligibility” differently. The system becomes internally inconsistent and impossible to reason about.
3. Stale projection blindness
Teams trust local read models but do not monitor lag or freshness. Validation silently degrades.
4. Fake event-driven architecture
Services emit low-level CRUD events that lack domain meaning. Consumers reverse-engineer business semantics from table changes. This is brittle and miserable.
5. No reconciliation ownership
The architecture assumes eventual consistency but no team owns the exception workflows. The result is orphaned transactions, finance discrepancies, and ugly manual cleanup.
6. Over-centralized rules engine
A shared platform team creates a global validation engine. At first it looks elegant. Then every domain nuance gets shoved into generic metadata. Soon nobody can change a rule without a cross-team negotiation and a regression scare.
7. Missing evidence
A decision is made, but the system cannot prove which facts were used. Audits become archaeology.
When Not To Use
Do not reach for elaborate cross-service validation patterns when the problem does not deserve them.
Use a monolith when the domain is still fluid
If business rules are changing weekly and the team is small, splitting validation across services early is self-inflicted pain.
Avoid local projections when decisions must be exact to the millisecond
For real-time trading exposure checks or payment authorization holds, stale data may be unacceptable. Use an authoritative synchronous path.
Avoid synchronous fan-out for high-volume commodity checks
If every low-value transaction requires three real-time calls, you are building fragility into the critical path.
Avoid central orchestration services for simple domains
If one service mostly owns the workflow, embedding orchestration there is usually cleaner than introducing a separate “validation platform.”
Avoid event-driven replication without domain event maturity
If the source team cannot publish stable, meaningful business events, projection-based validation will become an exercise in guesswork.
Related Patterns
Cross-service validation overlaps with several important patterns.
Saga
Useful when validation and subsequent actions span multiple services with compensations. Validation may be an early stage of a larger long-running business process.
CQRS
Projection-based validation often uses CQRS-style read models. The write model remains authoritative in the source context; consumers build read-optimized local views.
Anti-Corruption Layer
Essential when translating external domain facts into local validation semantics.
Policy Decision Point / Policy Enforcement Point
Common in security and compliance domains. A central policy decision model can be useful, but only when policy is genuinely cross-cutting and not domain-specific business logic in disguise.
Outbox Pattern
Critical for reliably publishing validation-relevant domain events from transactional systems during migration and beyond.
Reconciliation and Compensating Transactions
Not glamorous, but indispensable. When validation cannot be final at acceptance time, these patterns turn inconsistency into managed process.
Summary
Cross-service validation is one of the places where microservices architecture stops being a diagram and becomes a discipline.
The right answer is rarely “just call the other service.” Nor is it “make everything asynchronous.” The real design task is to decide which service owns the semantics of validity, which facts can be copied safely, which decisions must remain authoritative and live, and which contradictions must be handled through reconciliation.
That is classic domain-driven design thinking: bounded contexts define meaning, integration respects authority, and workflows acknowledge reality instead of pretending distributed consistency is free.
If you remember only one idea, make it this:
> Validation is not about where the code runs. It is about who is allowed to say yes.
Build local projections for reference facts. Reserve synchronous calls for genuine live authority. Add explicit pending and reconciliation states where the business can tolerate them. Store evidence. Monitor freshness. Migrate progressively with a strangler approach instead of replacing one monolith with a network of smaller ones.
Microservices do not remove validation complexity. They reveal it. Good architecture does not hide that truth. It gives the business a system that can live with it.
Frequently Asked Questions
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.