There is a particular kind of enterprise lie that shows up on status dashboards.
Every service is green. Every pod is healthy. Kubernetes is smiling. CPU is moderate, memory is stable, error rates are low enough to avoid panic. And yet the business capability is down. Customers cannot place orders. Agents cannot issue refunds. Warehouses cannot confirm shipments. The system is “healthy” in the way a city is healthy because all its traffic lights still have electricity while every major road is blocked.
This is the central mistake in service health propagation: we confuse local liveness with operational truth.
A microservice landscape is not a bag of independent executables. It is a network of promises. One service promises to price a quote, another to reserve inventory, another to authorize payment, another to emit compliance events, another to update CRM, and somewhere in the middle Kafka carries the nervous system of the estate. When one promise weakens, the others may still run, but the business capability they collectively provide can already be degraded, delayed, or effectively dead.
That is why simple health checks are not enough. “Can I answer HTTP /health?” is useful, but shallow. Enterprises need something richer: a health graph. Not merely a list of service statuses, but a model of how health propagates through dependencies, bounded contexts, asynchronous flows, and business criticality. A health graph turns isolated technical checks into operational semantics. It answers the question leaders actually care about: what business outcomes are currently safe to execute, degraded, or blocked?
This is not just an observability concern. It is architecture. It touches domain-driven design, dependency management, resilience strategy, migration planning, and operational governance. Done well, it gives operators clarity and gives teams a language to talk about partial failure. Done badly, it becomes another ornamental dashboard that no one trusts during an incident.
Let’s talk about how to build one that matters.
Context
Microservices created a useful discipline: teams own services, APIs are explicit, deployment is decoupled, and scaling can follow business demand. But the price of local autonomy is global complexity. Once a system becomes a mesh of APIs, events, caches, data products, fraud engines, payment gateways, master data services, and Kafka topics, the question “is the system healthy?” stops being simple.
In monoliths, health was often blunt but honest. If the app was down, everyone knew. If the database was dead, the business capability was down. There was one blast radius.
In microservices, health fragments. A service may be technically up while semantically useless because:
- its upstream dependency is stale,
- its downstream dependency is timing out,
- its Kafka consumer lag is hours behind,
- its circuit breaker is open,
- its read model has not reconciled,
- its “success” responses trigger side effects that silently fail elsewhere.
This gets worse in event-driven systems. With synchronous request-response, dependencies are easier to reason about: service A calls service B. In Kafka-heavy estates, the chain is longer and less visible. Service A publishes “OrderPlaced”, service B enriches it, service C reserves stock, service D triggers shipping, service E emits ledger entries, and service F updates customer communications. There may be no single point where “health” can be checked directly, but the business process still has a health state.
So the architectural move is to treat health as a graph problem, not a ping problem.
A health graph models:
- services and infrastructure nodes,
- synchronous and asynchronous dependencies,
- criticality of edges,
- domain-level capability status,
- freshness and reconciliation state,
- propagation rules for degradation.
In other words, it makes explicit what is otherwise tribal knowledge in the heads of support teams and principal engineers.
Problem
The naive health model in most microservice platforms has three levels:
- Liveness: is the process alive?
- Readiness: can it receive traffic?
- Dependency checks: can it reach a database, cache, or broker?
Useful. Necessary. Not sufficient.
The actual problem is that these checks are component-centric while enterprise operations are capability-centric.
A customer does not care whether the Product Catalog service can reach Redis. They care whether they can browse accurate prices and place an order. A claims adjuster does not care whether the Rules service responds in under 200 ms. They care whether claims can be adjudicated within policy and within SLA.
This mismatch creates several recurring failure patterns:
- False green
Individual services report healthy while the end-to-end business process is broken.
- False red
A non-critical dependency is down and alarms fire everywhere, but the business capability is still operating within acceptable degradation.
- Dependency opacity
Teams know their direct calls but not the true runtime dependency chain, especially across Kafka topics, data pipelines, and third-party integrations.
- Staleness blindness
Systems that rely on eventual consistency do not model freshness. A read model may be technically available but dangerously stale.
- No semantic propagation
Health does not distinguish “payments delayed but retriable” from “payments impossible due to acquirer outage”.
- Operational incoherence
Incident response devolves into conference calls where each team says, “our service is healthy,” while the enterprise capability remains unavailable.
The core issue is not lack of telemetry. Most enterprises are drowning in telemetry. The issue is lack of semantic structure.
Forces
A good architecture article is really an argument about forces. This one has plenty.
1. Local autonomy vs global truth
Microservices encourage teams to own and expose their own health. That is good. But if every team defines health differently, the platform cannot form a coherent picture. You need local ownership and shared semantics.
2. Technical health vs domain health
A service can be available while delivering invalid business outcomes. Domain semantics matter. “Quote service responding” is not the same as “pricing capability trustworthy.” This is where domain-driven design earns its keep.
3. Real-time certainty vs eventual consistency
In Kafka-based systems, health is often probabilistic and temporal. The question becomes: how stale is too stale? You need propagation rules that account for lag, replay, reconciliation, and recovery windows.
4. Simplicity vs fidelity
A binary healthy/unhealthy status is easy to understand but operationally useless in complex estates. A fully weighted graph with confidence scoring can become too clever to trust. The design has to be rich enough to reflect reality and simple enough to operate.
5. Platform standardization vs bounded context differences
A payment domain and a customer preference domain do not have the same health semantics. Standardization should focus on the framework for describing health, not forcing identical meaning everywhere.
6. Centralized visibility vs decentralized ownership
A central health graph is valuable, but if it becomes a manually maintained architecture inventory, it will rot. Ownership must stay with service teams, while the graph is assembled centrally from declared contracts and observed runtime facts.
Solution
The solution is to create a service health propagation model, implemented as a health graph, where health is computed from multiple signals and propagated through dependency relationships to produce both technical and business-capability status.
The key shift is this:
> Don’t ask whether a service is up. Ask whether a capability can still keep its promises.
Core concepts
Node types
A health graph usually includes several kinds of nodes:
- Service nodes: deployable units, APIs, stream processors
- Infrastructure nodes: databases, Kafka clusters, caches, gateways
- External dependency nodes: payment providers, identity services, partner APIs
- Capability nodes: business-level functions like “Place Order” or “Issue Refund”
- Data product / read model nodes: materialized views, search indexes, reporting feeds
This matters because capabilities are not services. They span services. DDD helps here: capabilities often align to domain outcomes across bounded contexts.
Edge types
Not all dependencies are equal. Model different edge semantics:
- Synchronous hard dependency: service A cannot complete without B
- Synchronous optional dependency: B enriches the response but is not required
- Asynchronous processing dependency: A publishes events consumed by B
- Freshness dependency: a read model depends on event flow remaining within lag tolerance
- Compensatable dependency: failure is acceptable for now if a compensating process exists
- Control-plane dependency: service startup or scaling depends on another system
The graph is where tradeoffs become visible.
Health states
Avoid simple binary status. A practical set looks like:
- Healthy
- Degraded
- At Risk
- Unavailable
- Unknown
And for data-bearing or event-driven nodes, add metadata:
- freshness age
- consumer lag
- reconciliation backlog
- confidence score
- last successful end-to-end confirmation
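To make this concrete, here is a minimal sketch of how node types, edge semantics, and health states with their metadata might be represented. The enum values and field names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    AT_RISK = "at_risk"
    UNAVAILABLE = "unavailable"
    UNKNOWN = "unknown"

class NodeKind(Enum):
    SERVICE = "service"                 # deployable unit, API, stream processor
    INFRASTRUCTURE = "infrastructure"   # database, Kafka cluster, cache, gateway
    EXTERNAL = "external"               # payment provider, identity service, partner API
    CAPABILITY = "capability"           # business function such as "Place Order"
    READ_MODEL = "read_model"           # materialized view, search index, reporting feed

class EdgeKind(Enum):
    SYNC_HARD = "sync_hard"             # A cannot complete without B
    SYNC_OPTIONAL = "sync_optional"     # B enriches the response but is not required
    ASYNC = "async"                     # A publishes events consumed by B
    FRESHNESS = "freshness"             # read model must stay within lag tolerance
    COMPENSATABLE = "compensatable"     # failure tolerable while a compensating process exists
    CONTROL_PLANE = "control_plane"     # startup or scaling depends on another system

@dataclass
class Node:
    name: str
    kind: NodeKind
    state: Health = Health.UNKNOWN
    # Metadata for data-bearing or event-driven nodes
    freshness_age_seconds: Optional[float] = None
    consumer_lag: Optional[int] = None
    reconciliation_backlog: Optional[int] = None
    confidence: Optional[float] = None

@dataclass
class Edge:
    source: str     # the dependent node, e.g. a capability
    target: str     # the dependency it relies on
    kind: EdgeKind
    lag_tolerance_seconds: Optional[float] = None  # meaningful for ASYNC / FRESHNESS edges
```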
Propagation rules
This is the heart of the architecture. Propagation is not just “red flows downstream.” It depends on semantics.
Examples:
- If payment authorization is unavailable, “Place Order” may become Unavailable.
- If customer recommendations are unavailable, “Browse Catalog” may become Degraded.
- If shipment-event consumer lag exceeds 30 minutes, “Track Shipment” may become At Risk.
- If CRM sync is down but a compensating replay exists, “Order Fulfillment” may remain Healthy while “Customer Service Visibility” becomes Degraded.
This is why a health graph is domain work, not just platform plumbing.
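Such rules can be captured declaratively. The sketch below restates the examples above as data; the names, conditions, and thresholds are purely illustrative.

```python
# Hypothetical declarative form of the propagation examples above.
PROPAGATION_RULES = [
    {"capability": "Place Order",
     "dependency": "payment-authorization",
     "when": "unavailable",
     "capability_becomes": "unavailable"},

    {"capability": "Browse Catalog",
     "dependency": "customer-recommendations",
     "when": "unavailable",
     "capability_becomes": "degraded"},

    {"capability": "Track Shipment",
     "dependency": "shipment-event-consumer",
     "when": "lag_minutes > 30",
     "capability_becomes": "at_risk"},

    {"capability": "Customer Service Visibility",
     "dependency": "crm-sync",
     "when": "unavailable and compensating replay exists",
     "capability_becomes": "degraded"},
]
```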
Architecture
A practical implementation usually has four layers:
- Signal collection
- Health normalization
- Graph computation
- Capability presentation and actioning
Here is the basic shape.
1. Signal collection
Collect the usual platform signals:
- liveness, readiness
- response latency, error rates
- circuit breaker state
- downstream connectivity
- database replication lag
- Kafka producer and consumer errors
- topic lag and dead-letter volume
But add domain signals. These are often the missing piece:
- number of orders pending payment confirmation beyond threshold
- percentage of claims requiring manual intervention
- reconciliation mismatch count
- age of last successful settlement batch
- stale product price projection beyond SLA
A health graph without domain signals is just infrastructure theater.
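As a hedged illustration of emitting a domain signal, the sketch below assumes the Python prometheus_client library and a hypothetical order repository; the metric name and repository method are invented for the example.

```python
from prometheus_client import Gauge

# Domain signal: orders stuck awaiting payment confirmation beyond the business threshold.
ORDERS_PENDING_PAYMENT = Gauge(
    "orders_pending_payment_confirmation_over_threshold",
    "Orders waiting on payment confirmation longer than the agreed threshold",
)

def publish_domain_signals(order_repository, threshold_minutes: int = 15) -> None:
    # count_pending_payment_older_than() is a hypothetical repository query;
    # any equivalent query over your order store works the same way.
    count = order_repository.count_pending_payment_older_than(minutes=threshold_minutes)
    ORDERS_PENDING_PAYMENT.set(count)
```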
2. Health normalization
Each service team may emit signals differently. The platform should normalize into a shared health contract such as:
- current component state
- evidence timestamps
- confidence level
- dependency-specific statuses
- degradation notes
- semantic tags: critical, optional, stale-ok-15m, compensatable
Think of this as anti-corruption for operational semantics. In DDD terms, the health platform needs a language that can map bounded-context-specific meaning into an enterprise-operable representation without flattening all nuance.
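A minimal sketch of what that shared health contract could look like as a data structure; the field names are chosen for illustration rather than taken from any particular platform.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class DependencyStatus:
    name: str
    state: str                       # healthy | degraded | at_risk | unavailable | unknown
    tags: list[str] = field(default_factory=list)  # e.g. ["critical"] or ["optional", "stale-ok-15m"]

@dataclass
class HealthReport:
    component: str
    state: str
    evidence_at: datetime            # when the underlying evidence was observed
    confidence: float                # 0.0 .. 1.0, how much to trust this report
    dependencies: list[DependencyStatus] = field(default_factory=list)
    degradation_note: Optional[str] = None
```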
3. Graph computation
The graph engine combines:
- static dependency metadata from service catalog or architecture registry
- runtime discovery from tracing or service mesh
- event topology from Kafka metadata and consumer group mappings
- domain capability mapping from architecture and product models
A simple propagation algorithm often works better than a sophisticated one no one understands. Start with weighted rules:
- hard synchronous dependency unavailable → parent degrades to unavailable
- optional dependency unavailable → parent degrades to degraded
- async dependency lag > threshold → capability becomes at risk
- reconciliation backlog growing and no catch-up trend → capability degrades further
- multiple weak signals may combine into at risk even if none individually is fatal
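A deliberately simple propagation pass over such a graph might look like the following sketch. It mirrors the weighted rules above; every name and threshold is illustrative, and a real engine would add the combining of weak signals.

```python
from enum import IntEnum

# Higher value = worse, so max() picks the worst contribution.
class State(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    AT_RISK = 2
    UNAVAILABLE = 3

def propagate(capability, edges, states, lags, lag_threshold_s=1800):
    """edges: capability -> list of (dependency, kind), kind in {"hard", "optional", "async"}.
    states: current State per dependency node. lags: consumer lag in seconds for async edges."""
    worst = State.HEALTHY
    for dependency, kind in edges.get(capability, []):
        dep_state = states.get(dependency, State.HEALTHY)
        if kind == "hard" and dep_state == State.UNAVAILABLE:
            worst = max(worst, State.UNAVAILABLE)      # hard dependency down -> capability down
        elif kind == "optional" and dep_state == State.UNAVAILABLE:
            worst = max(worst, State.DEGRADED)         # optional dependency down -> degraded only
        elif kind == "async" and lags.get(dependency, 0) > lag_threshold_s:
            worst = max(worst, State.AT_RISK)          # lag beyond threshold -> at risk
        else:
            worst = max(worst, min(dep_state, State.DEGRADED))  # weak signals pull down one notch at most
    return worst

# Example: Place Order depends hard on payment auth, optionally on recommendations,
# and asynchronously on the shipment-event consumer (40 minutes of lag).
edges = {"place-order": [("payment-auth", "hard"),
                         ("recommendations", "optional"),
                         ("shipment-events", "async")]}
states = {"payment-auth": State.HEALTHY, "recommendations": State.UNAVAILABLE}
lags = {"shipment-events": 2400}

print(propagate("place-order", edges, states, lags))  # State.AT_RISK
```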
4. Capability presentation
Operators should see health through business capabilities first, services second.
A good interface answers:
- Can customers place orders?
- Can agents issue refunds?
- Is inventory trustworthy?
- What is degraded?
- What is blocked?
- Which dependency is causing it?
- Is the state getting better or worse?
That final question matters. Static health snapshots are often misleading. Health needs directionality.
Domain Semantics and Bounded Contexts
This is where architecture either becomes serious or remains decorative.
In domain-driven design, bounded contexts exist because words change meaning. “Order accepted” in Sales may mean “customer received confirmation,” while in Fulfillment it means “inventory reserved and dispatch flow initiated.” If your health model ignores these semantic boundaries, propagation will be wrong.
Take a retail platform:
- Pricing bounded context owns price calculation and promotion rules.
- Ordering owns order intake and customer confirmation.
- Payments owns authorization and settlement.
- Fulfillment owns reservation and shipment.
- Customer Care owns agent-facing order visibility.
Now ask: if Payments is down, what is unhealthy?
Not everything.
- “Take order request” may still be possible if orders can be accepted into a pending-payment state.
- “Revenue recognition” is not possible.
- “Shipment release” should remain blocked.
- “Customer care visibility” may remain healthy if order state transitions are still published.
- “Fraud review queue” may become overloaded, creating secondary health concerns.
This is why capability nodes should often be attached to bounded-context semantics, not just technical workflows. The health graph must express the ubiquitous language of the domain. Otherwise it becomes an elegant way to misunderstand the business.
Kafka, Event Flow, and Reconciliation
In synchronous systems, health is immediate. In event-driven systems, health has memory.
Kafka-based estates bring enormous operational power, but they also create a more subtle health model. A producer can be healthy while consumers are lagging. A consumer can be healthy while downstream projections are stale. A topic can be available while one partition is blocked by poison messages. The business process may be delayed but not broken. Or broken but not yet visible.
This is where reconciliation enters the picture.
Reconciliation is the enterprise admission that distributed systems do not always agree in real time. It is not a sign of failure; it is a design pattern for truth recovery. Your health graph should account for:
- event lag,
- dead-letter queue growth,
- replay in progress,
- mismatch between source-of-truth aggregate and read model,
- compensating actions pending.
Here is a typical event-driven propagation model.
Suppose Payment Service is down:
- Order Service may still accept orders.
- Kafka remains healthy.
- Inventory may reserve stock.
- Customer Portal may show “processing.”
- Reconciliation later aligns missing payment outcomes.
Is the capability healthy? Not in a binary sense.
A better view:
- Order Capture: Healthy
- Payment Confirmation: Unavailable
- Order Visibility: Degraded
- Shipment Release: Blocked
- Financial Accuracy: At Risk pending reconciliation
That is the operational truth.
A mature health graph should therefore include freshness SLAs and reconciliation tolerances. If consumer lag is 30 seconds, nobody cares. If lag is 45 minutes during a flash sale, the business absolutely cares.
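A small sketch of how a freshness SLA and a reconciliation tolerance might be evaluated together, assuming the lag figure comes from whatever already collects Kafka consumer metrics; the thresholds are illustrative, not recommendations.

```python
def freshness_state(lag_seconds: float,
                    replay_in_progress: bool,
                    degraded_after_s: float = 600,    # 10 minutes
                    at_risk_after_s: float = 1800):   # 30 minutes
    """Map consumer lag and reconciliation activity to a capability-facing state."""
    if lag_seconds <= degraded_after_s:
        return "healthy"
    if lag_seconds <= at_risk_after_s:
        return "degraded"
    # Beyond the at-risk window, a running replay means the estate is recovering,
    # so report "at_risk" rather than "unavailable".
    return "at_risk" if replay_in_progress else "unavailable"

print(freshness_state(30, replay_in_progress=False))    # healthy: 30 seconds of lag is noise
print(freshness_state(2700, replay_in_progress=True))   # at_risk: 45 minutes, but replay is catching up
```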
Migration Strategy
Do not try to build a perfect enterprise health graph in one heroic programme. You will create a committee, produce a taxonomy, and six months later still be arguing about what “degraded” means.
Use a progressive strangler migration.
Start with one capability, one path to operational truth, and let the model earn trust.
Phase 1: Inventory critical capabilities
Identify a small number of business-critical capabilities:
- Place Order
- Take Payment
- Dispatch Shipment
- Issue Refund
Map the services, Kafka topics, external providers, and read models involved. Not every dependency in the estate. Only the ones that matter for these capabilities.
Phase 2: Add semantic health contracts
Have service teams expose normalized health metadata:
- critical dependencies
- optional dependencies
- lag thresholds
- stale tolerances
- compensatable failure declarations
This can sit alongside existing /health endpoints or be emitted via events.
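For illustration, a declared contract of this kind might be served next to the existing /health endpoint or published as an event; the payload below is hypothetical in every field name.

```python
# Hypothetical payload a service team might expose at /health/contract or emit on a topic.
HEALTH_CONTRACT = {
    "service": "order-service",
    "critical_dependencies": ["payment-authorization", "order-db"],
    "optional_dependencies": ["customer-recommendations"],
    "async_dependencies": {
        "shipment-events": {"lag_threshold_seconds": 1800},
    },
    "stale_tolerances": {
        "order-read-model": {"max_age_seconds": 600},
    },
    "compensatable_failures": ["crm-sync"],  # a replay exists, so failure is tolerable for a while
}
```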
Phase 3: Build first graph and validate against incidents
Construct a graph engine with explicit rules. Keep it simple. Then compare its output against real incident history. This is important. Health models are hypotheses until they survive actual outages.
Phase 4: Strangle dashboard-centric operations
Most enterprises already have dashboards per tool: Prometheus, Grafana, APM, Kafka manager, cloud console, service mesh. Don’t rip them out. Put the health graph in front of them as the operational index. Let teams drill down when needed.
Phase 5: Add reconciliation and freshness semantics
Once synchronous dependencies are modeled, add event lag, dead-letter, replay, and read model freshness. This is usually where the graph becomes genuinely valuable.
Phase 6: Close the loop with automation
Only after operators trust the health graph should you drive actions from it:
- route traffic away from degraded regions,
- disable non-essential features,
- open incident tickets automatically,
- pause downstream consumers,
- trigger replay or reconciliation jobs.
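A conservative way to start is recommendation rather than execution: the graph proposes an action per capability state and an operator approves it. The sketch below is illustrative and the action names are placeholders.

```python
# Map capability states to recommended (not automatic) actions.
RECOMMENDED_ACTIONS = {
    "degraded":    ["disable non-essential features", "open incident ticket"],
    "at_risk":     ["pause downstream consumers", "trigger reconciliation job"],
    "unavailable": ["route traffic away from the affected region", "page the on-call"],
}

def recommend(capability: str, state: str) -> list[str]:
    return [f"{capability}: {action}" for action in RECOMMENDED_ACTIONS.get(state, [])]

# Human-in-the-loop: print the proposals; an operator confirms before anything executes.
for proposal in recommend("Issue Refund", "at_risk"):
    print("PROPOSED:", proposal)
```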
That is the migration shape.
This strangler approach works because it does not require the entire organization to agree on a grand theory of health before delivering value.
Enterprise Example
Consider a global insurer modernizing claims processing.
The legacy world had a large claims administration platform. Health was ugly but simple: if the main platform or database was unavailable, claims intake was down. Then the organization decomposed capabilities:
- FNOL service for first notice of loss
- Coverage verification
- Fraud scoring
- Document ingestion
- Payment disbursement
- Case management
- Customer communication
- Event backbone on Kafka
At first, operations got worse, not better. Each team had a dashboard. Every team said they were healthy during incidents. Claims still stalled.
A typical outage looked like this:
- FNOL API was up.
- Fraud service was responding.
- Kafka cluster was healthy.
- Payment service was up.
- But a document classification consumer had lagged badly after poison messages.
- Claims requiring documents remained stuck in “pending review.”
- Case management read models were stale.
- Contact center agents could not see the true claim status.
- Customer SMS updates continued, but inaccurately.
The organization introduced a health graph around the “Process Claim” capability.
They defined:
- hard dependencies: policy coverage verification, claims persistence
- soft dependencies: outbound customer notification
- async dependencies: document classification, fraud score enrichment, case management view updates
- freshness thresholds: case view stale > 10 min = degraded; > 30 min = at risk
- reconciliation logic: document processing replay within 2 hours preserves capability as degraded, not unavailable
The result was immediate operational clarity:
- “Claim Intake” remained healthy.
- “Automated Adjudication” became degraded.
- “Agent Claim Visibility” became unavailable in one region.
- “Customer Notifications” remained healthy but flagged as potentially misleading.
That changed behavior. Instead of declaring a general Sev1 outage, the business routed document-heavy claims to manual handling, paused inaccurate customer updates, and prioritized consumer replay over service restarts. Mean time to meaningful response improved faster than mean time to recovery, and in many enterprises the former is the more important metric.
That is what a health graph does. It gives the enterprise a truthful map of a partial failure.
Operational Considerations
A few operational points matter more than architecture diagrams admit.
Ownership
Service teams own the declaration of health semantics for their services. Platform teams own normalization, aggregation, and tooling. Enterprise architecture owns the capability model and policy principles. If one group tries to own all three, the system will fail politically before it fails technically.
Data sources
Use multiple evidence sources:
- service-reported health
- synthetic transactions
- tracing-derived runtime edges
- Kafka consumer lag and DLQ metrics
- business KPI anomalies
- reconciliation job outcomes
If you rely only on self-reported health, you are asking services to mark their own homework.
Time windows
Health needs temporal reasoning:
- immediate outage,
- sustained degradation,
- improving but not recovered,
- stale but reconcilable.
A five-second spike should not take down a capability. A fifteen-minute lag might.
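A minimal sketch of that temporal reasoning, assuming we only change a capability's reported state once the raw signal has held for a sustained window; the window length is illustrative.

```python
from dataclasses import dataclass

@dataclass
class SmoothedState:
    """Report a new state only after it has been observed continuously for hold_seconds."""
    hold_seconds: float = 300       # e.g. five minutes of sustained evidence
    current: str = "healthy"
    _candidate: str = "healthy"
    _candidate_since: float = 0.0

    def observe(self, raw_state: str, now: float) -> str:
        if raw_state == self.current:
            self._candidate = raw_state
            return self.current
        if raw_state != self._candidate:
            # A new candidate state starts the clock.
            self._candidate, self._candidate_since = raw_state, now
        elif now - self._candidate_since >= self.hold_seconds:
            # The candidate has persisted long enough to become the reported state.
            self.current = raw_state
        return self.current

# A five-second spike does not flip the capability; fifteen sustained minutes does.
s = SmoothedState(hold_seconds=300)
print(s.observe("unavailable", now=0))    # healthy
print(s.observe("unavailable", now=5))    # healthy
print(s.observe("unavailable", now=900))  # unavailable
```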
Confidence and unknown state
Unknown is a valid state. During network partitions or telemetry outages, pretending health is green because no red signal arrived is dangerous. Confidence scoring can help, but do not overcomplicate it. Operators need to trust the model quickly.
Governance
Treat dependency metadata as architecture code:
- versioned
- reviewed
- tested
- tied to deployment pipelines
Otherwise your graph becomes a historical novel.
Tradeoffs
Let’s be honest about the tradeoffs.
Benefit: clearer incident management
You get a business-oriented view of outages.
Cost: more modeling work
Teams must describe dependency semantics, not just publish metrics.
Benefit: better prioritization
Not every technical failure is a business emergency.
Cost: semantic disagreements
Teams will argue about criticality, bounded context meaning, and stale tolerances. This is normal and healthy. It is architecture doing its job.
Benefit: safer automation
Traffic shaping and feature disabling can be driven by capability truth.
Cost: risk of overfitting
If rules become too specific to historical incidents, the model becomes brittle.
Benefit: supports event-driven reality
Kafka lag, replay, and reconciliation become first-class concerns.
Cost: not everything is computable
Some capabilities still require human judgment. The graph should support operators, not replace them.
A health graph is a map, not the territory. But it is far better than walking blind.
Failure Modes
There are several ways this pattern goes wrong.
1. Graph as static documentation
If dependencies are manually maintained in a wiki or CMDB, they will be wrong. Runtime discovery and deployment-time metadata need to feed the graph continuously.
2. Only technical checks, no domain semantics
You build a beautiful engine that says nothing useful about actual business capability. This is the most common failure.
3. Binary thinking in an eventually consistent world
Everything is either healthy or unhealthy. Operators quickly learn not to trust it because reality is mostly degraded, delayed, stale, and recoverable.
4. Over-automation
The graph marks a capability as degraded and automatically disables features or reroutes traffic incorrectly, making the incident worse. Start with human-in-the-loop operations.
5. Health contagion
Poor propagation rules turn one local issue into a sea of red. That destroys signal quality and operator trust.
6. Ignoring reconciliation
If the architecture assumes immediate consistency, event-driven systems will look unhealthy too often. Reconciliation windows and compensations must be explicitly modeled.
7. Platform imperialism
A central team mandates one health model for every domain. The result is semantic nonsense. Framework standardization is good; semantic centralization is not.
When Not To Use
A health graph is not always worth it.
Do not use this pattern if:
- you have a small, simple service estate where end-to-end synthetic checks are sufficient,
- your system is mostly monolithic and failures are already obvious,
- your organization lacks basic observability discipline,
- teams do not own their services operationally,
- your architecture is changing too fast for dependency semantics to stabilize,
- you are trying to compensate for fundamentally poor service boundaries.
Also, if your real problem is a distributed monolith with chatty APIs, a health graph will only describe the pain more elegantly. It will not fix the underlying design. Better bounded contexts and fewer runtime dependencies may be the right move.
In short: don’t build a health graph to avoid making architectural decisions you should have made earlier.
Related Patterns
Several patterns sit close to this one.
Service catalog
Provides the inventory and metadata backbone for service and dependency definitions.
Dependency mapping
Static and runtime mapping of who calls whom, including event flow over Kafka.
Synthetic transactions
Validates business journeys directly and serves as a reality check against service-reported health.
Circuit breaker and bulkhead
Help localize failure and influence propagation rules.
Saga and compensation
Important in determining whether failure is recoverable and whether a capability should be degraded or unavailable.
CQRS and read model freshness
Critical when capability health depends on projection timeliness.
Reconciliation pattern
Essential for eventual consistency. It gives the enterprise a path from “temporarily wrong” back to “trustworthy.”
Strangler fig migration
The right way to introduce health graphing incrementally, capability by capability.
Summary
Service health propagation is not a monitoring trick. It is a way of making distributed systems tell the truth.
The old model asks whether individual services are alive. The better model asks whether business capabilities can still keep their promises. That requires a health graph: nodes for services, infrastructure, data products, and capabilities; edges for synchronous, asynchronous, and freshness dependencies; propagation rules grounded in domain semantics; and explicit treatment of reconciliation, lag, and compensatable failure.
This is where domain-driven design matters. Health semantics live inside bounded contexts. “Available” means different things in Payments, Fulfillment, and Customer Care. A good health graph respects those meanings while giving the enterprise a coherent operational view.
Migrate to it progressively. Start with critical journeys. Add dependency semantics. Fold in Kafka lag and read model freshness. Validate against real incidents. Only then automate.
The real payoff is not prettier dashboards. It is better judgment under pressure. During an outage, the enterprise does not need more green lights. It needs an honest map of which promises still hold, which are weakened, and which are broken.
That is what a health graph should provide. Not comfort. Clarity.
Frequently Asked Questions
What is a service mesh?
A service mesh is an infrastructure layer managing service-to-service communication. It provides mutual TLS, load balancing, circuit breaking, retries, and observability without each service implementing these capabilities. Istio and Linkerd are common implementations.
How do you document microservices architecture for governance?
Use ArchiMate Application Cooperation diagrams for the service landscape, UML Component diagrams for internal structure, UML Sequence diagrams for key flows, and UML Deployment diagrams for Kubernetes topology. All views can coexist in Sparx EA with full traceability.
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.