Most distributed systems don’t fail because we lack data. They fail because we drown in it.
That is the quiet scandal of modern observability. We instrument every service, emit every metric, capture every span, log every state transition, and then act surprised when the bill arrives like a ransom note. Worse, even after paying for this flood, we still miss the thing that mattered: the one ugly transaction that cut across four services, one Kafka topic, an old payment gateway, and a team boundary no one admits exists.
Sampling is the discipline of choosing what deserves to be seen.
That sounds humble. It is not. Sampling architecture is one of the most consequential design choices in distributed systems because it changes what operators can know, what engineers can debug, what finance will pay for, and what regulators will tolerate. Done badly, sampling becomes institutionalized blindness. Done well, it gives you a coherent picture of system behavior without setting fire to your telemetry budget.
The trick is that sampling is not merely a technical mechanism. It is a domain decision. A checkout failure is not “just another trace.” A fraud-screen timeout is not equivalent to an image resize delay. A patient record access in healthcare carries different meaning than a cache miss in product search. If you don’t model those differences explicitly, your observability architecture will optimize for volume instead of value.
This article lays out a practical architecture for observability sampling in distributed systems: where it belongs, how it fits with microservices and Kafka-based flows, how to migrate toward it without destabilizing production, and the tradeoffs you inherit the moment you stop pretending everything should be kept forever.
Context
Distributed systems changed the shape of failure.
In a monolith, debugging was often ugly but local. You could tail one log, inspect one process, and reason about one runtime. In a service-based estate, even a simple business capability like “place order” becomes a conversation among many bounded contexts: cart, pricing, inventory, payment, shipping, notification, fraud, customer profile. Some are synchronous HTTP calls. Others are asynchronous events on Kafka. A few are old systems disguised as APIs. Every hop produces telemetry. Every hop also multiplies ambiguity.
That ambiguity has created a generation of observability platforms built on three pillars: logs, metrics, and traces. Add profiles, events, and exemplars, and the stack gets richer still. But richness is not clarity.
Enterprises now face a recurring pattern:
- telemetry volumes grow faster than application traffic
- tracing adoption expands unevenly across teams
- log cardinality becomes a hidden tax
- Kafka consumers generate observability bursts during replay or lag recovery
- incident investigations still devolve into ad hoc grep sessions and tribal knowledge
The obvious answer — keep everything — is rarely viable. Storage is expensive. Indexing is more expensive. Search latency matters. Cross-region replication matters. Compliance retention matters. And in heavily regulated sectors, the cost is not merely financial. Over-collecting can become a governance problem.
So sampling emerges as a necessary architectural capability.
But let’s be precise. Sampling is not one thing. It exists at multiple layers:
- head-based sampling: decide at the start of a request whether to keep telemetry
- tail-based sampling: decide after seeing the outcome whether a trace is valuable
- log sampling: keep representative or policy-selected logs
- metric downsampling: reduce resolution or cardinality while preserving signal
- event sampling: selectively emit business or platform events
- adaptive sampling: dynamically change rates based on load, errors, tenants, or endpoints
A mature enterprise architecture rarely uses only one of these. It composes them.
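As a concrete illustration of that composition, here is a minimal Python sketch of a head decision at ingress layered with an outcome-aware tail decision. The function names, trace shape, and thresholds are assumptions for illustration, not a real collector API.

```python
import random

HEAD_RATE = 0.25             # assumed default: keep 25% of traces at ingress
LATENCY_THRESHOLD_MS = 2000  # assumed latency cutoff for "always keep"

def head_decision(rng: random.Random) -> bool:
    """Cheap, stateless decision made once, before the request runs."""
    return rng.random() < HEAD_RATE

def tail_decision(trace: dict) -> bool:
    """Outcome-aware decision made after all spans have arrived."""
    if trace["error"]:
        return True                       # always keep failures
    if trace["duration_ms"] > LATENCY_THRESHOLD_MS:
        return True                       # always keep slow traces
    return trace["head_sampled"]          # otherwise honor the head decision

rng = random.Random(42)
trace = {"error": True, "duration_ms": 120, "head_sampled": head_decision(rng)}
assert tail_decision(trace)  # an error trace survives regardless of head sampling
```

The point of the composition: the head decision bounds cost, while the tail decision rescues the rare, valuable outcomes the head decision could never have predicted.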
Problem
The core problem is simple to state and hard to solve:
How do we preserve diagnostic and business-relevant observability in a distributed system while controlling telemetry cost, latency, and operational complexity?
The problem gets harder because “valuable telemetry” is contextual.
A trace from a healthy /health endpoint is almost worthless. A trace from a failed funds-transfer spanning mobile API, identity service, anti-fraud decisioning, core banking, and notification is gold. A million normal Kafka consumer spans may tell you little. One poisoned message causing retries, backpressure, dead-letter growth, and downstream idempotency issues is the story you needed.
Yet many teams sample blindly:
- fixed percentage across all traffic
- one-size-fits-all collector configuration
- no link between business criticality and retention
- no distinction between synchronous request flows and asynchronous event workflows
- no mechanism to reconcile sampled traces with unsampled aggregate metrics
That is not architecture. That is cost control masquerading as strategy.
The deeper problem is semantic mismatch. Platforms often sample by technical attributes — route, status code, latency — while the business operates by domain outcomes — payment declined, order orphaned, inventory oversold, claim adjudication delayed, shipment exception unresolved. If the sampling policy does not understand domain significance, the system will faithfully preserve trivia and discard incidents.
Forces
Several forces push against each other here.
1. Cost versus fidelity
Full-fidelity observability is seductive. It promises certainty. In reality it produces invoices and index contention. Sampling reduces cost, but every reduction risks losing rare signals.
2. Early decision versus informed decision
Head sampling is cheap and scalable because you decide once, at ingress. But it cannot know whether the request will fail three services later. Tail sampling is smarter because it sees the outcome, but it requires buffering, coordination, and more collector state.
3. Platform standardization versus domain-specific policy
Central platform teams want one policy framework. Domain teams need different rules. A checkout service, a recommendation engine, and an HR directory should not be sampled the same way.
This is where domain-driven design matters. Observability belongs to the domain model more than many teams admit. “Critical transaction” should be a domain concept, not a vague operator sentiment.
4. Synchronous versus asynchronous flow visibility
HTTP traces have clear beginnings and endings. Kafka-based workflows do not. A business process might span multiple topics, retries, compensations, and delayed consumers. Sampling must account for event lineage, causation, and replay behavior.
5. Incident response versus long-term analytics
Incident debugging wants rich, correlated, often temporary detail. Capacity planning and SLO governance want statistically stable aggregates. One architecture must often serve both.
6. Regulatory and privacy pressure
Sampling can help reduce sensitive data exposure by collecting less. It can also make audits harder if evidence disappears. Financial services, healthcare, and public sector systems often need explicit policies around what cannot be sampled away.
Solution
The right answer in most enterprises is a multi-layered, policy-driven sampling architecture with domain-aware classification, tail sampling for high-value traces, and reconciliation through unsampled aggregate telemetry.
That sentence carries a lot of weight, so let’s unpack it.
First principle: classify traffic by business meaning
Before talking rates, define categories. This is pure domain-driven design.
Examples:
- Critical business transactions: payment authorization, funds transfer, claims submission, medication order
- Operationally sensitive flows: provisioning, identity changes, access control decisions
- Commodity requests: search suggestions, static content, cache refreshes
- Background technical traffic: health checks, heartbeats, internal synchronization
- Asynchronous domain workflows: order lifecycle events, fraud review, shipping updates
These should map to bounded contexts and ubiquitous language. If the commerce domain speaks of “checkout,” “refund,” and “chargeback,” then the observability policy should too. Don’t force everyone into a generic “tier-1 endpoint” taxonomy that nobody believes.
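One lightweight way to make those categories operational is a policy registry owned jointly by platform and domain teams. The sketch below is a plain-Python assumption of what such a registry might look like; the category names come from the list above, but the rates and the lookup helper are illustrative.

```python
# Hypothetical sampling policy registry: domain category -> keep rate.
POLICY_REGISTRY = {
    "critical_business_transaction": 1.0,   # payment authorization, funds transfer
    "operationally_sensitive": 1.0,         # identity changes, access control
    "async_domain_workflow": 0.5,           # order lifecycle, fraud review
    "commodity_request": 0.05,              # search suggestions, cache refresh
    "background_technical": 0.0,            # health checks, heartbeats
}

def sample_rate_for(category: str) -> float:
    # Unknown categories default to a conservative middle rate so a new
    # endpoint stays visible until a domain owner classifies it.
    return POLICY_REGISTRY.get(category, 0.25)
```

The deliberate default for unknown categories matters: it turns "nobody classified this yet" into a visible middle state rather than silent full retention or silent blindness.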
Second principle: decide at the edge, refine in the observability pipeline
Use a two-stage approach:
- Head sampling at ingress to prevent runaway telemetry volume.
- Tail sampling in collectors to preserve complete traces for errors, high latency, rare domain events, and policy-marked critical transactions.
This gives cost control without total blindness.
Third principle: preserve unsampled truth elsewhere
Sampling only works if you maintain reliable aggregate views outside the sampled stream:
- metrics should remain broadly unsampled or statistically safe
- domain event counts should be reconciled against business systems
- logs for audit-critical actions may require separate retention
- exemplars can link aggregate anomalies to representative traces
Sampling should not become your only source of truth. It is a lens, not the ledger.
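The exemplar idea in particular is worth making concrete. Below is a minimal sketch, under assumed bucket boundaries and data shapes, of a latency histogram that counts every request (sampled or not) while attaching a representative trace id per bucket, so an aggregate spike can be followed to one concrete trace.

```python
from bisect import bisect_left

# Assumed latency bucket boundaries in milliseconds.
BUCKETS_MS = [100, 500, 2000, float("inf")]

class HistogramWithExemplars:
    def __init__(self):
        self.counts = [0] * len(BUCKETS_MS)
        self.exemplars = [None] * len(BUCKETS_MS)  # one trace id per bucket

    def record(self, latency_ms, trace_id=None):
        i = bisect_left(BUCKETS_MS, latency_ms)
        self.counts[i] += 1                 # every request counts, sampled or not
        if trace_id is not None:
            self.exemplars[i] = trace_id    # only sampled traces can be exemplars

h = HistogramWithExemplars()
h.record(40)                   # unsampled request still contributes to the aggregate
h.record(3200, "trace-abc")    # slow, sampled request becomes the bucket's exemplar
assert h.counts[0] == 1 and h.exemplars[3] == "trace-abc"
```

This is the reconciliation principle in miniature: the counts are the unsampled truth, and the exemplars are the sampled lens pointed at it.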
Fourth principle: treat Kafka workflows as first-class observability citizens
For event-driven systems, propagate trace context in message headers, capture producer and consumer spans, and create policies for:
- retries
- dead-letter queue events
- replay traffic
- batch consumers
- compensating transactions
Without this, your architecture will sample HTTP nicely and remain blind to where the actual business complexity lives.
Architecture
A practical architecture usually has five layers:
- Instrumentation layer
- Policy classification layer
- Collector and sampling layer
- Storage tiering layer
- Reconciliation and governance layer
Logical view
1. Instrumentation layer
Use consistent instrumentation standards, ideally OpenTelemetry or an equivalent portable model. The point is not fashion. The point is decoupling your applications from your vendors.
Emit:
- spans for request and message handling
- metrics for latency, throughput, queue lag, retries, saturation
- structured logs with correlation identifiers
- domain events where business state changes matter
The key architectural move is to enrich telemetry with domain attributes:
- business.transaction_type=checkout
- business.criticality=high
- customer.tier=enterprise
- workflow.name=claims_adjudication
- event.type=order_failed
- replay=true
These aren’t decorative tags. They are the raw material for meaningful sampling decisions.
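A sketch of what that enrichment looks like in code, using the attribute keys from the article. The `Span` class here is a stand-in data holder, not the OpenTelemetry API; the helper name and its signature are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Minimal stand-in for an instrumentation span."""
    name: str
    attributes: dict = field(default_factory=dict)

def enrich_with_domain_context(span: Span, *, transaction_type: str,
                               criticality: str, replay: bool = False) -> Span:
    # Attach domain attributes so downstream sampling policy can
    # reason about business meaning, not just routes and status codes.
    span.attributes["business.transaction_type"] = transaction_type
    span.attributes["business.criticality"] = criticality
    span.attributes["replay"] = replay
    return span

span = enrich_with_domain_context(Span("POST /checkout"),
                                  transaction_type="checkout",
                                  criticality="high")
assert span.attributes["business.criticality"] == "high"
```

With a real instrumentation SDK the same idea applies: set the domain attributes at span creation time, before any sampling decision that might read them.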
2. Policy classification layer
This can sit at the API gateway, service mesh, collector, or a shared library, but the responsibilities are the same:
- identify endpoint or message type
- derive business criticality
- attach tenant or jurisdiction metadata where allowed
- identify whether the flow is user-facing, batch, replay, or background
- mark “must-keep” candidates such as security events or regulated actions
Think of this as turning technical traffic into domain-observable traffic.
A useful pattern is a sampling policy registry maintained jointly by platform engineering and domain teams. Platform owns the mechanism. Domain teams own the semantics.
3. Collector and sampling layer
This is where most architecture discussions become too simplistic. “We’ll use tail sampling” is easy to say. It is also easy to break.
A robust collector topology often includes:
- node-local or sidecar collectors for buffering and initial filtering
- regional collectors for aggregation and policy execution
- tail-sampling processors capable of seeing enough of a trace before deciding
- backpressure controls to avoid observability pipeline collapse during incidents
Typical policy examples:
- keep 100% of traces with errors
- keep 100% of critical transaction types
- keep 100% of traces over latency threshold
- keep 100% of security-sensitive actions
- keep 20% of standard checkout traces
- keep 1% of low-value commodity requests
- keep 0% of health checks except during incident mode
- reduce replay traffic to near-zero except DLQ and failure paths
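The policy list above can be expressed as an ordered rule table, first match wins. This is a sketch under assumed trace fields and predicates, not a collector configuration format.

```python
# Ordered (predicate, keep_rate) rules; evaluated top to bottom.
RULES = [
    (lambda t: t["error"], 1.0),                           # keep all errors
    (lambda t: t["critical"], 1.0),                        # keep critical transactions
    (lambda t: t["latency_ms"] > 2000, 1.0),               # keep slow traces
    (lambda t: t["route"] == "/health", 0.0),              # drop health checks
    (lambda t: t.get("replay", False), 0.001),             # near-zero for replay
    (lambda t: t["transaction_type"] == "checkout", 0.2),  # 20% of checkout
]
DEFAULT_RATE = 0.01  # low-value commodity traffic

def keep_probability(trace: dict) -> float:
    for predicate, rate in RULES:
        if predicate(trace):
            return rate
    return DEFAULT_RATE
```

Rule ordering is itself a design decision: an errored health check matches the error rule first, which is usually what you want during an incident.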
4. Storage tiering layer
Not all telemetry deserves premium storage.
Use tiering:
- hot store for recent searchable traces and logs
- warm store for lower-cost retention
- cold archive for compliance or forensic recovery
- metrics store for long-lived aggregate observability
This is where architecture meets economics. Index only what people search. Archive what they may need. Delete what has no business or operational value.
5. Reconciliation and governance layer
Here is the part many observability designs omit: sampled telemetry must be reconciled with unsampled business facts.
If your traces show 10,000 successful checkouts but the order system records 10,412, that discrepancy should be explainable. Maybe 412 were outside the trace sampling policy. Fine. Then the aggregate metrics and business event counts should reflect that. If they don’t, your observability architecture is lying by omission.
This matters most in event-driven flows, where retries, duplicates, and compensations muddy the waters.
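The checkout arithmetic above can be sketched directly: a sampled trace count, scaled by its known sampling rate, should land within tolerance of the system-of-record count. Function name, tolerance, and numbers are illustrative.

```python
def reconcile(kept_traces: int, sampling_rate: float,
              ledger_count: int, tolerance: float = 0.05) -> bool:
    """True if the sampled estimate agrees with the system of record."""
    estimated = kept_traces / sampling_rate
    drift = abs(estimated - ledger_count) / ledger_count
    return drift <= tolerance

# 2,083 kept traces at a 20% rate estimates ~10,415 checkouts,
# within 5% of the 10,412 the order system recorded.
assert reconcile(kept_traces=2083, sampling_rate=0.20, ledger_count=10412)
```

When this check fails persistently, the discrepancy is the finding: either the policy does not cover what you think it covers, or telemetry is being dropped somewhere the policy never sees.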
Sampling policy flow
Migration Strategy
Nobody should roll out observability sampling by flipping a global switch on Friday afternoon. Sampling changes what you can see. That means it changes operational risk. Migrate the same way you would migrate a core business capability: incrementally, with reconciliation.
A progressive strangler migration works well here.
Phase 1: establish baseline observability truth
Before introducing aggressive sampling, capture a short period of higher-fidelity telemetry on selected systems. This gives you a baseline for:
- traffic shape
- trace cardinality
- error distribution
- endpoint and topic value
- cost profile
- current blind spots
Don’t skip this. You need before-and-after evidence.
Phase 2: classify domains and flows
Work with domain owners to define categories. This is a workshop exercise as much as an engineering one.
Questions to ask:
- which business workflows are revenue-critical?
- which failures are legally or operationally sensitive?
- which Kafka topics represent core business events versus internal plumbing?
- which tenants, regions, or products justify different retention?
- which flows need auditability beyond normal debugging?
This is where bounded contexts matter. A payment bounded context likely gets different policy than catalog search.
Phase 3: introduce head sampling for low-value traffic
Start with the obvious noise:
- health checks
- static asset requests
- verbose framework logs
- low-value internal endpoints
- replay traffic in non-production-like operational tasks
Keep metrics broad so you can verify no major blind spots emerge.
Phase 4: add tail sampling for critical traces
Now layer in tail sampling where it earns its keep:
- errors
- high latency
- high-value business transactions
- multi-hop traces across important contexts
- unusual Kafka consumer failures
- DLQ production events
This usually requires collector topology changes and more memory. Test under load. Tail sampling systems fail in very practical ways: queues fill, partial traces arrive, decisions time out.
Phase 5: reconcile and compare
For a migration window, run comparative analysis:
- sampled traces versus full metrics
- sampled logs versus incident tickets
- sampled business workflow counts versus system of record
- Kafka producer/consumer counts versus traced event counts
If operators are missing incidents or SREs need manual overrides every week, your policy is wrong.
Phase 6: retire old “collect everything” paths
Only after confidence is established should you remove legacy pipelines, duplicate exporters, or expensive full-fidelity stores.
This is classic strangler thinking: wrap the old world, redirect traffic gradually, compare outputs, then cut over with evidence.
Enterprise Example
Consider a global retail bank modernizing its digital payments platform.
The bank has:
- mobile and web channels
- API gateway
- microservices for identity, account lookup, payment initiation, fraud scoring, limits, notification
- Kafka for event propagation and downstream settlement workflows
- an old core banking platform exposed through a service facade
- strict audit obligations for payment and access-related actions
At first, the bank attempted near-complete trace retention for all payment-related services. It worked for three months. Then traffic increased, card campaigns launched, Kafka replay jobs ran after a consumer bug, and observability costs exploded. Search performance degraded exactly when incident response needed it most.
The platform team’s first instinct was crude percentage sampling. It reduced cost and wrecked investigations. Failed payments were sometimes missing. Fraud decision traces were inconsistent. Replay traffic polluted dashboards. Everyone blamed the tooling. The tooling was not the issue. The policy was.
The redesign used domain-aware sampling.
Domain categories
- must-retain: payment submission, beneficiary changes, authentication events, fraud escalations
- high value: payment authorization, AML decisioning, settlement exceptions
- standard: successful balance lookup, transaction history retrieval
- low value: health checks, feature flag polling, static config refresh
Kafka policies
- producer spans retained for all payment domain events
- consumer traces retained fully for failed settlement and exception workflows
- replay jobs tagged with replay=true and sampled at minimal rates unless failures occurred
- DLQ publication always retained
Storage policies
- 7 days hot searchable traces for critical flows
- 30 days warm retention for standard traces
- 1 year archived audit logs for security and payment actions
- metrics retained long-term for SLO and risk reporting
Reconciliation
The bank reconciled sampled traces with:
- payment ledger counts
- fraud decision records
- settlement exception case volumes
- Kafka topic throughput and consumer lag metrics
This mattered. During one incident, a latency issue in the limits service caused cascading payment delays. Tail sampling retained the slow and failed traces. Aggregate metrics showed the full scope. Kafka lag metrics exposed downstream backlog. Because replay traffic was tagged and deprioritized, operators did not mistake a backfill for live customer impact.
That is what good architecture looks like: not maximal data, but meaningful signal.
Operational Considerations
Sampling architecture lives or dies in operations, not in slide decks.
Collector sizing and resilience
Tail sampling needs memory and time. If collectors cannot buffer enough spans to decide on a trace, you get partial visibility and false confidence. Plan for burst traffic, retries, and regional failover. If the observability pipeline collapses during an incident, sampling has failed its one important test.
Dynamic policy control
You need the ability to raise fidelity temporarily:
- during incidents
- for a specific tenant
- for a suspicious endpoint
- for a Kafka topic showing retries or lag
- during migrations or releases
But be careful. “Just turn sampling off” is often impossible at scale. Build bounded emergency modes with expiration.
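One way to make an emergency mode bounded by construction is to give the boost a TTL so it expires on its own rather than relying on someone remembering to revert it. A minimal sketch; the class name, clock injection, and fields are assumptions.

```python
import time

class EmergencyFidelity:
    """Temporarily raised sampling rate that expires automatically."""

    def __init__(self, boosted_rate: float, ttl_seconds: float,
                 clock=time.monotonic):
        self.boosted_rate = boosted_rate
        self.clock = clock
        self.expires_at = clock() + ttl_seconds

    def effective_rate(self, base_rate: float) -> float:
        if self.clock() < self.expires_at:
            return max(base_rate, self.boosted_rate)
        return base_rate  # boost expired: fall back without human action

incident_mode = EmergencyFidelity(boosted_rate=1.0, ttl_seconds=3600.0)
assert incident_mode.effective_rate(0.05) == 1.0  # full fidelity for one hour
```

The `max()` also guards against an "emergency" mode accidentally lowering fidelity for flows that were already retained at a higher rate.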
Correlation IDs and trace propagation
In microservices this is standard. In Kafka estates it is often inconsistently implemented. Without propagation in headers, asynchronous observability disintegrates into local anecdotes. Standardize this early.
Redaction and privacy
Sampling less does not absolve you from data handling duties. Sensitive fields must still be redacted before export. The best policy is to avoid emitting regulated payload content at all.
Multi-region and tenancy concerns
Large enterprises often need per-region policies due to data residency and per-tenant controls due to contractual obligations. A premium enterprise tenant may justify richer tracing than a free-tier customer. Be explicit about those economics.
SLO alignment
Sampling policies should support service-level objectives, not undermine them. If latency SLOs depend on histogram metrics and exemplars, ensure the exemplar linkage remains useful under your sampling rates.
Tradeoffs
There is no sampling architecture without tradeoffs. The only honest question is which ones you are choosing.
Head sampling
Pros
- cheap
- simple
- reduces volume early
- good for controlling runaway costs
Cons
- misses rare bad outcomes
- breaks complete trace capture for low-probability failures
- often blind to downstream effects
Tail sampling
Pros
- preserves valuable traces based on outcome
- strong for errors and latency anomalies
- better for critical business journeys
Cons
- operationally heavier
- requires buffering and coordination
- can produce partial traces if spans arrive late
- harder in very high-throughput systems
Domain-aware policy
Pros
- aligns observability with business value
- supports governance and audit thinking
- makes cost decisions explainable
Cons
- requires domain engagement
- policies drift as products evolve
- platform teams cannot own semantics alone
Event-driven observability coverage
Pros
- improves visibility into where real enterprise complexity lives
- captures retries, DLQs, and eventual consistency behavior
Cons
- trace graphs become messy
- replay traffic can distort reality
- propagation discipline is often weak
The lesson is plain: sampling is not free savings. It is selective perception. Build it accordingly.
Failure Modes
The most dangerous failure mode is not losing data. It is believing the sampled view is complete.
Common failure modes include:
1. Biased sampling
If successful fast paths are oversampled and weird edge cases are undersampled, your dashboards tell a comforting lie. Adaptive policies can make this worse if they optimize for throughput rather than significance.
2. Missing asynchronous causality
A request span is retained, but downstream Kafka consumer spans are dropped or never correlated. Engineers see the front half of a story and miss the event-driven tail where the damage happened.
3. Replay pollution
Kafka replays generate massive telemetry that looks like live processing unless tagged properly. This can trigger false alarms or hide real customer-facing degradation.
4. Partial trace retention
Collectors receive spans late or out of order and make incorrect keep/drop decisions. The resulting trace is fragmented, which is often worse than absent because it invites false conclusions.
5. Audit gaps
Teams assume observability storage can satisfy compliance review. Then a sampled-away event becomes the exact thing legal asks for. Audit trails should have separate policy and ownership.
6. Policy drift
New services, topics, and endpoints appear, but sampling rules are never updated. Architecture decays quietly. The high-value paths of today become tomorrow’s blind spots.
7. Cost inversion
Aggressive tail sampling plus verbose logging plus high-cardinality metrics can still produce runaway spend. Sampling traces alone does not fix a sloppy observability model.
When Not To Use
Sampling is not always the right answer, or not the right answer everywhere.
Do not rely on aggressive sampling when:
- transaction volumes are low and full fidelity is affordable
- systems are safety-critical and every event requires retention
- legal or audit obligations require complete records
- you are in early-stage incident discovery and still learning where failure lives
- telemetry maturity is too low to classify domain-critical flows correctly
- the estate lacks reliable trace propagation, especially across Kafka
In these situations, start with disciplined full fidelity on scoped systems, then sample selectively later. Premature sampling is like putting tinted windows on a car before you’ve learned to drive it.
Also, don’t use sampling as a substitute for observability design. If logs are unstructured, metrics are meaningless, and traces lack domain context, sampling will simply preserve a smaller volume of bad telemetry.
Related Patterns
Several adjacent patterns pair naturally with observability sampling architecture.
Event sourcing and audit trails
If business truth lives in an event log, observability can sample heavily while the event store remains authoritative. The two should be reconciled, not confused.
Outbox pattern
For Kafka-based systems, the outbox pattern improves reliability of event publication and creates a cleaner seam for tracing producer behavior and reconciling emitted business events.
Saga and compensation
Long-running distributed workflows often require compensation logic. Sampling should explicitly retain failed and compensated saga paths because those are exactly where distributed complexity reveals itself.
Bulkheads and circuit breakers
When resilience mechanisms trigger, sampling rates should often increase for affected flows. Otherwise, the system sheds load and sheds observability at the same moment.
Strangler fig migration
As discussed earlier, introducing new observability pipelines alongside old ones and comparing outputs is the safest way to migrate at enterprise scale.
Exemplars and RED/USE metrics
Metrics remain your aggregate truth. Exemplars link spikes and latency anomalies to representative traces, making sampled tracing far more useful.
Summary
Observability sampling is not a storage trick. It is an architectural commitment about what your organization chooses to notice.
The mature design is neither “trace everything” nor “sample 1% and hope.” It is a layered model: domain-aware classification, head sampling for basic cost control, tail sampling for high-value outcomes, strong Kafka and microservice trace propagation, and reconciliation against unsampled metrics and business records.
That last part matters most. In enterprise systems, sampled telemetry cannot be the only witness. It must be checked against ledgers, event stores, operational metrics, and audit logs. Otherwise you are not observing the system. You are observing your own filters.
If I had to reduce the whole architecture to one opinionated line, it would be this:
Sample by meaning, not by volume.
Do that, and your observability platform becomes a decision system. Ignore it, and it becomes an expensive attic full of boxes nobody opens until the house is already on fire.