Microservices were supposed to free us from the giant ball of mud. Then we smuggled the mud back in through YAML.
That is the uncomfortable truth in many cloud estates. Teams split systems into smaller deployable units, put APIs around bounded contexts, add Kafka for asynchronous coordination, move infrastructure to Kubernetes, and feel briefly modern. Then configuration begins to grow. Feature flags. Routing rules. Tenant mappings. Retry policies. Topic names. Schema versions. Region affinity. Data retention settings. Access scopes. Fallback endpoints. Reconciliation windows. What looked like harmless externalization slowly becomes a hidden web of dependencies. The code is decoupled; the runtime is not.
And that web is often more dangerous than direct code coupling, because it is harder to see, easier to change, and usually governed by nobody.
This is where the notion of a config dependency graph matters. Not as a fancy visualization for architecture slides, but as a practical way to understand how cloud microservices actually depend on each other through runtime settings, platform conventions, and operational policies. If service behavior changes because a topic name changes, a schema registry compatibility mode flips, a consumer group ID is shared, or a feature toggle is enabled in the wrong order, then configuration is part of your architecture. Pretending otherwise is how you get outages that no one can reproduce locally.
The core argument of this article is simple: configuration is part of the domain and part of the system design. Treat it as first-class architecture. Model it. Version it. Constrain it. Migrate it with the same care you would apply to data or APIs. Otherwise your microservices will be loosely coupled in source code and tightly coupled everywhere that hurts.
Context
Cloud-native systems made configuration both easier and more dangerous.
In the monolith era, much configuration lived in properties files and deployment scripts. It existed, but its blast radius was often limited by the fact that one deployment changed one thing. In modern distributed systems, configuration fans out across services, environments, clusters, regions, brokers, API gateways, service meshes, secrets stores, CI/CD pipelines, and observability tooling. The system’s real behavior emerges from the interaction of all these settings.
This is particularly visible in event-driven microservices. A payment service publishes a PaymentAuthorized event to Kafka. An order service consumes it. A fulfillment service waits for a transformed event from another topic. A reporting service reads a compacted topic. A reconciliation process later compares transactional state with event history and external ledgers. Each of these behaviors depends not only on code but on configuration: topic bindings, partition counts, consumer groups, dead-letter policies, backoff windows, idempotency keys, schema compatibility modes, and retention settings.
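To make the point concrete, here is a sketch of how much behavior hides in broker and consumer settings alone. The topic name and group ID are hypothetical illustrations; the config keys are standard Kafka ones.

```python
# A sketch of how much runtime semantics lives in configuration, not code.
# Topic and group names are hypothetical; the keys are standard Kafka settings.
order_consumer_config = {
    # Integration identity: two services sharing this group ID would
    # invisibly split partitions between them.
    "group.id": "order-service",
    # Replay semantics: "earliest" vs "latest" decides what a new consumer
    # sees of history.
    "auto.offset.reset": "earliest",
    # Delivery semantics: manual commits are a prerequisite for
    # at-least-once processing.
    "enable.auto.commit": False,
}

payment_topic_contract = {
    "name": "payments.payment-authorized.v1",  # hypothetical naming convention
    # Retention bounds the replay window available during any migration.
    "retention.ms": 7 * 24 * 60 * 60 * 1000,   # 7 days
    # Switching to "compact" would change the topic's meaning entirely.
    "cleanup.policy": "delete",
}
```

None of these values appear in a code review of the consuming service, yet each one changes what the system does.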
Now add cloud platforms. Kubernetes ConfigMaps and Secrets. Helm values. Environment overlays. Service mesh routing. Cloud-managed Kafka ACLs. Feature flags. Central config servers. Policy-as-code. It becomes very easy to create dependency chains no single team can fully explain.
This is not a tooling complaint. It is a design complaint.
A service should own behavior inside its bounded context. But if that behavior is materially controlled by shared configuration owned elsewhere, the real boundary is weaker than the code suggests. This is classic domain-driven design thinking: boundaries are not what the repository says; they are what the business and runtime demand. If the pricing service cannot safely change discount rules without coordinated changes in catalog config, campaign toggles, and downstream cache invalidation windows, then those contexts are still more entangled than the org chart admits.
Configuration is often where coupling survives architecture modernization.
Problem
Most organizations externalize configuration for good reasons: environment-specific values, secrets management, operational agility, and safer deployments. The trouble starts when they externalize decisions, not just settings.
A database connection string is a setting. A cross-service mapping table that determines whether an order becomes eligible for fulfillment is a decision. A timeout is a setting. A centrally managed retry policy that changes the semantics of order submission is a decision. A topic name is a setting. A shared topic carrying multiple business meanings because “it was easier” is a decision with architectural consequences.
Once business or integration semantics escape into config, teams create dependencies that are not obvious in code review. These dependencies become especially problematic when:
- one config change requires multiple services to change in lockstep
- no service owns the end-to-end meaning of a config value
- operational teams can alter domain behavior without domain review
- config is versioned independently of code and data migrations
- rollback of config is not equivalent to rollback of system behavior
- event-driven contracts depend on platform settings not visible in service repos
This produces what I would call runtime coupling through configuration. The services are independently deployable in theory. In practice, they are coordinated by a fragile graph of externalized values.
That graph has several common shapes:
- Shared constants masquerading as standards
One team defines topic names, enum mappings, or feature IDs that many services consume. The standard gives consistency, but also centralizes fragility.
- Control-plane coupling
API gateway routes, service mesh policies, broker ACLs, and central feature management become hidden upstream dependencies for service behavior.
- Semantic leakage
Domain concepts get encoded as generic config structures. Think “product-type-to-workflow-map” in a central file used by five services. Nobody owns the model, everyone depends on it.
- Temporal coupling
Changes must happen in sequence across services and environments. If not, messages are dropped, consumers reject events, or duplicate actions occur.
- Compensating configuration
Teams add config to paper over weak domain boundaries: retries for non-idempotent endpoints, translation tables for inconsistent concepts, polling intervals to reconcile asynchronous drift.
The result is a system that is operationally clever and architecturally brittle.
Forces
This problem persists because the forces are real, and many of them are reasonable.
Teams want autonomy
Externalized configuration allows quick changes without code redeployment. That is a powerful operating model. Product teams can launch features, SRE teams can tune behavior, and platform teams can standardize environments. Nobody wants to go back to rebuilding jars just to change a timeout.
Enterprises need consistency
Large organizations value standardization: central topic naming conventions, common retry policies, organization-wide security settings, regional deployment templates. This reduces chaos. But every standard creates shared dependencies. Standards are useful; they are not free.
Domains evolve unevenly
Bounded contexts do not mature at the same pace. One area still has fuzzy language, another has stable events and clear ownership. During this uneven evolution, configuration often becomes a temporary bridge. Temporary bridges in enterprises have the lifespan of Roman roads.
Event-driven systems shift coupling rather than eliminating it
Kafka decouples producers and consumers in time and deployment, but not in semantics. Topic contracts, retention windows, ordering guarantees, keying strategies, and replay behavior all matter. Much of that is configured outside the service code. Messaging infrastructure gives a new place for coupling to hide.
Operations need escape hatches
Feature flags, kill switches, traffic shaping, and dynamic policies are necessary for resilience. In production, the ability to turn things off is often the difference between an incident and a headline. But every escape hatch is also another pathway for ungoverned behavior change.
Auditors love central controls
Regulated environments often centralize access control, retention, encryption, and regional restrictions. Quite rightly. Yet central controls can quietly shape domain outcomes. A data retention rule may affect reconciliation ability. An ACL change may delay compensating actions. Governance and domain design collide more often than they should.
These forces cannot be wished away. Good architecture does not eliminate them. It puts them in their place.
Solution
The solution is not “avoid configuration.” That would be adolescent architecture. The solution is to treat configuration as a modeled dependency surface, separate operational tuning from domain semantics, and make the config dependency graph explicit.
Here is the opinionated version:
If a configuration value changes business meaning, it belongs in a bounded context, with an owner, lifecycle, validation rules, and migration strategy.
That means several concrete practices.
1. Classify configuration by semantic weight
Not all configuration is equal. Start by separating it into categories:
- Infrastructure config: connection endpoints, resource sizes, TLS settings
- Operational config: timeouts, retry counts, circuit breaker thresholds, feature kill switches
- Integration config: topic names, endpoint routes, schema references, consumer groups
- Domain config: eligibility rules, mapping tables, workflow selection, pricing parameters, reconciliation windows tied to business policy
The dangerous category is domain config disguised as integration or operational config. That is where coupling hides.
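One lightweight way to operationalize this classification is a registry that tags every known config key with its semantic weight and owner, so domain-bearing changes can be routed to domain review. This is a minimal sketch; the keys, owners, and categories are hypothetical.

```python
from enum import Enum

class SemanticWeight(Enum):
    INFRASTRUCTURE = "infrastructure"
    OPERATIONAL = "operational"
    INTEGRATION = "integration"
    DOMAIN = "domain"

# Hypothetical registry: each known key carries a weight and an owning team.
CONFIG_REGISTRY = {
    "db.connection.url":          (SemanticWeight.INFRASTRUCTURE, "platform"),
    "http.client.timeout.ms":     (SemanticWeight.OPERATIONAL, "sre"),
    "orders.topic.name":          (SemanticWeight.INTEGRATION, "order-team"),
    "fulfillment.eligibility.map": (SemanticWeight.DOMAIN, "fulfillment-team"),
}

def requires_domain_review(key: str) -> bool:
    """Domain-bearing config must pass through its owning context."""
    weight, _owner = CONFIG_REGISTRY[key]
    return weight is SemanticWeight.DOMAIN
```

The registry itself is trivial; the value is in forcing the classification conversation for every key that enters it.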
2. Model domain-bearing config inside the domain
If a value encodes business meaning, it should not live as an anonymous key in a platform repository. It should have a model, invariants, versioning, and ownership in the bounded context that understands it.
For example, “settlement grace period by payment method” is not just config. It is domain policy. It belongs to the payments context, ideally exposed through a policy model or managed aggregate, not hidden in a global YAML file.
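What "modeled with invariants" might look like, as a sketch: a typed policy object owned by the payments context that validates its own business rules on construction. The field names and the 72-hour bound are hypothetical illustrations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SettlementPolicy:
    """Domain policy owned by the payments context (hypothetical model)."""
    version: str
    grace_period_hours: dict  # payment method -> hours

    def __post_init__(self):
        # Invariant enforced at the boundary, not left to whoever edits YAML.
        for method, hours in self.grace_period_hours.items():
            if not 0 < hours <= 72:
                raise ValueError(f"grace period for {method} out of bounds: {hours}")

    def grace_period_for(self, method: str) -> int:
        return self.grace_period_hours[method]
```

A global YAML file would accept a grace period of zero or a typo'd 720 without complaint; the model refuses both at load time.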
3. Build and maintain a config dependency graph
A config dependency graph maps:
- services
- config sources
- topics and schemas
- operational controls
- ownership boundaries
- change propagation paths
This graph reveals hidden runtime coupling. It helps answer practical questions:
- What breaks if this topic retention changes?
- Which services depend on this feature flag?
- Can we migrate this schema compatibility setting incrementally?
- Which bounded context owns this mapping?
You do not need a perfect graph to start. A useful imperfect graph beats tribal memory.
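Even a naive implementation answers the questions above. This sketch models the graph as edges from config nodes to dependent services; the node naming scheme is a hypothetical convention.

```python
from collections import defaultdict

class ConfigDependencyGraph:
    """Minimal sketch: config nodes mapped to the services that depend on them."""

    def __init__(self):
        self._dependents = defaultdict(set)

    def add_dependency(self, service: str, config_node: str):
        self._dependents[config_node].add(service)

    def blast_radius(self, config_node: str) -> set:
        """Which services are affected if this config node changes?"""
        return set(self._dependents[config_node])

# Hypothetical edges, harvested from repos, deployment manifests, and interviews.
graph = ConfigDependencyGraph()
graph.add_dependency("order-service", "topic:payments.authorized")
graph.add_dependency("reporting-service", "topic:payments.authorized")
graph.add_dependency("checkout-service", "flag:same-day-delivery")
```

Populating it is the hard part; the edges live in repos, Helm charts, and people's heads. Start with the high-risk nodes and grow from there.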
4. Design for compatibility, not synchronized change
The surest sign of bad config coupling is the need for lockstep changes. Avoid “everyone flips on Tuesday at 2 PM.” Instead use compatibility windows, dual reads, dual writes, schema evolution, and progressive rollout.
This is especially critical in Kafka-based microservices:
- support old and new topic names temporarily
- accept both schema versions where feasible
- version consumers independently
- separate event type evolution from broker-level config shifts
- use reconciliation to detect drift during migration
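The "accept both schema versions" practice is usually implemented as a tolerant reader at the consumer boundary. A sketch, with entirely hypothetical field names and version semantics:

```python
def handle_payment_event(event: dict) -> dict:
    """Tolerant reader sketch: normalize v1 and v2 payloads to one internal shape
    during a migration window. Field names are hypothetical."""
    version = event.get("schema_version", 1)
    if version == 1:
        # v1 carried a single amount field; currency was optional with a default.
        return {"amount_minor": event["amount"],
                "currency": event.get("currency", "EUR")}
    if version == 2:
        # v2 renamed the amount field and made currency mandatory.
        return {"amount_minor": event["amount_minor"],
                "currency": event["currency"]}
    raise ValueError(f"unsupported schema version: {version}")
```

Once every producer emits v2 and the old retention window has expired, the v1 branch is deleted; the temporary tolerance has a planned end of life.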
5. Put reconciliation in the architecture, not just operations
Distributed systems drift. Configuration changes amplify that drift. Reconciliation is how you recover truth when asynchronous paths, retries, and policy changes produce mismatch.
Many enterprises treat reconciliation as a back-office afterthought. That is a mistake. In systems where config shapes routing, retries, and event flow, reconciliation is a core architectural capability. It provides:
- detection of missed or duplicated processing
- validation across source-of-truth systems
- safe migration visibility
- compensating action triggers
If your architecture depends on dynamic config and asynchronous messaging, you need reconciliation the way a ship needs a compass.
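At its core, reconciliation is a compare-and-report loop over two views of the same truth. A minimal sketch, assuming each view can be keyed by a business identifier:

```python
def reconcile(source_of_truth: dict, projection: dict) -> dict:
    """Compare-and-report sketch: find missing, unexpected, and mismatched
    entries between a transactional store and an event-derived projection.
    Both inputs map a business key (e.g. order ID) to a state."""
    missing = set(source_of_truth) - set(projection)      # never processed
    unexpected = set(projection) - set(source_of_truth)   # duplicated or phantom
    mismatched = {
        key for key in set(source_of_truth) & set(projection)
        if source_of_truth[key] != projection[key]         # drifted state
    }
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}
```

Real implementations add time windows, tolerance for in-flight events, and compensating-action triggers, but the comparison shape stays the same.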
Architecture
A robust architecture for coupling through configuration usually has four layers of concern.
Domain-owned policy layer
This is where business-meaningful settings live. They are modeled by the bounded context that owns them. Changes are validated against domain rules. Access is controlled by the owning team, not by generic platform admins alone.
Integration contract layer
This contains topic bindings, schemas, API contracts, routing identities, and compatibility rules. It is shared, but governed. Ownership is explicit. Versioning is deliberate.
Operational control layer
Feature flags, circuit breakers, traffic percentages, backoff tuning, kill switches. These should alter operational posture, not silently redefine domain semantics.
Platform configuration layer
Secrets, infrastructure parameters, resource policies, network controls. Essential, but not where domain logic should be smuggled.
The key point of this layering is the separation of semantics. Once policy and domain meaning are distinct from raw deployment values, teams can reason about ownership and migration with much more clarity.
Migration Strategy
Most enterprises are already tangled. The right question is not “how should we have designed this?” but “how do we get from here to there without stopping the business?”
This is where a progressive strangler migration works well.
Do not rip out central config and replace it with purity. You will only create a new failure mode. Instead, peel meaning away from shared config in slices.
Step 1: Inventory and classify
Identify all externally managed values influencing service behavior. Classify them by semantic weight. You are looking for values that:
- control business flows
- are shared by multiple bounded contexts
- require coordinated rollout
- affect message contracts or consumer behavior
- are changed during incidents
Those are your high-risk nodes.
Step 2: Map the config dependency graph
Create a living graph that shows:
- producer/consumer relationships
- config sources and override precedence
- environment-specific differences
- hidden platform dependencies
- manual change points
At this stage, the graph is as much a social map as a technical one. Who can change what matters as much as what exists.
Step 3: Establish ownership
For each high-risk config item, define:
- domain owner
- operational owner
- approval path
- change window rules
- compatibility expectations
Many migration efforts stall here because teams discover shared config with no real owner. That discovery is not a delay. It is the work.
Step 4: Encapsulate domain config behind a service or domain module
If a central config file contains business mapping logic, move that logic into the owning bounded context. Expose it through:
- a policy API
- an internal domain service
- a versioned data product
- an event stream of policy changes
This does not mean every config item needs a microservice. Sometimes a strongly typed module and managed store inside the existing service is enough. The point is ownership and invariants.
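The "strongly typed module" option deserves a sketch, since it is often the cheapest path: a versioned, validated mapping owned by the bounded context, with no remote call in the request path. All names are hypothetical.

```python
class FulfillmentPolicyModule:
    """In-process alternative to a remote policy service: a versioned,
    validated mapping owned by the fulfillment context (hypothetical)."""

    def __init__(self, version: str, workflow_by_product_type: dict):
        if not workflow_by_product_type:
            raise ValueError("policy must not be empty")
        self.version = version
        self._workflows = dict(workflow_by_product_type)

    def workflow_for(self, product_type: str) -> str:
        try:
            return self._workflows[product_type]
        except KeyError:
            # Fail loudly with the policy version, instead of silently
            # falling through to a default workflow.
            raise ValueError(
                f"no workflow for product type {product_type!r} "
                f"in policy {self.version}") from None
```

The design choice here is deliberate: an unknown product type is an error carrying the policy version, not a silent default, which is exactly the failure a raw YAML lookup would hide.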
Step 5: Introduce compatibility and dual-running
During migration, both old and new config paths may coexist. For Kafka-based systems, that often means:
- dual publishing to old and new topics
- consumers able to read both contracts
- schema translation adapters
- replay support
- reconciliation to verify equivalence
Step 6: Reconcile and cut over gradually
Use reconciliation to compare outcomes between old and new paths. Do not trust green deployments alone. In distributed systems, “deployed” is not “correct.”
Step 7: Remove central dependencies late
Only retire shared config once:
- consumers no longer depend on it
- observability confirms stable behavior
- rollback strategy is clear
- ownership is documented in the new model
Migration should feel boring. Boring is success.
Enterprise Example
Consider a global retailer modernizing order fulfillment across e-commerce, stores, and third-party logistics providers.
They had moved from a large order management monolith to around forty microservices. Kafka handled order events, inventory updates, payment state, and shipment notifications. Kubernetes ran the services. A central configuration platform managed environment settings and feature flags across regions.
On paper, this looked healthy.
In reality, one of the most critical business flows — deciding how an order should be fulfilled — depended on a shared configuration bundle maintained by a platform-adjacent integration team. That bundle included:
- channel-to-fulfillment-mode mappings
- carrier routing preferences
- country restrictions
- fallback warehouse priority
- event topic names by region
- timeout and retry settings for provider calls
Different services consumed different pieces of it:
- Order orchestration used workflow mappings
- Inventory allocation used warehouse priority
- Shipping adapter services used carrier preferences
- Reconciliation jobs used timeout windows to decide when an order was “late”
- Analytics pipelines assumed regional topic naming conventions
Every time the business changed a fulfillment rule, three things happened:
- the config bundle changed
- several services behaved differently
- some downstream consumers broke in non-obvious ways
One memorable incident involved a regional rollout for same-day delivery. The workflow mapping was updated centrally. The order service began emitting a new route type. Inventory supported it. Shipping adapters in one region did not, because a feature flag lagged behind. Reconciliation then marked thousands of orders as exceptions because its lateness window had not been updated. Nothing was technically “down,” but the business process was wounded everywhere.
The fix was not to centralize more tightly. It was to redraw ownership.
The retailer identified fulfillment policy as a domain concern belonging to the fulfillment bounded context. They created a domain-owned policy service with a versioned model for routing rules and constraints. Kafka event contracts were separately versioned. Topic naming conventions stayed centralized, but business mappings moved out. Reconciliation rules were explicitly tied to policy version, not inferred from generic operational config.
Migration took six months. They ran old and new policy paths in parallel for a subset of countries, dual-tagged events with policy versions, and used reconciliation to compare allocation and shipment outcomes. Several hidden assumptions surfaced:
- analytics jobs hardcoded old route categories
- one logistics provider adapter relied on default retries to trigger a manual escalation
- retention settings on an intermediate topic were too short for effective replay during cutover
These are exactly the sort of truths that only emerge when you map the config dependency graph and force ownership conversations.
The result was not perfect decoupling. It was better: fewer shared semantic dependencies, clearer governance, and much safer regional rollout.
Operational Considerations
Once you admit that configuration is architecture, operational practices have to grow up too.
Version configuration explicitly
Not just git commit hashes. Use meaningful semantic versions for domain-bearing config and contract-affecting settings. Services should log and emit the config version influencing their decisions.
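In practice this means every decision record carries the config version that produced it. A sketch, with hypothetical version identifiers and field names:

```python
import logging

logger = logging.getLogger("fulfillment")

def decide_route(order_id: str, policy_version: str, route: str) -> dict:
    """Sketch: stamp each decision with the policy version that produced it,
    so incidents correlate with config changes, not just deployments."""
    decision = {
        "order_id": order_id,
        "route": route,
        # Hypothetical semantic version of the domain policy in effect.
        "policy_version": policy_version,
    }
    logger.info("route decided for %s under %s", order_id, policy_version)
    return decision
```

When reconciliation later flags an order as an exception, the stamped version answers "which policy was live when this decision was made" without forensic archaeology.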
Observe config as part of runtime state
Dashboards should answer:
- which policy version is active where
- which consumers still use old topic bindings
- which feature flags are changing request behavior
- where config drift exists across regions or clusters
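Drift detection across regions can start as a simple comparison of effective config snapshots. A sketch, assuming each region's config can be exported as a flat key-value map:

```python
def config_drift(configs_by_region: dict) -> dict:
    """Report keys whose effective values differ across regions.
    Input shape (hypothetical): region -> {config key: value}."""
    all_keys = set().union(*configs_by_region.values())
    drift = {}
    for key in sorted(all_keys):
        values = {region: cfg.get(key) for region, cfg in configs_by_region.items()}
        # A key present with one value everywhere is consistent; anything else
        # (differing values, or missing in some region) is drift.
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift
```

Some drift is intentional (regional restrictions), so the output needs an allowlist in practice; the point is that drift should be a reviewed fact, not a surprise.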
Validate before rollout
Use schema validation, policy validation, and dependency impact analysis. A config linter that only checks syntax is not enough. You want semantic validation:
- is every workflow route supported by downstream consumers?
- does retention support replay needs?
- will changing partition count affect ordering assumptions?
- does a reconciliation rule still align with settlement policy?
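The first of those checks is simple enough to sketch: a semantic lint that refuses a policy change if any registered consumer does not support a route it introduces. Names and data shapes are hypothetical.

```python
def validate_routes(routes_in_policy: set, routes_supported: dict) -> list:
    """Semantic lint sketch: a route may only ship if every registered
    consumer declares support for it.
    routes_supported (hypothetical): consumer name -> set of supported routes."""
    problems = []
    for route in sorted(routes_in_policy):
        unsupported = [consumer for consumer, supported in routes_supported.items()
                       if route not in supported]
        if unsupported:
            problems.append(
                f"route {route!r} not supported by: {', '.join(sorted(unsupported))}")
    return problems
```

Run as a pipeline gate, this check would have caught the lagging regional feature flag in the retailer example before rollout rather than after.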
Audit change paths
In enterprise systems, outages often start with “just a config change.” Fine. Then make those changes observable and attributable. Record who changed what, why, approved by whom, and what graph nodes were affected.
Support safe rollback, but distrust it
Rollback sounds comforting. In distributed systems it lies. A rolled-back config may not reverse already emitted events, delayed retries, or customer-visible side effects. This is why reconciliation and compensating actions matter more than rollback theater.
Tradeoffs
There is no free lunch here.
More modeling means more work
Treating domain-bearing config as first-class design introduces more explicit ownership, versioning, and validation. Some teams will complain that this slows them down. They are partly right. It slows down careless change. That is a feature.
Central platforms lose some convenience
Platform teams often prefer one place to manage everything. Enterprises love a single pane of glass. But a single pane of glass can become a single pane of accidental architecture. Decentralizing semantic ownership creates healthier boundaries, at the cost of some uniformity.
More compatibility windows mean more temporary complexity
Dual reads, dual writes, schema adapters, and reconciliation add moving parts. Migration architectures are untidy. Still better than synchronized outages.
Not all config deserves ceremony
You can overcorrect. If every timeout requires a domain review board, you have built bureaucracy, not architecture. The discipline is in distinguishing business meaning from operational tuning.
Failure Modes
These systems fail in patterns. Learn the patterns.
Shared config becomes a hidden monolith
A central repository starts as convenience and ends as the real orchestrator of the system. Services become plugins for a giant YAML brain.
Feature flags become business logic
Flags intended for release control begin determining permanent workflow semantics. Nobody knows the true state machine anymore.
Kafka indirection hides semantic mismatch
Teams assume topics decouple them, but topic config, schema compatibility, keying, and retention bind them together in ways only production reveals.
Reconciliation is added too late
Without reconciliation, drift accumulates silently. The first sign is often a finance or customer service escalation, not a technical alarm.
Ownership is split by tool, not domain
Platform owns config store, integration owns topics, app teams own services, operations own flags, and nobody owns meaning. This is the most common enterprise failure mode of all.
When Not To Use
Do not over-engineer this approach everywhere.
You probably do not need a formal config dependency graph program if:
- your system is small, with a handful of services and one team
- config changes are infrequent and low semantic impact
- services do not share domain-bearing configuration
- eventing is limited and replay/reconciliation are not core concerns
- a modular monolith would serve the domain better
In fact, this is worth saying plainly: if your “microservices” mostly coordinate through shared configuration and synchronized change, you may not have earned microservices yet. A modular monolith with explicit domain modules might be cheaper, safer, and easier to evolve.
Also avoid pushing all domain policy into a remote policy service simply because central config is bad. Remote indirection is not architecture maturity by itself. If every request now depends on a policy lookup, you may trade one coupling problem for latency and resilience issues. Sometimes a replicated, versioned policy dataset inside a bounded context is the better choice.
Related Patterns
Several architectural patterns intersect here.
- Bounded Contexts: the core DDD tool for assigning ownership of meaning.
- Strangler Fig Pattern: ideal for progressively extracting semantic config from shared stores.
- Anti-Corruption Layer: useful when translating legacy config semantics into cleaner domain models.
- Event Versioning: essential when config and contracts evolve independently.
- Outbox Pattern: helps ensure event publication remains consistent during migration.
- Reconciliation / Compare-and-Repair: critical in asynchronous systems with eventual consistency.
- Control Plane vs Data Plane separation: useful language for distinguishing config governance from runtime business flow.
- Platform as Product: central config services should be products with explicit contracts, not dumping grounds.
Summary
Configuration is not background noise. In cloud microservices, it is often where the real coupling lives.
The config dependency graph gives architects a way to see that hidden structure: how services, topics, policies, feature flags, schemas, and platform settings shape one another at runtime. Once visible, you can make better decisions about ownership, migration, compatibility, and reconciliation.
The most important move is conceptual. Stop treating all configuration as inert deployment data. Some of it carries domain semantics. When it does, it must be modeled, owned, versioned, and migrated like any other important part of the system.
That means:
- separating operational config from business policy
- assigning semantic ownership to bounded contexts
- avoiding lockstep changes through compatibility design
- using progressive strangler migration to unwind shared config
- building reconciliation into event-driven architectures
- accepting the tradeoff that safer systems require more deliberate modeling
The payoff is not abstract elegance. It is fewer invisible dependencies, fewer coordinated release rituals, and fewer incidents caused by changes that “weren’t code.”
Microservices do not become decoupled because we split the repository. They become decoupled when meaning has a home, dependencies are explicit, and runtime behavior is governed by design rather than by accident.
And if you remember only one line, make it this one:
Every shared configuration key is a tiny architecture decision waiting to become an outage.
Frequently Asked Questions
What is a service mesh?
A service mesh is an infrastructure layer managing service-to-service communication. It provides mutual TLS, load balancing, circuit breaking, retries, and observability without each service implementing these capabilities. Istio and Linkerd are common implementations.
How do you document microservices architecture for governance?
Use ArchiMate Application Cooperation diagrams for the service landscape, UML Component diagrams for internal structure, UML Sequence diagrams for key flows, and UML Deployment diagrams for Kubernetes topology. All views can coexist in Sparx EA with full traceability.
What is the difference between choreography and orchestration in microservices?
Choreography has services react to events independently — no central coordinator. Orchestration uses a central workflow engine that calls services in sequence. Choreography scales better but is harder to debug; orchestration is easier to reason about but creates a central coupling point.