Real-Time Analytics Is Often Micro-Batch

Everyone says they want real-time analytics.

What they usually mean is something else.

They want the dashboard to move quickly enough that nobody notices the machinery behind it. They want fraud alerts before the money leaves the account, operations metrics before the customer phones support, inventory visibility before a buyer oversells stock that does not exist. They are not buying “real time” as a philosophical category. They are buying confidence within a business tolerance window.

And that is where architecture gets interesting.

Because much of what enterprises proudly call real-time analytics is, in practice, disciplined micro-batching wrapped in fast pipes, good product design, and selective honesty. The numbers arrive every few seconds, sometimes every few hundred milliseconds, and that is enough. The illusion is useful. Better than useful, often optimal. It keeps systems cheaper, simpler, and more recoverable than the fantasy of processing every event individually the instant it appears.

This is not a criticism. It is a recognition of how serious systems survive.

The mistake is not using micro-batches. The mistake is pretending domain semantics do not matter and that “low latency” alone defines success. In enterprise architecture, words like real-time, near real-time, streaming, event-driven, and micro-batch get thrown around as if they were interchangeable. They are not. They imply different operational models, different failure modes, and different costs. More importantly, they imply different promises to the business.

A claims system that updates reserve exposure every fifteen seconds is not the same as a high-frequency trading engine. A supply chain control tower that reconciles inventory every minute is not the same as an industrial safety shutdown. Once you stop flattening those distinctions, architecture becomes less ideological and more useful.

This article makes a blunt case: most enterprise analytics that claim to be real-time are better understood as latency-managed micro-batch systems with selective event-driven behavior. We will look at why that happens, what forces push organizations there, how to design it with Kafka and microservices, how domain-driven design changes the shape of the solution, how to migrate from batch safely, and where this pattern breaks down.

Context

Enterprises usually begin with batch analytics because batch is easy to reason about.

Operational systems write transactions to relational databases. Overnight ETL jobs move that data into a warehouse. Reports are generated on a schedule. Finance likes the consistency. Audit likes the traceability. Operations tolerate the delay because there is no better option. That model works for a long time, right up until the business starts asking questions that expire before the nightly load completes.

That demand appears in very ordinary ways. Customer service wants order status that reflects today’s fulfillment exceptions, not last night’s warehouse snapshot. Risk teams want fraud signals before settlement. Merchandising wants promotion performance while the campaign is still running. Plant operations want anomaly detection before the line fails.

So the organization reaches for “streaming.”

This is where confusion starts. Some teams mean event-by-event processing with sub-second latency. Others mean data lands every minute. Some want dashboards to refresh continuously. Others want model scoring on recent windows. Vendors encourage the confusion because “real-time” sells better than “a bounded-latency pipeline using tumbling windows and periodic materialization.”

But business domains are not generic data flows. They have clocks, commitments, and tolerances.

In domain-driven design terms, latency is not a technical property alone. It is part of the domain contract. “Inventory available to promise within 30 seconds” is a business promise. “Card-not-present fraud decision in under 200 milliseconds” is a different promise. “Executive margin dashboard updated every five minutes” is yet another. If you collapse these into one architecture doctrine, you will either overspend dramatically or underdeliver where it matters.
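To make that concrete, latency budgets can be written down as explicit contracts rather than left as folklore. The sketch below is illustrative: the capability names and numbers come from the promises above, not from any real system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatencyContract:
    """A freshness promise attached to a business capability."""
    capability: str
    budget_seconds: float  # maximum tolerated staleness

# Illustrative contracts -- the numbers come from the business, not the stack.
CONTRACTS = {
    "inventory_available_to_promise": LatencyContract("inventory_available_to_promise", 30.0),
    "cnp_fraud_decision": LatencyContract("cnp_fraud_decision", 0.2),
    "executive_margin_dashboard": LatencyContract("executive_margin_dashboard", 300.0),
}

def within_budget(capability: str, observed_staleness_s: float) -> bool:
    """Check an observed end-to-end staleness against the domain contract."""
    return observed_staleness_s <= CONTRACTS[capability].budget_seconds
```

Once budgets are explicit like this, overspending (sub-second pipelines for a five-minute dashboard) and underdelivering (30-second batches for fraud decisions) both become visible as contract violations rather than matters of taste.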

That is why real architecture starts with semantics.

Problem

The problem is not that enterprises lack streaming tools. The problem is that they often adopt a streaming stack without clarifying what has to be true, how fresh it has to be, and what happens when the truth changes later.

Three mistakes show up repeatedly.

First, teams mistake ingestion speed for analytical truth. Kafka can move events quickly. That does not mean the metric should be updated immediately. A sales order created event is not the same as recognized revenue. A shipment scanned event is not inventory truth if returns, cancellations, and adjustments follow. Fast arrival is not stable meaning.

Second, teams ignore reconciliation. Operational systems are messy. Events arrive late, out of order, duplicated, or corrected. Source systems can fail to emit a change. Master data can be revised after the fact. If your architecture cannot reprocess history, compare against authoritative records, and issue corrections, your “real-time analytics” will drift into fiction.

Third, teams build generic pipelines disconnected from bounded contexts. They publish raw technical events and hope downstream analytics can infer business semantics later. This is architecture by sedimentary rock. Layer upon layer of accidental complexity. No one knows which event means “order committed,” which one means “payment accepted,” and which one is just a user interface save operation.

So the organization gets a dashboard that is fast, expensive, and strangely untrustworthy. The business learns a sad habit: use the streaming dashboard for trends, and wait for the overnight report for the real numbers.

That split-brain outcome is common. It is also avoidable.

Forces

Several forces pull enterprise analytics toward micro-batch designs even when the rhetoric says streaming.

1. Business freshness is usually windowed, not instantaneous

Most business decisions tolerate small delays. A logistics manager does not need every parcel scan propagated in 20 milliseconds. They need the route exception heatmap before the next operational intervention point. That often means seconds or minutes, not milliseconds.

2. Domain events are not uniformly meaningful

Not every state change deserves immediate analytics processing. In many domains, only certain milestones matter: order placed, payment captured, shipment dispatched, claim approved, reading validated. Everything else is noise or intermediate workflow. Domain-driven design helps identify those milestones. Once you do, micro-batching those milestones often delivers the business outcome with much lower cost.

3. Data quality degrades under speed pressure

The faster teams push data through, the more tempted they are to skip validation, enrichment, deduplication, and late-arrival handling. Yet these are exactly the controls that make analytics credible. Micro-batch windows create a practical place to apply these controls without pretending every event is final on first sight.

4. Storage and serving layers like materialization

Analytics consumers rarely query event streams directly. They query materialized views, OLAP stores, serving indexes, feature stores, or cache layers. These stores are often updated in small batches for compression, indexing efficiency, and predictable write patterns. Again: the user perceives something close enough to real time; the system runs on disciplined periodic materialization.

5. Recovery matters more than elegance

The most beautiful low-latency topology in the world is worthless if replay takes days, backpressure melts consumers, or a schema change silently corrupts metrics. Enterprises live in the long tail of operational pain. Micro-batch checkpoints, idempotent writes, replayable windows, and deterministic recomputation are not glamorous. They are what keep the lights on.

6. Cost curves are merciless

True event-by-event processing at scale can be expensive in compute, storage IOPS, operational staffing, and engineering complexity. Windowing events for 1-second, 5-second, or 30-second batches often cuts that cost dramatically while preserving business value.

There is a broader lesson here: architecture is the art of paying for what matters and refusing to pay for theatre.

Solution

The practical solution is a layered architecture that treats “real-time” as a set of latency classes, not a single promise.

At the center is a simple idea: use event streaming for movement and decoupling, use micro-batch windowing for most analytical transformations, and reserve true synchronous or event-by-event processing for narrow business moments that genuinely require it.

This usually leads to three categories of flow:

  1. Operational decisioning flow for actions that must happen immediately, such as fraud scoring or alert triggering.
  2. Analytical refresh flow for dashboards, aggregates, and trend computations updated every few seconds or minutes.
  3. Reconciliation flow for correction, replay, and alignment with authoritative sources.
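The routing decision behind these three categories can be sketched as a small lookup. The event type names here are hypothetical; in practice the routing table would be configuration owned by each bounded context.

```python
from enum import Enum

class Flow(Enum):
    OPERATIONAL = "operational"        # act immediately, event by event
    ANALYTICAL = "analytical"          # refresh in short micro-batch windows
    RECONCILIATION = "reconciliation"  # correct, replay, align with sources

# Illustrative routing table; real routing lives in configuration
# owned by the bounded context that publishes each event.
ROUTING = {
    "FraudSignalRaised": Flow.OPERATIONAL,
    "OrderPlaced": Flow.ANALYTICAL,
    "ShipmentDispatched": Flow.ANALYTICAL,
    "InventoryAdjusted": Flow.RECONCILIATION,
}

def route(event_type: str) -> Flow:
    # Unknown events default to the analytical refresh path rather than
    # triggering immediate action on unclassified semantics.
    return ROUTING.get(event_type, Flow.ANALYTICAL)
```

The default matters: an event nobody has classified should land in a window for inspection, not fire an operational action.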

That third flow is the one many teams forget. It is the reason mature systems remain trustworthy.

Domain-driven design is essential here. Events should emerge from bounded contexts with explicit business meaning. “OrderPlaced” from Order Management is useful. “RowUpdatedInOrdersTable” is not. “ClaimReserved” from Claims is meaningful. “StatusChangedTo3” is a cry for help.

Once events carry domain semantics, the analytics architecture can choose appropriate latency per event type and use case. Some events feed immediate detectors. Others land in short windows. Others are enriched and materialized every minute. This is not inconsistency. It is design.

Architecture

A typical enterprise design uses Kafka as the event backbone, microservices aligned to bounded contexts, and a stream processing or data processing layer that computes short-window aggregations and materialized views.

The flow looks roughly like this:

Diagram 1
Architecture

This architecture works because it separates concerns.

Kafka handles durable event transport, partitioned ordering within a key, decoupled consumption, and replay. Microservices publish domain events when business facts occur. A processing layer computes rolling counts, short tumbling windows, session aggregations, enrichment joins, and feature calculations. Hot analytical stores serve low-latency queries to dashboards and APIs. Meanwhile, the raw event lake or lakehouse provides durable history for reprocessing and reconciliation.

The key point is that the dashboard does not usually read “the stream.” It reads materialized state produced from the stream in short, controlled intervals.

This is the latency illusion in a good sense. The user sees fluid motion. Underneath, the system updates in windows and checkpoints. Like a film reel, continuity emerges from frames.
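The frame-by-frame idea can be sketched as a tumbling-window aggregator in plain Python. No streaming engine here; the `(event_time_seconds, amount)` tuples are an assumed event shape for illustration.

```python
from collections import defaultdict

def tumbling_window_totals(events, window_seconds=5):
    """Group events into fixed, non-overlapping windows by event time
    and sum their amounts -- the 'frames' a dashboard actually reads."""
    totals = defaultdict(float)
    for ts, amount in events:  # (event_time_seconds, amount)
        window_start = (ts // window_seconds) * window_seconds
        totals[window_start] += amount
    return dict(totals)

# Three events: two land in the [0, 5) window, one in [5, 10).
events = [(1, 10.0), (4, 5.0), (7, 2.5)]
# tumbling_window_totals(events) -> {0: 15.0, 5: 2.5}
```

The dashboard reads the completed `{window_start: total}` rows, not the individual events, which is why it can stay fast while the pipeline stays cheap.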

Domain semantics and bounded contexts

This architecture goes wrong if event topics are organized around technical tables instead of business language.

For example, in retail you might define bounded contexts such as Ordering, Fulfillment, Pricing, Inventory, and Customer Care. Each context emits a small vocabulary of business events. Analytics should consume those events as facts scoped to their originating context, not as globally meaningful truth without interpretation.

An “OrderPlaced” event is a commitment in the Ordering context. It is not yet shipped demand, not yet revenue, not yet fulfilled inventory reduction. If downstream analytics confuses those meanings, metrics diverge. This is why event contracts need ubiquitous language, schema governance, and clear definitions of effective time versus processing time.

Event time, processing time, and business time

One of the oldest sources of pain in streaming analytics is pretending there is only one clock.

There are at least three:

  • Event time: when the business event occurred.
  • Processing time: when the system processed it.
  • Business effective time: when it should count for policy or reporting.

A claim adjustment entered today but effective from yesterday may need to amend yesterday’s reserve exposure. A late store transaction may belong to a previous trading interval. If your architecture cannot model these distinctions, “real-time” metrics become operationally attractive and financially dangerous.
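A minimal sketch of the distinction: facts are assigned to windows by business effective time, and processing time deliberately plays no part in the assignment. Hourly windows are an assumed reporting granularity.

```python
from datetime import datetime

def reporting_window(effective_time: datetime) -> datetime:
    """Truncate a business effective time to its hourly reporting window."""
    return effective_time.replace(minute=0, second=0, microsecond=0)

def apply_fact(windows: dict, effective_time: datetime, amount: float) -> dict:
    """Amend the window a fact belongs to by effective time -- an adjustment
    entered today but effective yesterday reopens yesterday's window."""
    key = reporting_window(effective_time)
    windows[key] = windows.get(key, 0.0) + amount
    return windows
```

Note what is absent: there is no processing-time parameter. Whenever the correction arrives, it amends the window it is effective for, which is exactly what a metric keyed on processing time cannot do.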

Materialization pattern

Most consumers want query speed, not event purity. So the architecture typically materializes:

  • KPI summaries by short windows
  • dimensional rollups
  • anomaly scores
  • current-state snapshots
  • operational control tower views

This can be done continuously, but under the hood many engines group updates into mini-batches for throughput and consistency. That is not a compromise. It is how systems breathe.
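One way to sketch that breathing rhythm, with an in-memory dictionary standing in for a real serving store: updates are buffered, then flushed as a single idempotent mini-batch keyed by KPI and window.

```python
class MaterializedView:
    """Toy serving store: micro-batched, idempotent upserts
    keyed by (kpi, window_start)."""

    def __init__(self):
        self.rows = {}     # (kpi, window_start) -> current value
        self.pending = {}  # buffered updates for the next flush

    def stage(self, kpi: str, window_start: int, value: float) -> None:
        # Last writer wins within a batch; re-staging the same key is safe.
        self.pending[(kpi, window_start)] = value

    def flush(self) -> int:
        """Apply the buffered mini-batch in one predictable write burst.
        Returns the number of rows written."""
        self.rows.update(self.pending)
        applied = len(self.pending)
        self.pending.clear()
        return applied
```

Because the upsert is keyed, replaying a batch after a crash rewrites the same rows to the same values instead of double-counting them.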

Migration Strategy

No sensible enterprise replaces batch analytics in one dramatic cutover. That is how you end up with two failed systems instead of one successful old one.

The right approach is progressive strangler migration.

Start by identifying one business capability where latency genuinely matters and semantics are clear. Build an event stream around that bounded context. Compute a small set of analytics views in near-real time. Run them alongside the legacy batch reports. Compare, reconcile, and learn where the source data lies to you.

Then expand.

A typical migration path looks like this:

Diagram 2
Migration Strategy

The important word here is selected. Some batch pipelines should remain batch. Monthly financial close does not become better because someone inserted Kafka into the sentence.

Strangler principles applied to analytics

The strangler pattern is usually described for applications, but it applies beautifully to analytical architecture.

  • Wrap legacy data sources with CDC or domain event publication.
  • Introduce a new event-driven path for one bounded context.
  • Materialize a new view for one operational or analytical need.
  • Reconcile against the old warehouse outputs.
  • Shift consumers gradually.
  • Retire only what has become redundant.

This avoids the common migration fantasy in which the streaming platform somehow becomes the source of truth overnight. It should not. During migration, authoritative truth often remains in operational stores and existing curated warehouse models. The streaming path earns trust by matching, then surpassing, the old path.

Reconciliation is not optional

Reconciliation deserves more respect than it gets. It is not a temporary migration crutch. It is a permanent architectural capability.

You need jobs that:

  • compare materialized analytics to source-of-record snapshots
  • detect missing or duplicated events
  • backfill late-arriving facts
  • recompute windows after schema or logic changes
  • issue compensating corrections downstream
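The first and last of those jobs can be sketched together in a few lines. The dictionaries stand in for materialized analytics and source-of-record snapshots, keyed by some business key such as SKU or account.

```python
def reconcile(materialized: dict, source_of_record: dict, tolerance: float = 0.0) -> dict:
    """Compare materialized analytics against authoritative snapshots and
    return the compensating deltas needed to bring them back into line."""
    corrections = {}
    for key, truth in source_of_record.items():
        observed = materialized.get(key, 0.0)  # missing key counts as zero
        if abs(observed - truth) > tolerance:
            corrections[key] = truth - observed
    return corrections
```

A real implementation would also track keys present in the materialized view but absent from the source, and emit the corrections as events so downstream consumers amend themselves the same way they consume everything else.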

Without reconciliation, every event-driven analytics platform eventually accumulates invisible debt. Small mismatches become accepted folklore. Then an audit arrives, or a regulator, or a CFO.

And folklore is not a control framework.

Enterprise Example

Consider a global retailer trying to build a real-time sales and inventory control tower.

They operate e-commerce sites, thousands of stores, multiple fulfillment centers, and regional ERP systems inherited through acquisition. The existing analytics stack is classic enterprise architecture: nightly ETL into a warehouse, intraday reports refreshed every few hours, and lots of spreadsheet heroics in between.

The business asks for:

  • live sales performance during promotions
  • stockout risk alerts
  • inventory available-to-promise visibility
  • fulfillment exception monitoring
  • regional margin snapshots

The first instinct is to stream everything. Point-of-sale events, web orders, inventory movements, transfer orders, returns, supplier ASN messages, warehouse scans, pricing updates, customer service actions. Throw it all into Kafka, process instantly, and let the dashboards sing.

That instinct is wrong.

Why? Because the semantics differ.

A store sale is immediate enough for promotion monitoring. Inventory available-to-promise is not just a subtraction of sales from stock. It depends on reservations, transfers, returns in transit, unposted adjustments, and timing rules that vary by channel. Margin is even trickier because cost updates, markdown accruals, and promotions settle asynchronously.

The retailer eventually lands on a sensible architecture:

  • Kafka topics by bounded context: Sales, Inventory, Fulfillment, Pricing
  • event contracts with explicit business milestones
  • 5-second to 30-second micro-batch processing for sales KPIs and operational dashboards
  • immediate event-by-event processing only for specific stockout and fraud alerts
  • periodic reconciliation against ERP inventory positions and financial systems
  • materialized views in a hot analytical store for dashboards
  • lakehouse retention for replay and backfill

It is fast where speed matters and cautious where truth arrives slowly.

This architecture also handles acquisition reality. Newly acquired regions often cannot emit pristine domain events on day one. CDC from legacy databases can bridge the gap, with semantics normalized gradually into the enterprise event model. That is domain-driven migration in the real world: respect the boundaries you have, move toward the boundaries you want.

Operational Considerations

This pattern looks straightforward in diagrams. It becomes serious work in operations.

Schema governance

If events are contracts, treat them that way. Version schemas. Define compatibility rules. Document semantics, not just fields. “netAmount” without tax treatment, currency basis, or discount rules is a bug wearing a name tag.

Idempotency and duplicates

At-least-once delivery is common. Your consumers must handle duplicates safely. Materialized views need idempotent upserts or deduplication keys. Otherwise, retries turn into phantom sales and double-counted metrics.
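A minimal sketch of that discipline: deduplicate by event id, upsert by business key, so redelivery is harmless. The `(event_id, order_id, amount)` shape is an assumption for illustration; real systems typically persist the seen-id set alongside the view in the same transaction.

```python
def upsert_batch(view: dict, events, seen_ids: set) -> dict:
    """Apply at-least-once events safely: skip duplicates by event id,
    accumulate by business key, so retries never double-count."""
    for event_id, order_id, amount in events:
        if event_id in seen_ids:
            continue  # redelivered event -- already applied
        seen_ids.add(event_id)
        view[order_id] = view.get(order_id, 0.0) + amount
    return view
```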

Ordering assumptions

Kafka preserves order within a partition, not globally. If your metric assumes strict sequencing across entities, challenge that assumption early. Many enterprise bugs begin with hidden dependency on total order.

Late data and watermarking

Late arrivals are normal. Design for them. Windowing strategies should define how long data can arrive late, when a window is considered complete, and how corrections are emitted. Different use cases need different tolerances.
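The completeness decision can be sketched as a small state function over a watermark and an allowed-lateness budget. All values are in seconds, and the three states are illustrative labels, not any particular engine's terminology.

```python
def window_state(window_end_s: float, watermark_s: float,
                 allowed_lateness_s: float) -> str:
    """Classify a window relative to the watermark: still open,
    closed but accepting late corrections, or final."""
    if watermark_s < window_end_s:
        return "open"                # events still expected in order
    if watermark_s < window_end_s + allowed_lateness_s:
        return "late-corrections"    # emit amendments, not fresh results
    return "final"                   # late data now goes to reconciliation
```

The important design point is the middle state: a window that is "done" for the dashboard can still emit corrections for a bounded period, after which late facts fall through to the reconciliation flow instead.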

Backpressure and lag

A dashboard that claims to be real-time while consumers are twenty minutes behind is not real-time. Measure end-to-end freshness explicitly: event occurrence to dashboard visibility. Publish this as an operational SLO.
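A sketch of measuring that SLO at a quantile rather than the mean, so one lagging consumer cannot hide behind fast averages. Timestamps in seconds are an assumed unit.

```python
def freshness_seconds(event_occurred_s: float, visible_on_dashboard_s: float) -> float:
    """End-to-end freshness: event occurrence to dashboard visibility."""
    return visible_on_dashboard_s - event_occurred_s

def freshness_slo_met(samples, budget_s: float, quantile: float = 0.95) -> bool:
    """Check the freshness SLO at a quantile of observed samples."""
    if not samples:
        return True  # no traffic observed; nothing to violate
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx] <= budget_s
```

Publishing this number per KPI, not per pipeline, is what turns "real-time" from a slogan into an operational commitment.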

Replayability

Can you replay one topic partition? One day? One business key? One corrected transformation version? If replay is a heroic exercise, the architecture is incomplete.

Observability by business flow

Technical metrics are necessary but insufficient. You also need business observability:

  • events produced vs expected by domain
  • ratio of reconciled corrections
  • freshness by KPI
  • divergence from source-of-record
  • percentage of late events
  • dropped or quarantined records by cause

When streaming systems fail, they often fail semantically before they fail technically.

Tradeoffs

This architecture is not free. It is a compromise, though a very useful one.

The big benefit is pragmatic latency. You give the business a living picture of operations without demanding the full cost and fragility of universal event-by-event processing. You gain replay, decoupling, and better support for modern use cases like control towers, operational analytics, and feature generation.

But you also introduce complexity:

  • more moving parts than batch
  • more semantic governance needed than raw CDC
  • more reconciliation work than teams expect
  • dual-truth periods during migration
  • pressure to misuse streams as transaction systems

There is also a political tradeoff. Once the business sees fast dashboards, they may assume every number should update instantly. Architects need to push back. Not everything deserves the same latency budget. This is where good architecture becomes a form of financial discipline.

A useful rule is this: optimize for decision timing, not for technological excitement.

Failure Modes

The most common failure modes are painfully consistent.

1. Raw event swamp

Teams publish everything without semantic curation. Kafka becomes a fast junk drawer. Downstream consumers reverse-engineer meaning from unstable payloads. Progress stalls.

2. No authoritative reconciliation

The streaming path becomes “good enough” until material discrepancies emerge. Nobody can explain them. Confidence collapses.

3. Overspecified latency targets

The organization mandates sub-second updates for all metrics. Costs spike, pipelines become brittle, and little business value is gained.

4. Windowing without domain understanding

Aggregations are built on processing time rather than event or effective time where the domain requires it. Metrics look right in demos and wrong in month-end review.

5. Shared enterprise topic fantasies

A central team creates giant canonical topics meant for everyone. They become abstract, slow to evolve, and detached from bounded contexts. Local teams route around them.

6. Stream processor as hidden monolith

One giant processing application handles all transformations. It couples domains, deployment cycles, and failure blast radius. The old monolith returns wearing event-driven clothing.

The cure for most of these is not a new platform. It is sharper domain boundaries and more honest operational design.

When Not To Use

Do not use this pattern everywhere.

If the business is perfectly happy with daily or hourly reporting, traditional batch or warehouse ELT may be simpler and cheaper.

Do not use micro-batch “real-time” analytics where hard real-time control is required. Industrial safety systems, trading engines, and certain telecommunications control loops have latency and determinism requirements that demand specialized architecture.

Do not force event streaming into domains where the source systems cannot produce meaningful business events and where CDC plus periodic ingestion already satisfies the need. Sometimes adding Kafka simply creates more places for data to be wrong.

Also do not use this pattern if the organization lacks the discipline for event contracts, operational ownership, and reconciliation. A streaming platform without semantic governance is a machine for manufacturing expensive ambiguity.

Related Patterns

Several related patterns fit naturally around this approach.

  • CQRS: useful when read models for dashboards differ significantly from operational write models.
  • Event Sourcing: sometimes attractive, but often unnecessary for analytics. Use carefully; it is not the same as publishing domain events.
  • Lambda Architecture: historically combined batch and speed layers; many teams now prefer simpler unified approaches, but the core idea of correction via recomputation still matters.
  • Kappa Architecture: stream-first replay model; practical only if replay and semantic correction are genuinely mature.
  • Data Mesh: relevant when domains own analytical data products, but only if ownership includes semantic quality and interoperability.
  • Strangler Fig Pattern: central to migration from nightly batch to bounded-context event-driven analytics.

Here is a more detailed view of the serving path and reconciliation loop:

Diagram 3
Serving Path and Reconciliation Loop

That final arrow matters more than many architects admit. Correction is part of the design, not an embarrassment after the fact.

Summary

Most enterprise real-time analytics is not truly real-time. It is micro-batch with good manners.

That is not a weakness. It is often the right answer.

The business usually needs bounded freshness, not physics-defying immediacy. Domain semantics matter more than transport speed. Kafka and microservices can provide an excellent backbone, but only when events are grounded in bounded contexts and when the architecture includes materialization, replay, and reconciliation. Progressive strangler migration is the sane path from legacy batch. Trust is earned in parallel runs and correction loops, not in launch presentations.

The strongest architectures in this space make a clear distinction between:

  • immediate operational actions
  • short-window analytical refresh
  • authoritative reconciliation

Get that distinction right and the system becomes fast enough, trustworthy enough, and operable enough to survive enterprise reality.

Get it wrong and you will build a dashboard that moves like lightning and lies like a poet.

And enterprises, sooner or later, always notice the difference.
