BPMN Gateways Explained: XOR, OR, AND | NILUS

Monday, 9:12 a.m. A program manager is already frustrated.

A citizen’s emergency housing support application is showing three statuses in the case portal at the same time:

awaiting documents
under fraud review
approved for payment

Operations thinks the workflow engine has misfired. Delivery says it is probably an integration timing issue between the document portal and the payment platform. Policy insists the state is impossible. “That should never happen” is the exact phrase, which, in government, usually means it has in fact been happening for some time.

The architect opens the BPMN.

The problem is visible in under two minutes.

Not because the diagram is ugly. Plenty of ugly diagrams run perfectly well. The problem is that the gateway logic is wrong. The model says one thing, the runtime does something else, and the messy operational space in between gets filled with retries, spreadsheets, manual overrides, queue hacks, and contradictory audit trails.

This is the part people consistently underestimate. In public-sector transformation, gateway mistakes are not notation mistakes. They are production design mistakes. They become duplicate payments, missed eligibility checks, unlawful decisions, and case states nobody can defend in front of audit or tribunal review. Sometimes they become citizen harm, which, frankly, is a phrase architects should use more often than we do.

Tasks tell you what work gets done.

Gateways tell you what the institution is prepared to risk.

That may sound dramatic. After enough modernization programs, though, I think it is basically true.

The process behind the incident

The workflow in question is familiar to anyone who has spent time in benefits, grants, housing, licensing, or similar government services. A citizen submits an online application. The platform runs identity verification through an external IAM or identity provider. It checks initial eligibility through rules. If documents are missing, it requests them through a document portal. A fraud screen may trigger. A supervisor may need to review edge cases. If everything clears, payment is scheduled. If the case is rejected, appeal rights are triggered. Some events arrive synchronously; many do not.

Under the covers, the architecture is split the way these environments usually are:

CRM or case management holds the canonical case record
a document repository manages uploads and correspondence
a rules engine calculates entitlement
a payment platform handles disbursement
an event bus, often Kafka, carries status updates and external notifications
human tasks sit in one or two work queues nobody fully trusts

The BPMN model orchestrates all of it in a workflow engine.

Some steps are request/response and feel neat in workshops. A lot of the real process is neither neat nor immediate. Identity callbacks arrive late. Documents come in after deadlines. Fraud review pauses for days. Payment validation fails after approval. A citizen withdraws while a case worker is still sitting in the review screen. The process is long-running, exception-heavy, and full of partial information.

Which is exactly why government is the right place to explain gateways properly.

This is where simplistic flowchart thinking starts to fail.

The trap: teams think gateways are just diamonds

Most teams learn BPMN visually. They see a diamond and think “decision point.” That is not wrong. It is just incomplete in a way that becomes dangerous very quickly.

A BPMN gateway is not decoration. It defines execution semantics.

In practice, that means it answers operational questions such as:

Which paths are logically possible?
Which paths are actually allowed to execute?
Who or what resolves the choice?
Can several paths run at once?
Are we deciding from current data, or waiting for something to happen?
If branches split, what exactly allows the process to continue later?

This is where transformation programs drift off course. Analysts model policy language literally instead of modeling runtime behavior. They take a sentence like:

> If the applicant is eligible and identity is valid and there is no fraud concern and all required documentation is present…

…and turn it into a stack of diamonds with labels that sound sensible in a workshop, but mean very little once the engine starts executing against asynchronous data, delayed events, and human queues.

I have seen whole programs spend months refining task names while leaving gateway semantics vague. Then everybody acts surprised when implementation turns brittle.

A BPMN model that cannot answer “what happens if two things are true at once?” is not finished.

Start with XOR, because that’s where most teams think they are

The gateway most people are comfortable with is XOR: exclusive choice.

At runtime, XOR means exactly one outgoing path is taken. The choice is resolved from the data available at that moment. It is the classic “this outcome or that outcome, not both.”

For a government triage example:

if identity confidence score is below threshold, route to manual verification
else if household income exceeds the limit, reject
else continue to entitlement calculation

That is a good XOR use case. The outcomes are intended to be mutually exclusive. One branch should win. There is a clear next state.
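As a rough sketch of what that triage gateway means at runtime, here it is as a first-match function in Python. The thresholds, field names, and route names are hypothetical, chosen only to mirror the example above; a real engine would evaluate conditions on sequence flows rather than call a function like this.

```python
from dataclasses import dataclass


@dataclass
class Case:
    identity_score: float   # confidence from the identity provider, 0.0 to 1.0
    household_income: int   # annual income in whole currency units


# Hypothetical thresholds, for illustration only.
IDENTITY_THRESHOLD = 0.8
INCOME_LIMIT = 30_000


def triage(case: Case) -> str:
    """XOR: conditions are checked in a documented order and exactly one route is returned."""
    if case.identity_score < IDENTITY_THRESHOLD:
        return "manual_verification"
    if case.household_income > INCOME_LIMIT:
        return "reject"
    return "entitlement_calculation"  # the default flow
```

Note that a case with a low identity score and a high income goes to manual verification, not rejection. That precedence is a policy decision, and the function makes it explicit instead of leaving it to the order in which someone happened to draw the arrows.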

Here is a simple way to picture it:

Diagram 1 — Start with XOR, because that’s where most teams think they are

XOR works well when the policy outcome is singular and deterministic. It is especially useful when you have clearly ranked conditions and a controlled handoff into one next step.

But real programs misuse XOR all the time.

The common failure modes are familiar:

conditions are not truly exclusive
precedence is implied by order, but never documented
the “default” path goes somewhere operationally dangerous
XOR gets used where policy actually allows multiple obligations to be triggered at once

That last one matters more than teams think.

A case may need a document request and a fraud review. Or an accessibility support task and a municipality notification. Teams still force these into XOR because they are trying to keep the diagram simple. What they actually do is hide complexity in code, or worse, in human workarounds.

My view is pretty firm here: if you use XOR to simplify away real concurrent obligations, you are not simplifying the process. You are pushing the complexity somewhere less visible and much less governable.

A few practical rules help:

Pair XOR with explicit business rule ownership.
Define the default flow carefully.
Test overlap in conditions.
If the policy changes often, externalize the decision logic into DMN or a rule service.
Don’t hide ambiguity in labels like “eligible?” or “check result.”

Those labels are often a quiet sign that nobody has really pinned down the decision.
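"Test overlap in conditions" can be taken quite literally. The sketch below assumes the gateway conditions can be written as independent predicates and run against sample cases; the condition names, thresholds, and sample values are all hypothetical.

```python
from itertools import product

# Hypothetical gateway conditions, written as independent predicates.
CONDITIONS = {
    "manual_verification": lambda c: c["identity_score"] < 0.8,
    "reject": lambda c: c["household_income"] > 30_000,
}


def find_overlaps(conditions, sample_cases):
    """Return (case, matching_routes) for every case where more than one condition is true."""
    overlaps = []
    for case in sample_cases:
        matched = [name for name, cond in conditions.items() if cond(case)]
        if len(matched) > 1:
            overlaps.append((case, matched))
    return overlaps


# Every combination of a low/high identity score and a low/high income.
samples = [
    {"identity_score": score, "household_income": income}
    for score, income in product([0.5, 0.9], [10_000, 50_000])
]
overlaps = find_overlaps(CONDITIONS, samples)
```

Here the low-score, high-income case matches both conditions. That is not necessarily a defect, but it does mean the XOR only works because of precedence, and the precedence needs an owner.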

A table worth keeping

Early in architecture reviews, I like to use a blunt table like this. It cuts through theory quickly.

It is not exhaustive. But it is honest, and in my experience that matters more in review sessions than completeness for its own sake.

OR gateway: the one people avoid, then rebuild badly in code

OR is where BPMN starts to look like real administration.

An inclusive gateway means one or more paths may be taken. Not exactly one. Not necessarily all. Some combination, depending on the facts of the case.

That makes people uncomfortable because it forces teams to admit that public-sector processes are additive. Obligations stack. Exceptions are normal. Side reviews happen alongside primary processing. The neat “happy path plus one exception path” model is usually fiction.

Take application quality handling. Depending on the case, you may need one or more of the following:

request additional documents
initiate fraud review
notify a municipality of shared responsibility
create an accessibility support task for assisted completion

Some applications need one. Some need three. A small number need all four. Plenty need none.

That is an OR pattern.

And yet teams bend over backwards to avoid using OR. They replace it with chains of XOR gateways. Or they use AND because someone says “better to be safe and run everything.” Or they hide the activation logic inside task code, where the BPMN stops telling the truth.

All three are bad habits.

If you chain XORs, you imply single-path branching in a process that actually allows multiple obligations. Cases get routed incompletely. If you use AND, you trigger work that policy never required and operations never wanted. If you bury the logic in code, the model becomes ceremonial. It stops being an architecture artifact and turns into wall art.

A better mental model is this: OR is often the right gateway when obligations are conditionally additive.

That describes a great deal of government work.
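A minimal sketch of "conditionally additive", assuming the four obligations above are driven by four hypothetical case flags. The point is the return type: a set of branches, not a single route.

```python
def activated_branches(case: dict) -> set:
    """OR split: every branch whose condition holds is activated; the result may be empty."""
    branches = set()
    if case.get("missing_documents"):
        branches.add("request_documents")
    if case.get("fraud_flag"):
        branches.add("fraud_review")
    if case.get("shared_municipality"):
        branches.add("notify_municipality")
    if case.get("needs_assistance"):
        branches.add("accessibility_support")
    return branches
```

The empty set is a legitimate result here because plenty of cases need none of these. In a BPMN model, that "none" case normally needs a default flow on the inclusive gateway, otherwise the token has nowhere to go.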

Diagram 2 — BPMN Gateways Explained: XOR, OR, AND, Event-Based in Real

Now for the ugly part: OR joins.

This is where inexperienced teams get into trouble.

If you split with OR, you cannot casually merge it later and hope the engine “figures it out.” You have to reason about which branches were activated, which are still relevant, which are blocking, and what completion even means. In long-running case processing, where one branch is a human task and another is an asynchronous callback from a legacy platform, this gets subtle very quickly.

I have seen workflows deadlock because the model waited for a branch that was never activated. I have seen premature merges because the engine, or the designer, treated the OR split like an XOR split. I have also seen one branch complete days later and reopen state in ways no one anticipated.

So yes, use OR when it reflects reality. But only when you can explain the activation logic in plain language.

If you cannot say, out loud, “these are the combinations we expect, and this is what lets the process continue,” then the design is not ready.
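To make "which branches were activated" concrete, here is a deliberately simplified OR join in Python. It assumes the split records the set of branches it started; real engines typically work this out from token positions in the model, so treat this as a teaching sketch, not a description of how any particular engine behaves.

```python
class OrJoin:
    """Tracks which branches an OR split activated and which have completed."""

    def __init__(self, activated):
        self.activated = set(activated)
        self.completed = set()

    def complete(self, branch: str) -> None:
        if branch not in self.activated:
            # A completion for a branch that never started is a design or data error.
            raise ValueError(f"branch {branch!r} was never activated")
        self.completed.add(branch)

    def can_continue(self) -> bool:
        # Wait only for branches that were actually started, never for every modeled branch.
        return self.completed == self.activated
```

A join that waited for all four modeled branches would deadlock on any case that only activated two. A join that continued on the first completion would merge prematurely. Both incidents come from losing track of the activated set.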

The truth about joins

Most BPMN articles explain splits and then wave vaguely at joins. That is backwards. In implementation, joins create more incidents than splits.

Splitting is easy to imagine. Merging is where runtime semantics become real.

Some distinctions matter:

XOR split and XOR join are not the same design problem
AND join waits for a token on every incoming branch, whether or not that branch was ever activated
OR join waits for the branches that were actually activated and are still relevant
Event-based designs often avoid clean joins altogether by changing process state instead of re-merging neatly

That last point is worth pausing on. Not every path should merge back into one tidy line. Real operations are messier than workshop diagrams.

Consider a grant approval that requires both police clearance and income verification before release. That is a straightforward AND join: both checks must complete.

Now consider a side appeal review spawned after a correction request. It may need to run, but it should not block payment correction already authorized under a separate legal basis. Forcing that back into a single blocking join is a modeling mistake, not a process requirement.

Government processes often include branches that are informational, supervisory, or advisory. They matter. They may need tracking and audit. But they should not always block the main case progression.

This is where architects earn their keep. Not by drawing more diamonds, but by deciding what actually constitutes synchronization.

The common failures are familiar enough:

accidental deadlocks
premature merge
orphaned tasks after the main case advances
hidden dependency on external SLA that nobody modeled as such

If your branch depends on a third-party response and your join assumes it will always arrive, congratulations: you have built a waiting-state incident and scheduled it for later.
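One way to force the synchronization question into the open is to make "blocking" and "maximum wait" explicit properties of each branch. The sketch below is an assumption-laden illustration: the `Branch` fields and the three outcomes are invented for this example and do not correspond to any specific engine feature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Branch:
    name: str
    blocking: bool                        # advisory branches are tracked but never hold the case
    done: bool = False
    waited_days: int = 0
    max_wait_days: Optional[int] = None   # None means nobody modeled a limit


def join_decision(branches) -> str:
    """Return 'continue', 'wait', or 'escalate' for the main case flow."""
    pending = [b for b in branches if b.blocking and not b.done]
    if not pending:
        return "continue"
    overdue = [
        b for b in pending
        if b.max_wait_days is not None and b.waited_days > b.max_wait_days
    ]
    if overdue:
        return "escalate"   # an external dependency missed its modeled limit
    return "wait"
```

With the grant example, police clearance and income verification would be blocking, while the side appeal review would be tracked with `blocking=False`. A blocking branch with `max_wait_days=None` is exactly the unmodeled SLA dependency described above: it can only ever return "wait".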

AND gateway: simple on paper, dangerous in distributed reality

The AND gateway looks easy. All outgoing branches start. Parallelism is explicit. Usually, a synchronization point comes later.

In a government payment process, before releasing a high-value disbursement, you might run in parallel:

sanctions screening
bank account validation
final entitlement calculation
audit log packaging

That is legitimate AND usage. All four checks are required. There is no policy basis to release payment until they are done.

Architects like AND because it maps well to service decomposition. One workflow engine orchestrates. Several services or serverless functions execute checks. Elapsed time comes down. Independent controls become visible. Everyone feels modern.

Operations teams often dislike it for exactly the same reason.

One branch fails and the case waits. One callback is delayed and the work item sits in limbo. One service retries with duplicate events and now the engine sees two completions for one branch. Compensation logic is half-designed because people were excited about parallelism and bored by failure scenarios.

And in distributed systems, “parallel” is never just parallel. It means separate systems, separate reliability profiles, separate ownership, separate telemetry, and usually separate excuses.

In cloud-heavy implementations, the workflow engine orchestrates, Kafka or another broker carries events, Lambdas or microservices perform checks, and results return asynchronously. That can work very well. But only if you are disciplined about the basics:

branch idempotency
correlation IDs
timeout handling
mandatory versus advisory checks
branch-level observability, not just case-level status

I will be blunt: if your monitoring only tells you “case pending,” you do not understand your AND gateway in production.

You need to know which branch is pending, for how long, with what retry history, against which external dependency, and whether the branch still matters operationally.
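Here is a small sketch of what branch idempotency, correlation, and branch-level visibility look like together. The class, its method names, and the case and branch identifiers are hypothetical; a real implementation would persist this state rather than hold it in memory.

```python
import time


class ParallelChecks:
    """AND join for one case: every required branch must report, and only the first report counts."""

    def __init__(self, case_id, required, now=time.time):
        self.case_id = case_id
        self.now = now
        self.started_at = {branch: now() for branch in required}
        self.results = {}      # branch -> result; the first completion wins
        self.duplicates = 0

    def on_completion(self, case_id, branch, result) -> bool:
        """Record a completion event. Returns True only if it changed state."""
        if case_id != self.case_id or branch not in self.started_at:
            return False           # correlation failed: not this case, or not a known branch
        if branch in self.results:
            self.duplicates += 1   # retry or redelivery: ignore it, but count it
            return False
        self.results[branch] = result
        return True

    def pending(self) -> dict:
        """Branch-level view: which branches are still open, and for how many seconds."""
        t = self.now()
        return {b: t - s for b, s in self.started_at.items() if b not in self.results}

    def can_release(self) -> bool:
        return not self.pending()
```

The useful output is `pending()`, not `can_release()`. "Case pending" is what `can_release()` tells you; which branch, and for how long, is what operations actually needs.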

Also, do not use AND simply because teams can work in parallel organizationally. That is a trap. The process should branch in parallel only when the control logic requires every branch to run, not because someone says “these departments can each do their part.”

Organizational parallelism and process parallelism are not the same thing. I have had to untangle that confusion more than once.

Event-Based gateway: where BPMN finally meets reality

If I had to pick one gateway type that is underused and misunderstood in government transformation, it would be the event-based gateway.

Government workflows spend an extraordinary amount of time waiting.

Not waiting for calculations. Waiting for things to happen.

a citizen uploads a document
a deadline expires
an applicant withdraws
a court order arrives
a payment rejection is returned
an external agency responds
an IAM verification callback lands after a delay

This is not just data-based branching. It is a different execution model.

An event-based gateway waits for one of several events. The first event received determines the path. It is ideal for long-running, externally influenced processes.

Take the missing documents scenario. After the agency sends a request for additional evidence, the case waits for one of three things:

document received event
applicant withdrawal event
response deadline timer

Whichever happens first moves the case.

That is not XOR with a status field.

And yet teams model it that way all the time. They create a status in the CRM, then build polling logic or periodic checks that ask “has document arrived?” “has deadline passed?” “has applicant withdrawn?” Then they wonder why they get race conditions, duplicate reminders, stale statuses, and audit trails that make no temporal sense.

If your process is waiting for an external occurrence, model it as waiting for an external occurrence.

That sounds obvious. In practice, it rarely is.

Here is the shape of it:

Diagram 3 — BPMN Gateways Explained: XOR, OR, AND, Event-Based in Real
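The same shape can be sketched with Python's `asyncio`: several waits are armed at once, the first to fire decides the path, and the losers are cancelled. This is only an in-process illustration. The event names are hypothetical, and an in-memory `asyncio.sleep` is not a durable timer, so it says nothing about how a workflow engine would persist the wait.

```python
import asyncio


async def wait_for_first(events: dict, deadline_seconds: float) -> str:
    """Event-based gateway: return the name of whichever event fires first, or 'deadline'."""
    waiters = {asyncio.create_task(ev.wait()): name for name, ev in events.items()}
    waiters[asyncio.create_task(asyncio.sleep(deadline_seconds))] = "deadline"

    done, pending = await asyncio.wait(list(waiters), return_when=asyncio.FIRST_COMPLETED)

    # Losing subscriptions are cancelled explicitly so they cannot fire later.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    # If several finished in the same loop iteration, declaration order breaks the tie.
    for task, name in waiters.items():
        if task in done:
            return name
    raise RuntimeError("unreachable: FIRST_COMPLETED always yields a finished task")
```

A caller would pass something like `{"document_received": asyncio.Event(), "applicant_withdrew": asyncio.Event()}` and a deadline. Note the absence of any status field or polling loop: nothing asks "has the document arrived yet?"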

In production, event-based gateways require more architectural discipline than teams expect.

You need:

a correlation strategy so the right event wakes the right case
durable timers
message idempotency
explicit cancellation of losing event subscriptions
tolerance for bad event quality from external systems

This is where Kafka often enters the conversation. Kafka is excellent for event distribution and replay, but it does not magically solve process correlation. You still need a case key strategy, event contracts, deduplication handling, and a clear answer to what happens when an event arrives late or twice.

The same is true with IAM integrations. An identity provider callback that arrives after a timeout branch has already progressed is not unusual. Your process model needs to know whether to ignore it, record it, reopen the case, or trigger a remediation path. If nobody has thought that through, the gateway design is incomplete.
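A sketch of what "thought through" can look like for late and duplicate events, assuming each event carries a case key and a unique event id. The field names and the four outcome labels are assumptions for this example; whether a late event should reopen the case or only be recorded is a policy decision the code cannot make for you.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CaseWait:
    case_id: str
    state: str = "waiting"                  # 'waiting' or 'resolved'
    winner: Optional[str] = None
    seen_event_ids: set = field(default_factory=set)
    audit: list = field(default_factory=list)


def handle_event(case: CaseWait, event: dict) -> str:
    """Classify an incoming event as 'accepted', 'duplicate', 'late', or 'uncorrelated'."""
    if event["case_id"] != case.case_id:
        return "uncorrelated"
    if event["event_id"] in case.seen_event_ids:
        return "duplicate"
    case.seen_event_ids.add(event["event_id"])
    if case.state == "resolved":
        # The wait already ended: record the event for audit, do not silently drop it.
        case.audit.append(("late", event["type"]))
        return "late"
    case.state, case.winner = "resolved", event["type"]
    case.audit.append(("accepted", event["type"]))
    return "accepted"
```

In this sketch a late identity callback is recorded and nothing else happens. If policy says it should trigger remediation, that belongs in the model as an explicit path, not as a side effect hidden in the handler.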

I have heard teams call these “edge cases.” They are not edge cases. They are normal operating conditions in asynchronous public services.

Back to the incident: what was actually wrong

That emergency housing support case with three statuses at once? The BPMN had several gateway flaws layered on top of each other.

First, XOR was used where OR was needed. Missing documents and fraud review were modeled as mutually exclusive, even though some applications legitimately required both. So the engine would route one way initially, then custom code would awaken another obligation later, leaving contradictory active states behind.

Second, an AND gateway had been placed before payment release, but one of the branches was an optional manual review path that only existed for certain risk conditions. The join logic assumed all branches mattered equally. Some cases sailed through too early; others hung forever.

Third, timeout behavior for document submission was modeled as a data condition checked later, not as an event-based wait. That led to polling, duplicate notices, and a race between document arrival and deadline handling.

Finally, there were no clean join semantics after the exceptional review path. Work completed out of sequence. Notifications fired twice. Payment advanced while review remained active. Operations started reconciling it in spreadsheets because that is what operations nearly always does when architecture leaves a gap.

This is what I mean by shadow process architecture.

When the formal workflow cannot represent reality, the real process moves into inbox rules, ad hoc queues, SQL extracts, side conversations, and tribal knowledge.

And then the organization says it has modernized.

Mistakes architects keep making

It is worth being direct about the recurring errors.

Drawing policy, not execution.

Legal text says several things may happen. Fine. The model still has to show who decides, when, based on what information, and whether several outcomes can coexist.

Confusing optional with parallel.

An optional branch is not an AND branch. “Could happen” is not the same as “must start now.”

Avoiding OR because the engine team is nervous.

This is common. OR semantics feel harder, so teams push them into code or rules or manual operations. The BPMN gets simpler; the platform gets more fragile.

Treating timers as decorations.

A statutory deadline is not a note on the side of a task. It is an event with legal and operational consequences.

Ignoring join semantics.

The split is only half the design. In practice, less than half.

Modeling human work as deterministic.

Manual tasks in government get paused, reassigned, escalated, returned, abandoned, reopened, and occasionally lost in a work queue that somebody swears is being monitored.

No branching test cases.

Teams review diagrams visually. They do not execute realistic scenario combinations. Then production becomes the test harness.

I have a personal rule: if a process has more than trivial branching and nobody can walk through ten realistic combinations end to end, the model is not mature enough to automate.
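Walking through combinations does not have to be a workshop exercise. The sketch below enumerates every combination of three hypothetical case facts and checks one invariant taken from the incident: a case must never be approved for payment while documents or fraud review are still open. The `route` function is a stand-in for whatever executes the model, not a real engine call.

```python
from itertools import product

FACTS = ["missing_documents", "fraud_flag", "needs_assistance"]
BLOCKING_STATES = {"awaiting_documents", "fraud_review"}


def route(case: dict) -> set:
    """Hypothetical stand-in for the modeled process: returns the set of active states."""
    active = set()
    if case["missing_documents"]:
        active.add("awaiting_documents")
    if case["fraud_flag"]:
        active.add("fraud_review")
    if case["needs_assistance"]:
        active.add("accessibility_support")
    if not active & BLOCKING_STATES:
        active.add("approved_for_payment")   # only reached when nothing blocking is open
    return active


def violations(route_fn) -> list:
    """Walk every combination of facts and collect the cases that break the invariant."""
    bad = []
    for values in product([False, True], repeat=len(FACTS)):
        case = dict(zip(FACTS, values))
        states = route_fn(case)
        if "approved_for_payment" in states and states & BLOCKING_STATES:
            bad.append(case)
    return bad
```

Three facts give eight combinations, which is already close to the ten I ask for. Each additional fact doubles the number of combinations, which is a good argument for doing this with a loop instead of a whiteboard.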

How to choose the right gateway in architecture reviews

You do not need mystical BPMN expertise. A few review questions usually expose the right choice.

Ask them in this order:

Is exactly one path valid now?
Can more than one path be active?
Must all of them start?
Are we waiting for an external event rather than evaluating current data?
If multiple branches start, what exactly lets us continue later?

That sequence matters. It moves the conversation from notation to runtime behavior.

Some useful heuristics:

if you say “one or more,” consider OR
if you say “all of these checks,” consider AND
if you say “whichever happens first,” consider Event-Based
if you say “this outcome or that outcome, not both,” consider XOR

In architecture governance, I like requiring one sentence of rationale per gateway. Just one sentence. Not a page.

Something like:

“XOR because triage outcomes are mutually exclusive and resolved from current case data.”
“OR because document request and fraud review may both be required.”
“Event-Based because the case waits for citizen response, withdrawal, or timer expiry.”
“AND because all payment controls must complete before release.”

If nobody in the room can explain a gateway verbally, it should not pass design review.

That may sound harsh. It saves a lot of pain later.
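If the rationale lives in the model itself, the rule can be checked mechanically. The sketch below assumes a team convention of putting the one-sentence rationale in each gateway's BPMN 2.0 `documentation` element; that convention, the function name, and the sample process are assumptions for this example. It covers the four gateway types discussed here and ignores others.

```python
import xml.etree.ElementTree as ET

BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
GATEWAY_TAGS = ["exclusiveGateway", "inclusiveGateway", "parallelGateway", "eventBasedGateway"]


def gateways_missing_rationale(bpmn_xml: str) -> list:
    """Return the ids of gateways that have no non-empty <documentation> child."""
    root = ET.fromstring(bpmn_xml)
    missing = []
    for tag in GATEWAY_TAGS:
        for gateway in root.iter(f"{{{BPMN_NS}}}{tag}"):
            doc = gateway.find(f"{{{BPMN_NS}}}documentation")
            if doc is None or not (doc.text or "").strip():
                missing.append(gateway.get("id"))
    return missing


# A made-up fragment: one gateway with a rationale, one without, one with a blank one.
SAMPLE = """<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL">
  <process id="housing_support">
    <exclusiveGateway id="gw_triage">
      <documentation>XOR because triage outcomes are mutually exclusive.</documentation>
    </exclusiveGateway>
    <inclusiveGateway id="gw_quality" />
    <parallelGateway id="gw_prepayment"><documentation>  </documentation></parallelGateway>
  </process>
</definitions>"""
```

Running `gateways_missing_rationale(SAMPLE)` flags the two gateways without a usable sentence. A check like this cannot tell whether the sentence is any good, which is why the verbal explanation in the review still matters.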

Patterns that show up in government modernization

You see recurring combinations.

Digital intake plus manual exception lane.

XOR handles standard routing. OR triggers additive reviews. Event-Based waits for citizen response windows. This is probably the most common shape in benefits and housing.

Licensing or permitting with external agency dependencies.

AND is useful for concurrent validations. Event-Based handles third-party responses and statutory deadlines. Often there is a long wait state in the middle that people initially try to model as status checks.

Grants management.

OR is often the right answer for compliance activities triggered by funding conditions. XOR covers award, reject, revise outcomes. AND handles pre-disbursement controls.

Appeals and reconsiderations.

Event-Based is underrated here. After a decision notice, the process may wait for appeal submission, lapse of the appeal window, or receipt of a correction event. That is cleaner than inventing awkward status polling.

These patterns are not theoretical. They show up in actual transformation backlogs, and usually sooner than teams expect.

Tooling and execution concerns people leave out

BPMN articles often stop at semantics and pretend engine behavior is a detail. It is not.

Not all workflow engines handle complex OR semantics clearly. Some do. Some technically support them but make runtime behavior opaque. If your platform team cannot explain how the engine persists waiting states, resolves joins, and handles duplicate messages, you are not ready for mission-critical event-based government services.

That is not snobbery. It is survival.

A few implementation concerns matter more than teams usually admit:

event broker versus direct API callback
correlation keys across systems
retries and duplicate events
compensation after partial completion
branch-level metrics
timer monitoring
auditable state transitions

And yes, cloud architecture changes the conversation. Serverless functions are great for isolated checks. Kafka is great for decoupled event flows. IAM services are great for external identity assertions. None of them removes the need to model gateway semantics correctly. If anything, they make correctness more important, because asynchronous behavior amplifies bad assumptions.

Keep the diagrams honest

A few practical habits help more than sophisticated notation debates ever do.

Name gateways so they express decision or wait semantics clearly. Avoid labels like “check status” or “process decision.” Those are placeholders pretending to be design.

Keep policy and process connected, but separate. Volatile decision logic belongs in DMN or rule services. Flow control belongs in BPMN. When everything gets jammed into gateways, the model becomes unreadable. When everything gets pushed into code, the model becomes dishonest.

Model for operations, not just workshops. Include exceptions that actually happen. Represent timers explicitly. Capture manual intervention paths. If a branch exists only because “that sometimes happens and ops usually handles it,” then it probably belongs in the model.

And review with mixed audiences:

policy owner
operations lead
workflow engineer
compliance or audit representative

That mix catches different kinds of lies.

Also, test scenarios. Real ones.

no documents received
documents arrive after deadline
fraud review and accessibility support both triggered
payment validation fails after approval
appeal arrives before disbursement completes

If the model cannot survive those conversations, it will not survive production.

When not to lean too hard on BPMN gateways

Not every branching problem belongs in BPMN.

Some decision complexity belongs in a rules engine. Some highly unpredictable work is better handled as case management. Some event choreography should remain decentralized across services instead of being centrally orchestrated.

There is a government nuance here that matters: if the work is dominated by human discretion and emergent behavior, forcing every branch into BPMN can create false precision. The diagram looks rigorous. The operation remains fluid. That gap is dangerous too.

Sometimes the most mature architectural choice is to stop pretending the process is fully predetermined.

What changed after the redesign

Back to the housing-support workflow.

The redesign was not glamorous. No one got excited about it on a steering call. But it worked.

OR replaced XOR where obligations were additive. Event-Based gateways were introduced for document response, withdrawal, and deadline waiting. AND was narrowed to genuinely mandatory pre-payment controls. Joins were redesigned around active branch semantics rather than diagram neatness. Operational dashboards were updated to show branch-level state, not just a single case status.

The results were exactly the sort of results good architecture usually produces: not flashy, just sane.

fewer contradictory case states
cleaner audit trail
less manual reconciliation
fewer duplicate notifications
more confidence from policy teams
better shared language between architecture and operations

That last one is underrated.

When gateway choices are right, the model starts matching how the service actually behaves. And once that happens, conversations improve. Policy can see its intent. Engineers can see execution semantics. Operations can see waiting states and obligations. Audit can see who decided what, when, and why.

That is the real value.

Gateway choice is not notation trivia. In government transformation, it is one of the clearest indicators of whether the architecture understands the real process or only the workshop version of it.

And if you have been around enough modernization programs, you learn to tell the difference very quickly.

Frequently Asked Questions

What is BPMN used for?

BPMN (Business Process Model and Notation) is used to document and communicate business processes. It provides a standardised visual notation for process flows, decisions, events, and roles — used by both business analysts and systems architects.

What are the most important BPMN elements to learn first?

Start with: Tasks (what happens), Gateways (decisions and parallelism), Events (start, intermediate, end), Sequence Flows (order), and Pools/Lanes (responsibility boundaries). These cover most of what real-world process models need.

How does BPMN relate to ArchiMate?

BPMN models the detail of individual business processes; ArchiMate models the broader enterprise context — capabilities, applications supporting processes, and technology infrastructure. In Sparx EA, BPMN processes can be linked to ArchiMate elements for full traceability.