Julien Danjou
April 13, 2026 · 8 min read

How We Formally Verified Our Merge Queue with TLA+ (and Found Bugs That Tests Missed)

We wrote a TLA+ specification for our merge queue state machine. TLC explored 468,000 states and found two bugs that our test suite and years of production traffic had missed.

We recently refactored the state machine at the core of our merge queue. Hundreds of tests passed, but we still didn’t trust the result. So we wrote a TLA+ specification, pointed TLC (the TLA+ model checker) at it, and let it explore 468,000 distinct states in under a minute. It found bugs that had been latent in production for years, never triggered but waiting for the right sequence of events.

Internally, we model the queue as a train: an ordered sequence of cars, where each car holds one or more PRs. A car moves through states (preparing, waiting for CI, waiting to merge) and eventually either merges or gets ejected. On top of that basic flow, the queue supports batching (grouping multiple PRs into a single CI run), freezing (temporarily blocking a PR from merging, e.g. during a deploy freeze), and speculative CI (testing a PR against the expected result of merging everything ahead of it, before those merges actually happen). These features interact in ways you won’t catch by staring at Python code.

Why TLA+ and not more tests

Consider this scenario: a PR passes CI and is ready to merge, but then a freeze rule activates on it. Does the queue catch this and block the merge? Or does the merge fire before the freeze takes effect? This is the kind of question you’d only think to ask after the bug hits production.

With 8 PRs, freeze toggling, CI pass/fail for each car, and batching, the number of possible states grows faster than anyone could enumerate by hand. Our test suite checks scenarios we imagined. TLC checks every reachable state of the model, trying every possible ordering of events automatically.

I’d never written TLA+ before this. I used Claude to generate the initial spec from our Python codebase, then iterated. Run TLC, read the counterexample (the step-by-step trace showing how an invariant was violated), refine the model, repeat. Claude got the basic structure right but missed modeling subtleties (how batches split into sub-batches on failure, how cars get deleted atomically on merge). It took about a week of iteration to get a spec that TLC could check meaningfully.

What to model, what to skip

The hardest part wasn’t the TLA+ syntax. It was deciding what to leave out.

Our merge queue implementation spans four Python modules: the train (the ordered queue), the car (a single queue entry holding one or more PRs), the car outcome (what happens after CI), and the convoy lock (concurrency control). The spec mirrors this structure but focuses on the train and cars.

The state is compact: the train (the ordered sequence of car IDs), a cars map (car ID to its PRs and current state), three sets tracking where every PR is (waiting, merged, or ejected), and a set of PRs currently affected by a freeze rule.

VARIABLES
    train,          \* Ordered sequence of top-level car IDs
    cars,           \* Function: car_id -> car record
    waiting,        \* Set of PRs waiting to be batched
    merged,         \* Set of PRs that have been merged
    ejected,        \* Set of PRs that have been ejected
    freezeAffected  \* Set of PRs currently affected by a freeze rule
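For readers who don't write TLA+, the same state can be sketched as Python dataclasses. The names below are illustrative, not Mergify's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class Car:
    prs: set    # PRs riding in this car
    state: str  # e.g. "preparing", "waiting_for_ci", "waiting_for_merge"

@dataclass
class QueueState:
    train: list = field(default_factory=list)          # ordered car IDs
    cars: dict = field(default_factory=dict)           # car_id -> Car
    waiting: set = field(default_factory=set)          # PRs waiting to be batched
    merged: set = field(default_factory=set)           # PRs already merged
    ejected: set = field(default_factory=set)          # PRs kicked out of the queue
    freeze_affected: set = field(default_factory=set)  # PRs under a freeze rule
```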

What we deliberately left out:

  • Batching delay (WAITING_FOR_BATCH). A timer that waits before grouping PRs. Doesn’t affect ordering correctness.
  • Merge condition re-checks (CONDITIONS_FAILED). Separate from queue safety.
  • GitHub API and git operations. IO concerns, not state logic.
  • Parallel queues. Teams can run separate queues by scope (e.g., frontend vs. backend). We modeled a single queue with a scope constraint on batching instead. Simpler, same invariants.

This is the main judgment call. Too much detail and TLC can’t finish exploring the state space (it grows exponentially with each variable). Too little and you miss real interactions. We erred toward less: if a feature doesn’t affect ordering or safety, it stays out. The tradeoff is that bugs hiding in the interaction between modeled and unmodeled features won’t be caught, so you need to choose your abstraction boundary carefully.

Here’s the simplified state machine we ended up modeling. Each car moves through these states from entry to either merge or ejection:

stateDiagram-v2
    [*] --> preparing
    preparing --> waiting_for_ci
    preparing --> error

    waiting_for_ci --> waiting_for_merge : CI passes
    waiting_for_ci --> batch_split : CI fails (batch)

    batch_split --> preparing : re-queued individually

    waiting_for_merge --> frozen : freeze activates
    waiting_for_merge --> [*] : merged

    frozen --> waiting_for_merge : freeze deactivates

    error --> [*] : ejected
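Transcribed into a Python transition table (an illustrative sketch of the diagram above, not Mergify's implementation), the allowed moves are:

```python
# Allowed transitions of the simplified car state machine, transcribed from
# the diagram above. "merged" and "ejected" correspond to the car being
# deleted, so they appear only as targets (all names are illustrative).
TRANSITIONS = {
    "preparing":         {"waiting_for_ci", "error"},
    "waiting_for_ci":    {"waiting_for_merge", "batch_split"},
    "batch_split":       {"preparing"},           # failed batch re-queued individually
    "waiting_for_merge": {"frozen", "merged"},    # merge deletes the car atomically
    "frozen":            {"waiting_for_merge"},   # freeze deactivates
    "error":             {"ejected"},             # car removed from the train
}

def can_transition(src: str, dst: str) -> bool:
    return dst in TRANSITIONS.get(src, set())
```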

Invariants that catch real bugs

TLC checks invariants at every reachable state: boolean formulas that must be true at all times. Here are the ones that mattered most.

NoPRLost: every PR is always accounted for, either waiting, assigned to a car, merged, or ejected.

NoPRLost == AllAccountedPRs = PRs

One line. If any action accidentally drops a PR (say, during a batch split or when moving a PR back to the waiting set), TLC flags it immediately.
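In Python terms, the check amounts to one set equality. A sketch with hypothetical names:

```python
def no_pr_lost(all_prs, waiting, merged, ejected, cars):
    """Sketch of the NoPRLost invariant: every PR is accounted for.

    `cars` maps car_id -> set of assigned PRs; names are illustrative,
    not Mergify's actual code.
    """
    assigned = set().union(*cars.values()) if cars else set()
    return (waiting | merged | ejected | assigned) == all_prs
```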

NoDuplicatePR: no PR appears in two places at once. When batches split on failure and PRs get reshuffled, it’s easy for a PR to end up both in a sub-batch and back in the waiting set.

NoDuplicatePR ==
    /\ merged \cap waiting = {}
    /\ merged \cap ejected = {}
    /\ merged \cap AllAssignedPRs = {}
    /\ ejected \cap waiting = {}
    /\ ejected \cap AllAssignedPRs = {}
    /\ waiting \cap AllAssignedPRs = {}

Six checks that no PR appears in two sets at once.
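The same disjointness property in Python, generated pairwise rather than spelled out (an illustrative sketch, not Mergify's code):

```python
from itertools import combinations

def no_duplicate_pr(waiting, merged, ejected, cars):
    # The four bookkeeping collections must be pairwise disjoint.
    # `cars` maps car_id -> set of assigned PRs (names are illustrative).
    assigned = set().union(*cars.values()) if cars else set()
    groups = [waiting, merged, ejected, assigned]
    return all(a.isdisjoint(b) for a, b in combinations(groups, 2))
```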

NoFrozenPRMerging: if a PR is frozen, its car should never be in the waiting_for_merge state. If a freeze activates after CI passes, the car must transition to frozen before the merge fires.

NoFrozenPRMerging ==
    \A cid \in DOMAIN cars :
        cars[cid].state = "waiting_for_merge"
        => \A pr \in cars[cid].prs : pr \notin freezeAffected

This is the one TLC broke.

ScopeConsistency: all PRs in a car share the same scope. If your team uses scopes to isolate frontend and backend PRs, this invariant makes sure they never end up in the same batch.

ScopeConsistency ==
    \A cid \in DOMAIN cars :
        \A p1, p2 \in cars[cid].prs : PRScope[p1] = PRScope[p2]

One unexpected lesson: two of our invariants (OrderingRespected and SubBatchMergeOrdering) turned out to be vacuously true. They referenced a "merged" car state that doesn’t exist in the model, because cars are atomically deleted when merged. The invariants were technically satisfied at every state but checking nothing. Dead invariants are a modeling smell, and they’re easy to miss because TLC won’t complain about them.

Bug #1: Frozen PRs slipping through to merge

TLC produced a counterexample trace for the NoFrozenPRMerging violation. Simplified, it looks like this:

  1. A car with PR #3 passes CI and enters waiting_for_merge
  2. FreezeActivate(3) fires, adding PR #3 to freezeAffected
  3. MergeHead fires. The car is at position 1, state is waiting_for_merge. Merge proceeds.

The problem is a two-step process with a gap. First, FreezeActivate marks the PR as freeze-affected, but doesn’t change the car’s state. Then a separate Freeze action is supposed to notice this and transition the car to frozen. But between those two steps, MergeHead can fire. The car is still in waiting_for_merge, so the merge proceeds, ignoring the freeze entirely.

sequenceDiagram
    participant CI
    participant Car as Car with PR 3
    participant Freeze as Freeze System
    participant Merge as Merge Action

    CI->>Car: CI passes
    Car->>Car: state = waiting_for_merge
    Freeze->>Freeze: FreezeActivate(PR #3)
    Note over Freeze: freezeAffected = {3}
    Note over Car: state still waiting_for_merge
    Merge->>Car: MergeHead fires
    Note over Merge: Bug: merges despite freeze

In the Python code, this corresponded to a real gap. When a freeze activated after CI had already passed, there was no mechanism to re-evaluate the car’s state before the merge path ran. The car sat in WAITING_FOR_MERGE with a freeze-affected PR, and nothing kicked it back to FROZEN.

Could a targeted test have caught this? Maybe, if someone had thought to write a test for “freeze activates between CI pass and merge.” But the whole point is that nobody thought of it. The state machine had been running in production for years without this particular sequence of events occurring. TLC found it because it doesn’t need someone to imagine the scenario first.

My first fix attempt scattered defensive checks at every point where a car could advance toward merging. My colleague Mehdi took a better approach: instead of guarding every downstream decision point, he added a single check at the source. When any PR in a car is freeze-affected but the car’s state is still waiting_for_merge, the car gets re-evaluated and transitions to frozen before anything else can act on it.

Fixing it at the root is always better than scattering guards. The TLA+ spec now includes freeze guards on both MergeHead and SpeculativeSkip:

MergeHead ==
    /\ Len(train) > 0
    /\ LET cid == train[1] IN
       /\ cars[cid].state = "waiting_for_merge"
       /\ \A pr \in cars[cid].prs : pr \notin freezeAffected
       /\ merged' = merged \cup cars[cid].prs
       ...

The \A pr \in cars[cid].prs : pr \notin freezeAffected guard is what was missing. Without it, TLC finds the race.
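In Python terms, the root-cause fix is a re-evaluation step that runs before any merge decision. This is a hedged sketch of the idea with hypothetical names, not Mergify's actual code:

```python
from dataclasses import dataclass

@dataclass
class Car:
    prs: set
    state: str

def reevaluate_freeze(car, freeze_affected):
    # Root-cause fix (sketch): if any PR in the car became freeze-affected
    # after CI passed, demote the car to "frozen" before anything can act.
    if car.state == "waiting_for_merge" and car.prs & freeze_affected:
        car.state = "frozen"

def merge_head(train, cars, freeze_affected, merged):
    # Merge the car at the head of the train, mirroring the guarded
    # MergeHead action above (illustrative names).
    if not train:
        return
    head = cars[train[0]]
    reevaluate_freeze(head, freeze_affected)   # the guard that was missing
    if head.state != "waiting_for_merge":
        return
    merged |= head.prs                         # PRs move to the merged set
    del cars[train[0]]                         # car deleted atomically
    train.pop(0)
```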

Bug #2: Speculative skip promoting untested code

Our merge queue has a performance optimization called skip_intermediate_results. Say three cars are queued: A, B, C. Each car’s CI run includes the changes from all cars ahead of it, so car C’s CI run effectively tests A+B+C together. If C passes, we can skip waiting for A’s and B’s individual CI results and promote them straight to waiting_for_merge.

The bug: the promotion code promoted all intermediate cars, including those still in preparing state. A car in preparing hasn’t created its merge branch or draft PR yet. Promoting it to waiting_for_merge means the merge code runs against a car with no branch to merge, no CI metadata, and no queue pull request number. The merge path silently falls through to the wrong code branch, and lifecycle signals get dropped because the fields they check are all None.

The fix was a precondition check: only promote cars that have reached waiting_for_ci or later, where the merge branch and draft PR actually exist. A simple unit test could have caught this, but nobody had thought to write it. TLC found it as a byproduct of exhaustive checking: it doesn’t distinguish “hard to imagine” bugs from “easy but overlooked” ones. It just tries everything.
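The shape of that precondition, sketched in Python (state names and function names are illustrative, not Mergify's code):

```python
from dataclasses import dataclass

@dataclass
class Car:
    state: str

# States at or past which a car already has its merge branch and draft PR,
# so speculative promotion is safe (sketch; names are illustrative).
PROMOTABLE_STATES = {"waiting_for_ci", "waiting_for_merge"}

def promote_intermediate_cars(train, cars, passing_car_id):
    # When `passing_car_id`'s CI result covers every car ahead of it,
    # promote the intermediate cars -- but only those past "preparing".
    for cid in train:
        if cid == passing_car_id:
            break
        if cars[cid].state in PROMOTABLE_STATES:  # the missing precondition
            cars[cid].state = "waiting_for_merge"
```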

What we learned

The spec lives in our repo now, alongside the code it models. When the state machine changes, the spec gets updated. This is a real maintenance burden: 1,098 lines of TLA+ for roughly 2,000 lines of Python state machine. If the spec drifts from the code, TLC’s green light gives false confidence. We don’t have a CI gate for this yet.

Every future refactor of the state machine can now be checked against the same invariants. The freeze race bug was latent for years; the next one might not be. Running TLC after a change takes under a minute and covers the full state space of the model (within the configured bounds: 8 PRs, 2 scopes, 3 parallel checks).

The dead invariants lesson stuck with me the most. We had invariants that TLC “verified” at every state, but they were checking nothing because they referenced states the model never reaches. Writing the invariant is only half the job. You also need to confirm that TLC can actually violate it when the property breaks. A passing TLC run you can’t break on demand isn’t evidence of safety. It’s just silence.
