Julien Danjou
April 25, 2026 · 5 min read

A merge queue is critical infrastructure. Build it accordingly.

On April 23, GitHub’s merge queue silently reverted merged code for about four and a half hours. Pull requests that passed CI ended up on main with the wrong contents. Branch protection ran and the PR pages showed merged. Engineers found out hours later, when something on main started behaving in ways that didn’t match the diff.

This is the worst kind of failure a merge queue can produce. There’s no outage banner and the SLA monitors stay green. The audit trail looks clean too. The only signal is that the code on main no longer matches the code your team thought it merged.

How a merge queue can lie

A merge queue exists to solve one problem. When many engineers land many PRs, you can’t trust that each one will still pass when combined with everything ahead of it in the queue. The queue tests PRs in the order they will land, with their predecessors applied, and only merges them if the combined state stays green.
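A toy sketch of that ordering guarantee, not any real queue's implementation: `apply` and `run_ci` are hypothetical stand-ins for the staging and CI machinery, and the only point is that each PR is tested on top of main plus every PR still ahead of it.

```python
from typing import Callable

# Toy model of queue ordering, not any real product's code.
# apply() and run_ci() are hypothetical stand-ins for staging and CI.
def process_queue(
    main: str,
    queue: list[str],
    apply: Callable[[str, str], str],
    run_ci: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    landed, ejected = [], []
    candidate = main
    for pr in queue:
        staged = apply(candidate, pr)   # main + every PR ahead of this one + this PR
        if run_ci(staged):
            candidate = staged          # green: the combined state becomes the new base
            landed.append(pr)
        else:
            ejected.append(pr)          # red: eject; later PRs are restaged without it
    return landed, ejected
```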

That works only if one invariant holds: the commit you tested is the commit you land. Same SHA, same bytes. The replay-after-CI strategy, where the queue runs CI against a temporary commit and constructs the actual merge commit at land time, produces a different commit even when the diff is identical. A different parent or a slightly different rebase moment is enough.
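To make that concrete, here is a minimal sketch (the SHAs and identities are made up) of why "same diff" is not "same commit": a git commit hash covers the tree plus the parents, author, committer, and timestamps, so a merge reconstructed at land time against a different parent gets a new SHA even when the file contents are byte-for-byte identical.

```python
import hashlib

# Compute the SHA-1 git would assign to a commit object with this tree,
# these parents, this identity line, and this message.
def commit_sha(tree: str, parents: list[str], ident: str, message: str) -> str:
    body = f"tree {tree}\n"
    body += "".join(f"parent {p}\n" for p in parents)
    body += f"author {ident}\ncommitter {ident}\n\n{message}"
    raw = body.encode()
    return hashlib.sha1(b"commit %d\0" % len(raw) + raw).hexdigest()

TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"      # identical file contents
IDENT = "CI Bot <ci@example.com> 1745424000 +0000"      # made-up identity

tested = commit_sha(TREE, ["aaaa" * 10], IDENT, "Merge pull request")
landed = commit_sha(TREE, ["bbbb" * 10], IDENT, "Merge pull request")
assert tested != landed   # same tree, different parent => different commit
```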

Branch protection doesn’t catch the drift, because branch protection runs on the temporary commit. The audit trail looks clean, because the PR went through every gate. The only way to know is to compare what main actually has to what CI signed off on, byte for byte. Most teams don’t do that.

That is the bug class GitHub hit on April 23.

Why it shows up at platform scale

A merge queue is one of those features where every edge case is a different code path. Squash, rebase, merge commits, force-pushes, conflicts that resolve mid-queue, status checks that go red between staging and landing, autoqueue, manual queue, branch protection rules that mutate the merge plan. Each of these has its own variant of the integrity invariant. Each can drift independently.

Inside a large platform, the merge queue is one square on a roadmap. The team that owns it ships a regression in one mode (say, squash combined with rebase) while tuning something adjacent in the staging step. The change passes review, it ships behind a feature flag, the flag rolls out, and a few hours later some PRs on main have the wrong contents.

This is the failure mode of treating a merge queue as a feature inside a larger product. The architecture is unforgiving regardless of how careful the team is. The blast radius of a wrong call is the main branch of every customer using the feature, and the invariants that make the feature safe live deep in the structure of the code, below the level a roadmap operates at.

What a product mindset looks like

When a merge queue is the product, the engineering trade-offs change.

You write down the invariants. You don’t ship a change that touches the merge path without a reviewer who has seen what happens when each invariant breaks. You build the architecture so the dangerous invariants are protected by structure: there is no path through the system that lands a SHA other than the one CI tested, because there is no reconstruction step at land time. The commit you merged is the commit you tested, by construction.
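As a sketch of what "no reconstruction step" can look like mechanically, here is one way to express it with plain git; the remote and branch names are assumptions, and this illustrates the invariant rather than how any particular queue is implemented.

```python
import subprocess

# Land a PR by fast-forwarding main to the exact SHA CI tested.
# --ff-only means this step cannot manufacture a new commit: either the
# tested SHA lands byte-for-byte, or the command fails and nothing lands.
def land_tested_sha(tested_sha: str, repo: str = ".") -> None:
    def git(*args: str) -> None:
        subprocess.run(["git", "-C", repo, *args], check=True)

    git("fetch", "origin", "main")
    git("checkout", "main")
    git("merge", "--ff-only", tested_sha)   # refuses to create a new merge commit
    git("push", "origin", "main")
```

The useful property is that the failure mode is loud: if landing would require building a new commit, the `--ff-only` merge errors out and the PR goes back through the queue instead of landing something CI never saw.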

When we built our merge queue, we eventually verified the core algorithm with TLA+. That’s the kind of investment you make when the cost of being wrong is “the customer’s main branch is broken and they don’t know which commits to trust.” For a feature on a roadmap, formal verification is overkill. For a product whose failure mode is silent corruption, it’s the floor.

If you got bitten

For anyone who used GitHub’s merge queue during the window, the audit looks roughly like this. For each merge commit on main between 16:05 and 20:43 UTC on April 23, get the tree hash of the merge commit. Compare it to the tree hash of the PR’s tested commit, modulo whatever the merge mode normalizes. Anything that doesn’t match is a candidate for review.
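A rough sketch of the main-branch half of that audit, in Python shelling out to git. The window comes from the incident times above; the mapping from each merge commit to the tree CI actually tested is the part you have to reconstruct from your own CI logs and queue history.

```python
import subprocess

SINCE = "2026-04-23 16:05 +0000"
UNTIL = "2026-04-23 20:43 +0000"

def git(*args: str) -> str:
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout

def merge_commits_in_window() -> list[tuple[str, str]]:
    """(commit SHA, tree SHA) for every merge commit on main in the window."""
    out = git("log", "origin/main", "--merges",
              f"--since={SINCE}", f"--until={UNTIL}", "--format=%H %T")
    return [(h, t) for h, t in (line.split() for line in out.splitlines())]

def suspects(tested_trees: dict[str, str]) -> list[str]:
    """Merge commits whose tree on main differs from the tree CI signed off on.

    tested_trees maps merge commit SHA -> tree SHA of the PR's tested commit,
    reconstructed from CI logs. Mismatches are candidates for review, not
    confirmed corruption: some merge modes legitimately normalize the tree.
    """
    return [commit for commit, tree in merge_commits_in_window()
            if commit in tested_trees and tested_trees[commit] != tree]
```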

This is not a fast script. The reason you can’t just check “what was the SHA CI signed off on” is that, for the bug class we’re describing, that SHA isn’t on the branch anymore.

If you have a merge queue, regardless of who runs it, you should know how to detect this class of bug. The audit doesn’t work after the fact unless you’ve kept enough trail to reconstruct what was tested. Most teams haven’t.
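If you want that trail going forward, the minimum is to record, at CI time, exactly what was tested. A minimal sketch, with a hypothetical log file as the destination:

```python
import json
import subprocess
import time

# At the end of every queue CI run, record which PR was tested and the exact
# commit and tree SHAs it was tested at, so a later audit has something to
# compare main against. The log file name is an arbitrary choice.
def record_tested_state(pr_number: int, log_path: str = "tested-shas.jsonl") -> None:
    sha = subprocess.run(["git", "rev-parse", "HEAD"],
                         check=True, capture_output=True, text=True).stdout.strip()
    tree = subprocess.run(["git", "rev-parse", "HEAD^{tree}"],
                          check=True, capture_output=True, text=True).stdout.strip()
    with open(log_path, "a") as fh:
        fh.write(json.dumps({"pr": pr_number, "commit": sha,
                             "tree": tree, "at": time.time()}) + "\n")
```

Where it lives matters less than that it exists somewhere the queue itself cannot rewrite.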

Build it accordingly

Merge queues sit on the critical path of every release. When they’re wrong, the wrongness is silent and the recovery is hard.

A feature on a roadmap is built well. A product whose failure mode is silent corruption is built so that a specific class of failure cannot happen.

We make a merge queue. We’re up at night about it.
