Thomas Berdy
May 6, 2026 · 11 min read

I'm the new hire on a codebase I didn't write

A new engineer at Mergify built a daily quiz, grounded in real test assertions, to learn the codebase he ships into. He scored three out of ten on it, and it's the cheapest learning loop he has found for the era when AI types most of the code.

I joined Mergify a few months ago. I ship code into the engine almost every day. And on the first quiz I took about how that engine actually behaves, I got three answers right out of ten.

That quiz is something I built. It’s a Claude Code skill that asks me one question a day, grounded in our test suite. I built it because I didn’t know my own codebase well enough to predict what it would do in ten specific scenarios, and I wanted a short, honest feedback loop to close that gap.

The problem: AI writes the code, so how do I learn the codebase?

The fastest way to understand a system is to change it. Your mental model gets built one edit at a time.

That path is narrower when Claude is the one typing. A lot of what I commit is code I reviewed and accepted. I wrote less of it than I used to, and I remember less of it, because I typed less of it. The code gets written. My mental model does not.

Where this bites hardest is pull-request review. I’ve had to review PRs touching parts of the engine I’ve never worked on. My options were limited: ask Claude to walk me through the codepath, or read it myself. Both work in the moment. Neither sticks. Reading an unfamiliar function is different from poking at it and asking yourself what happens when input X arrives in state Y.

PR review is also where I most need to be sharp. When you don’t know the codebase and Claude writes most of what lands in it, the only thing between a good diff and a bad one is your ability to tell them apart. That is not a skill you can fake.

Tests as the ground truth

The obvious move is to ask Claude to generate quiz questions about the codebase. I did not want that. A hallucinated quiz is worse than no quiz, because a wrong “correct answer” teaches you the wrong thing with confidence. I’ve seen it happen with generation-time distractors that were subtly also correct, and that was enough proof for me that ungrounded questions are a trap.

So the skill is built around a hard rule: every question has to trace back to a real assert statement in the Mergify test suite, with a file and line number shown in the reveal. Our tests cover a lot of edge cases, and they encode what the engine is supposed to do. If the quiz tells me the answer is A, I can click through to engine/mergify_engine/tests/unit/merge_protections/test_scheduled_freeze.py:308 and read the assertion myself. If I disagree with the test, that’s a different problem, but at least the conversation is about real code and not about something the model invented between two token predictions.

This is Anki with an LLM as the card author and the test suite as the source of truth. The useful difference is that the skill rescans the test tree on every run, so if we rename a file or add a new test module, the quiz picks it up automatically without me maintaining a single card.

Grounding in tests also means the quiz inherits whatever our test suite is weak on. A team without good coverage couldn’t build the same tool the same way. That’s a fair price.

A real question

Here’s one I got wrong on my second day.

A scheduled freeze is created with matching_conditions=[] (match all PRs) and exclude_conditions=[label=hotfix, label=urgent]. During the freeze window, three PRs exist: (a) no labels, (b) only label=hotfix, (c) both label=hotfix AND label=urgent. Which of these PRs are NOT frozen (i.e. allowed to merge)?

  • A. Only (c). exclude_conditions use AND logic: every listed condition must match for a PR to be exempt.
  • B. (b) and (c). exclude_conditions use OR logic: any one match is enough to exempt the PR.
  • C. Only (a). The freeze covers every PR that bears any matching label.
  • D. (a) and (b). matching_conditions=[] matches no PRs, so the freeze covers nothing except whatever the exclude list filters back in.

I picked B. The correct answer is A. The reveal pointed me at test_scheduled_freeze.py:308, which asserts p2_mp_check["status"] == "in_progress" for the PR carrying only label=hotfix. OR-logic on exclude_conditions would have let that PR through. AND-logic keeps it frozen. I’d never thought about it before, and now I do.

Under the hood

The whole thing is one SKILL.md file. Claude Code skills are effectively long prompts with scoped tool access, and the entire program here (selection logic, spaced repetition, grounding rules, history schema) runs on 321 lines of Markdown. No separate service, no database. The state lives in ~/.claude/mergify-quiz/history.jsonl, one JSON object per answered question.
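
To make that concrete, here is a minimal sketch of appending one record to history.jsonl. The field names are my illustration rather than the skill's exact schema, but the shape is the idea:

```python
# Minimal sketch of appending one answered-question record to history.jsonl.
# Field names are illustrative, not the skill's exact schema.
import datetime
import hashlib
import json
import pathlib

HISTORY = pathlib.Path.home() / ".claude" / "mergify-quiz" / "history.jsonl"

record = {
    "question_id": hashlib.sha1(b"stem plus the four options").hexdigest(),
    "pillar": "merge_protections",
    "topic": "merge_protections.scheduled_freeze",
    "grounding": "engine/mergify_engine/tests/unit/merge_protections/test_scheduled_freeze.py:308",
    "outcome": "wrong",  # "correct" | "wrong" | "skipped"
    "answered_at": datetime.date.today().isoformat(),
}

HISTORY.parent.mkdir(parents=True, exist_ok=True)
with HISTORY.open("a") as fp:
    fp.write(json.dumps(record) + "\n")
```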

Each morning, the skill walks this loop end-to-end:

```mermaid
flowchart TD
    A[Walk the test tree] --> B[Pick a pillar<br/>weight = mastery × novelty]
    B --> C[Pick a topic inside it<br/>weight = miss rate]
    C --> D[Filter eligible questions<br/>via spaced-rep cooldown]
    D --> E[Open the chosen test file<br/>pick a grounding assertion]
    E --> F[Generate MCQ + 3 distractors]
    F --> G[Ask me one question]
    G --> H[(history.jsonl)]
    H -. updates weights for tomorrow .-> A
```

The skill follows our top-level test directory layout. Each major subsystem becomes a pillar, and test file paths inside it become topic labels. For example, tests/unit/queue/merge_train/test_outcome.py becomes a queue.merge_train.outcome topic under the merge_queue pillar. Topics aren’t predefined. The skill discovers them on every run by walking the test directory and matching against the labels already in history.
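
The path-to-label transform is mechanical enough to sketch. The helper below is an illustration, not the skill's literal logic; in particular, the real pillar names (like merge_queue) don't always match the directory name one-to-one:

```python
# Illustration of the path-to-label transform described above.
from pathlib import Path

def labels_for(test_path: str, root: str = "tests/unit") -> tuple[str, str]:
    """tests/unit/queue/merge_train/test_outcome.py
    -> pillar 'queue', topic 'queue.merge_train.outcome'"""
    parts = list(Path(test_path).relative_to(root).parts)
    parts[-1] = parts[-1].removeprefix("test_").removesuffix(".py")
    return parts[0], ".".join(parts)

print(labels_for("tests/unit/queue/merge_train/test_outcome.py"))
# ('queue', 'queue.merge_train.outcome')
```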

Pillar selection is weighted by mastery and novelty. If I’ve been acing merge_queue lately, its weight drops. If I haven’t seen merge_protections in three weeks, its weight climbs:

```
weight_pillar = (1 - mastery + 0.1) * novelty_boost
novelty_boost = 1 + min(days_since_last, 30) / 30
```

The + 0.1 floor is there on purpose. Even a pillar I’ve nailed stays eligible, because I’ve seen firsthand that “mastery” on ten questions is not the same as mastery on the real surface area.

Topic selection inside the chosen pillar uses a different formula:

```
weight_topic = (1 + incorrect) / (1 + asked)
```

A topic I’ve never seen starts at 1/1 = 1.0, which keeps new areas in rotation. One I keep missing gets surfaced more often, and one I keep getting right becomes less frequent without disappearing.
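
Both formulas fit in a few lines of Python. The skill states them in prose instructions rather than code, so treat this as an illustration of the math with toy numbers, not an excerpt:

```python
# Illustration of the two selection formulas; the skill expresses them in prose.
import random

def pillar_weight(mastery: float, days_since_last: int) -> float:
    novelty_boost = 1 + min(days_since_last, 30) / 30
    return (1 - mastery + 0.1) * novelty_boost  # the 0.1 floor keeps mastered pillars eligible

def topic_weight(asked: int, incorrect: int) -> float:
    return (1 + incorrect) / (1 + asked)  # an unseen topic starts at 1/1 = 1.0

# Toy numbers: a pillar I've been acing versus one I haven't seen in three weeks.
pillars = {
    "merge_queue": pillar_weight(mastery=0.9, days_since_last=1),         # ≈ 0.21
    "merge_protections": pillar_weight(mastery=0.3, days_since_last=21),  # ≈ 1.36
}
chosen = random.choices(list(pillars), weights=list(pillars.values()), k=1)[0]
print(chosen)  # merge_protections, most of the time
```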

Then there’s a spaced-repetition filter. Each question is identified by a SHA1 hash of its stem plus its four options. After I answer, the question is hidden from the pool for a window that depends on how I did:

```mermaid
flowchart LR
    Q[Question answered] --> R{Outcome}
    R -->|correct| C[Hidden 14 days]
    R -->|skipped| S[Hidden 7 days]
    R -->|wrong| W[Hidden 3 days]
    C --> P[Back in pool]
    S --> P
    W --> P
```
The 14/7/3 numbers aren’t tuned; they’re a first guess based on feel. I’ll probably revisit them once I have a few hundred entries instead of ten.
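
The dedup-and-cooldown logic is about as small as it sounds. Here's a sketch, with my guess at how the stem and options get concatenated before hashing:

```python
# Sketch of the question-identity and cooldown check. The exact concatenation
# before hashing is my guess, not the skill's literal recipe.
import hashlib
from datetime import date

COOLDOWN_DAYS = {"correct": 14, "skipped": 7, "wrong": 3}

def question_id(stem: str, options: list[str]) -> str:
    return hashlib.sha1((stem + "".join(options)).encode()).hexdigest()

def back_in_pool(last_outcome: str, answered_on: date, today: date) -> bool:
    return (today - answered_on).days >= COOLDOWN_DAYS[last_outcome]

print(back_in_pool("wrong", date(2026, 5, 3), date(2026, 5, 6)))  # True: 3 days elapsed
```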

The last piece is generation. Once selection has narrowed down to a specific test file, the skill opens it, picks a grounding assertion (preferring edge-case tests with descriptive names over assertions that just check construction), frames it as a scenario-based multiple-choice question, and writes three plausible distractors. Distractors were the part I worried about most. An obviously wrong option is useless. The skill instruction list tells Claude to pull distractors from adjacent enum values or from the behavior of a related codepath in a different scenario. It works better than I expected, though I’ve caught a couple that were subtly also correct, which is its own kind of bug.

Why a skill, not a one-shot prompt

I could open Claude on any given morning and type “quiz me on scheduled_freeze edge cases, cite the tests.” That would work once. What a skill adds is the boring operational stuff: a grounding rule the model can’t forget because it’s re-read at the start of every invocation, and a persistent history across sessions so I’m not re-answering the same question on Tuesday that I got right on Monday. The daily cadence is just a workflow I’ve built around it. None of that exists if the quiz is a prompt I retype each time.

What 3/10 actually taught me

The score told me where to look. The reveal told me what I was missing.

Every answered question ends with a two-to-four-sentence explanation of why the correct answer is correct in terms of engine behavior, and a citation block pointing to the exact assertion in its test file, plus a link to the production code that assertion exercises. I’ve learned more from reading the explanation and jumping into the cited file than I have from being told whether my guess was right.

One topic I missed twice on different days was merge_protections.scheduled_freeze. The freeze question above is one of those two. I’m not going to pretend I memorized the correct answer. What stuck was the codepath. I now have a mental picture of how scheduled freezes interact with matching_conditions and exclude_conditions, because I was wrong in a specific way and then followed the assertion into the production module. A lecture wouldn’t have done that for me, and neither would an LLM chat. Being wrong on a concrete scenario, with the right answer one click away, did.

There’s also a ritual to it. I answer a question during my tea break, or while I’m waiting for a Claude session that’s too slow to multitask on but too short to start a whole new topic. Those small idle moments in AI-assisted work are new. They used to be filled by reading, or by the next compile. Now they fit exactly one quiz question.

What doesn’t work yet

The worst part of the current skill is that it doesn’t model difficulty. Common-path behavior and rare edge cases get sampled with equal probability. As someone new to the codebase, I’d much rather start with the high-traffic paths and work my way out to the weird corners once I can predict the common ones. Right now, the quiz might throw a specific race-condition scenario at me before I’ve seen the simple “what does autoqueue: true do” question.

The fix is probably a difficulty tag per test file. I’d infer it from cues like directory depth and keywords like edge or flaky in test names, and fall back to manual annotation when the heuristic misses. I haven’t done it yet because I wanted the skill to be useful at ten questions before I tuned it for ten thousand.
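
To make the idea concrete, the heuristic I have in mind looks roughly like this. It's untested, and the keywords and thresholds are a guess:

```python
# The difficulty heuristic I have in mind: untested, thresholds are a guess.
from pathlib import Path

EDGE_KEYWORDS = ("edge", "flaky", "race", "retry", "timeout")

def difficulty(test_path: str) -> str:
    parts = Path(test_path).parts
    name = Path(test_path).stem.lower()
    depth_score = max(len(parts) - 4, 0)  # deeper than tests/unit/<pillar>/ adds 1 per level
    keyword_score = 2 * sum(kw in name for kw in EDGE_KEYWORDS)
    return "edge" if depth_score + keyword_score >= 2 else "common"

print(difficulty("tests/unit/queue/merge_train/test_outcome.py"))             # common
print(difficulty("tests/unit/queue/merge_train/test_outcome_edge_cases.py"))  # edge
```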

Ten is also the real sample size I’m working from. The design isn’t validated; the loop is just cheap enough that one Markdown file already paid for itself the first time I was wrong about scheduled freezes.

When this is worth copying

If your newer engineers are spending most of their keystrokes reviewing rather than writing, and you have a real test suite they can’t just skim their way around, the passive-learning path through a codebase is narrower than it used to be. You have to replace it with something deliberate, and the cheapest option I’ve found is a daily, grounded quiz that takes about two minutes to answer during a tea break or while a Claude session is still thinking.

Tests make good ground truth, and a single Markdown file is enough runtime. Three correct answers out of ten is a fine place to start.
