How Agents Work With Parallel Agents

In March I capped myself at two parallel AI coding agents. Now I run a dozen, and the real bottleneck moved from writing code to reviewing it.

Eleven weeks ago I wrote a post explaining why I capped myself at two AI coding agents running in parallel. A third one, I argued, pushed the cognitive load past what I could manage, the way a manager’s span of control breaks down past a certain number of reports. I believed it. It was true at the time.

It isn’t true anymore. I now run a Claude Code session per task, sometimes a dozen at once, and I’m less frazzled than I was juggling two in March.

This is the follow-up, and it’s mostly me explaining why my own headline advice expired in about two months.

The two-agent rule had a shelf life

The rule was right for March. I just didn’t realize how short March was.

Back then, every agent needed steady babysitting: a clarification here, an approval there, a nudge when it wandered off. Adding a third didn’t add a third of the load, it multiplied it, because I was the scheduler context-switching between live processes that each stalled the moment I looked away. That’s the span-of-control wall, and it was real.

Two things moved it. The first: agents got more autonomous. With auto mode and a clear goal stated up front, I can hand over a whole piece of work and walk away while it runs. The load per agent dropped.

But lower per-agent load only explains why I could comfortably run three or four, not why I run a dozen. More autonomy on its own just means less babysitting. It doesn’t dissolve a ceiling. What dissolved the ceiling was figuring out where my old mental model was wrong about what the load even was.

The session I lost in a hidden worktree

Here’s the problem that taught me that.

My typical flow: start a session to implement a feature, push it, watch CI go green, close the session, move on. A day later the reviews come back, and I want the original context, the agent that already knows why I made every call, so I can address feedback fast.

Except the feature lived in our support tool, a different repository, in a different git worktree, and I never renamed the session. So I sat there trying to reconstruct which directory, which branch, which of forty terminal tabs. I’ve lost real hours to exactly this. Finding a session I’d opened three days earlier, just to ask it to fix some tests, turned into an archaeology dig.

It gets worse when I fan out. I often ask a session to spawn parallel sub-agents to validate and challenge a solution. Great for quality. But now my original session is one row buried under a pile of review sub-agents, and the parent is the one I actually need. The thing that made my work better made it harder to navigate. That second-order effect, more than raw agent count, was what made parallelism feel heavy.

What the cognitive load actually was

Once I saw the lost-session problem clearly, the “ceiling” came apart into two different costs I’d been lumping together.

One is concurrent monitoring: knowing, right now, which agent needs me. The other is retrieval: finding a specific agent again later, sometimes days later, sometimes under a stack of its own children. March me felt both as one undifferentiated “this is a lot to hold in my head” and concluded the safe number was two.

The Claude Code agents view splits them and chips away at both. For monitoring, status is explicit: I see at a glance which session is waiting on me instead of polling each one. For retrieval, I rename sessions so “fix the flaky runner test” stays legible, and I pin the ones that matter so they don’t sink.

So the division of labor is: autonomy lowered how often each agent needs me, and the view changed how I find out. Together they turn babysitting into a queue I drain top-down. The agent count stopped being the limit. What bites is how many sessions go red in the same five minutes, and because independent tasks rarely finish at the same moment, that’s almost never more than two. On a normal day I’ll have at least ten sessions going, and most carry their task from /goal to a pushed stack without me stepping in, so they sit green and out of mind until they’re done. A dozen idle-but-ready sessions cost me almost nothing. Two simultaneous interrupts cost exactly what they always did.

Pin as a soft-delete

One habit deserves its own note. When a session looks done, I don’t close it. I pin it, and keep it pinned until the work is approved, merged, and deployed.

That’s deliberate. Once a PR is merged it’s usually fine, but “usually” is carrying weight in that sentence. Follow-ups happen: a reviewer comes back late, or a bug shows up in staging under load. When it does, the pinned session is the original context, already warm, and picking up from there beats a cold start. Closing the moment CI goes green is tidying up at exactly the wrong time.

flowchart LR
    A[Start session per task] --> B[Push, CI green]
    B --> C[Pin: soft-delete]
    C --> D{Approved, merged,<br/>deployed?}
    D -- "no / follow-up" --> E[Reopen pinned session,<br/>context still warm]
    E --> B
    D -- yes --> F[Unpin: real delete]

A session for every task, including the boring ones

Once finding agents stopped hurting, the two-agent ceiling came apart on its own. I now open a session for nearly every task I’m handed.

The clearest payoff is my Linear backlog. Like everyone, I had a long tail of small chores that never won a priority fight against real feature work. Reject NUL bytes in a config payload. Add deterministic row ordering to kill a rare deadlock. Individually trivial, collectively a nagging debt, never the most important thing on any given morning.

It’s now cheaper to hand the whole chore to an agent than to keep re-deciding whether it deserves my afternoon. For the small ones I reuse one line:

/goal ultrathink Tackle: <LINEAR_ISSUE_ID>.
Use your own worktree with a new branch.
Commit your changes. Then, push the stack.

Plan, spec, build, test, push the stack. Then I review the pushed PR myself, like any other. Over the last three weeks I closed around thirty issues this way, a good chunk of them the exact small-but-never-prioritized chores that had been sitting for a month.

I won’t pretend all thirty landed perfectly first try. The straightforward ones usually needed little from me, a handful needed a second pass, and a couple I’d have written differently by hand. The under-specified ones are where I lean hard on Plan mode before any code gets written, the same front-loading I wrote about in March. The throughput number means nothing on its own without a quality bar, and the only thing that makes it defensible is the review gate I’ll get to below. (Yes, the prompt is still a manual copy-paste, which is a little embarrassing given where I work. Turning it into a skill is the obvious next step.)

The bill came due in code review

Here’s the part I won’t pretend away: review got worse.

This was already the constraint in my March post, and shipping more small PRs from a session-per-task flow made it tighter. More throughput upstream means more pressure downstream, and downstream is a human reading a diff. The honest summary is that I made my own output cheaper partly by making my teammates’ review more expensive. I don’t get to celebrate the first half and skip the second.

There’s a sharper version of the critique too. Clearing a backlog of “never-prioritized” chores assumes they were under-prioritized. Maybe they were correctly deprioritized, and I’ve now spent the scarcest thing on the team, a reviewer’s attention, on work we’d already judged wasn’t worth doing. I think most of the thirty were worth it. I’m not certain about all of them.

Did the team actually get faster, or did I just move my bottleneck onto everyone else? Faster, I think, for a dull reason: writing code was the obvious constraint, and clearing it handed back the very time review now demands. I think that trade comes out ahead, though I’m watching it closely.

So the real work right now is making review survive the volume. We released Mergify Stack to split big batches into small, independently reviewable pieces, so a reviewer faces a focused change instead of a wall. We adapted our review rules to match. And for simple, obvious changes, we’re moving toward reviewing the intent rather than every line.

That last shift is the contentious one, and I want to be honest about its price. In practice it means that for a change like rejecting NUL bytes in a config payload, the reviewer reads the issue and the test, checks the diff does what the ticket asked, and trusts the rest. It only works if the author owns the outcome, fully. It trades line-by-line scrutiny for trust, and trust has a failure mode: misplaced, it rubber-stamps a bad change. We’re betting that strong ownership plus good gates beats that risk. I honestly don’t know yet whether the bet pays off. Ask me in three months.

Agents reviewing agents

The way we keep that bet from blowing up is by raising the quality of what reaches a human at all. When volume is this high, the reviewer’s time is the bottleneck, so nothing should land in their queue unless it already clears a bar.

So there’s a layer of agents in front of the human, not in place of the human: every PR still gets a human review, at least for now. I keep work in GitHub Draft mode until it’s actually ready, out of everyone’s queue. Before promoting it, I let agentic reviewers go at it. Copilot reads the diff as a separate model. And instead of asking the session that wrote the code to grade its own homework, I spin up review agents with no context or a deliberately fresh one, using off-the-shelf skills like grill-me, so they come at the change cold and without the author’s blind spots.

This is the second meaning of the title, and it’s the part that still feels strange to me. I’ve shifted from writing code to orchestrating agents that write it, and those agents orchestrate their own sub-agents to check each other. By the time a teammate opens the PR, several agents have already argued about it. The human is the last reviewer, not the first. I can’t yet prove that gate catches what a careful human would. It’s the load-bearing assumption under everything above, and it’s still an assumption.

What I’d tell March me

The lesson has nothing to do with the number. The right count of parallel agents is whatever your tooling and the agents’ autonomy currently allow, and both move fast enough that any fixed rule you write down is a snapshot with a short expiry date. Mine lasted about two months. Your tools will differ and keep changing too; the part that transfers is the distinction underneath, between watching agents and finding them again.

Two things actually moved the ceiling, and neither was a number: agents that can carry a task on their own, and a view that tells me which one needs me. Once those landed, counting agents stopped being the useful question. The better question is where the cost goes when you stop writing the code yourself. For me it went straight into review, and that’s the problem I’m working on now. The agents turned out to be the easy part.

How Agents Work With Parallel Agents

The two-agent rule had a shelf life

The session I lost in a hidden worktree

What the cognitive load actually was

Pin as a soft-delete

A session for every task, including the boring ones

The bill came due in code review

Agents reviewing agents

What I’d tell March me

Tired of broken main branches?

Recommended posts

Merge Queues Were Built for Humans. AI Agents Need More.

We Let an LLM Open Pull Requests Against Your .mergify.yml. The Prompt Was the Easy Part.

I'm the new hire on a codebase I didn't write