Mehdi Abaakouk
May 13, 2026 · 14 min read

225 Self-Hosted GitHub Actions Runners: Why We Picked Docker Over VMs

Our GitHub Actions bill hit $400 a day. We moved CI in-house to three bare-metal hosts, tuned libvirt VMs to match, and then dropped the VM tier entirely for Docker. Here's the path, the moments that decided it, and what bit us along the way.

Our GitHub Actions bill hit $400 on a normal working day. The dashboard had been climbing for a while, but that was the number that made me actually start the work. I’d been wanting to bring CI in-house anyway. The bill pushed it from someday to this quarter.

Cost was the obvious reason. The less obvious one is that we sell CI Insights to teams running their own CI infrastructure, and we’d been running on GitHub-hosted, which is the one setup our customers don’t have. Bringing CI in-house would close that gap. Three bare-metal boxes in a colo seemed like the cleanest version: predictable bill, real infrastructure to dogfood the product against.

Three minutes bare-metal, thirteen minutes in a VM

The boxes arrived a few weeks later. The first thing I did was clone our test suite onto one of them and run it directly on the bare-metal host. Three minutes. The same suite that took 5-6 minutes on GitHub-hosted. Encouraging.

Then I dropped the same code inside a libvirt VM on the same host. Thirteen minutes.

That gap was the actual problem. Hardware that could run the suite in three minutes was running it in thirteen with one virtualization layer in the way. Before deciding the architecture for 225 runners, I wanted to know what a properly tuned VM could do.

Tuning the VM, knob by knob

I worked through the standard tuning loop. Every self-hosted runner post mentions this list; few show the knobs that actually moved the number.

  • amd_pstate=active on the kernel cmdline. The biggest win in the lot. Our host provider ships with default CPU throttling that pegs CI workloads at roughly half the AMD chip’s real ceiling. One line of cmdline gave us back the CPU we were paying for.
  • Hugepages reserved at boot. A small gha-tuning.service unit reserves 245760 × 2MB (480 GiB) on each host (a sketch of the unit follows this list).
  • CPU pinning per VM.
  • cache=unsafe on the VM root disk.
  • tmpfs for /home/runner/_work and /var/lib/docker inside the VM.
  • RAM up from 4 GiB to 8 GiB, with the VM count fixed at 60 per host so it’d fit.
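
For reference, the two knobs that need persistent host state look roughly like this. The hugepage count, the unit name, and the amd_pstate flag come from the list above; the file paths, the oneshot layout, and the ordering against libvirtd are assumptions rather than our literal config:

# Kernel cmdline fix for the provider's default throttling (path is distro-dependent)
# /etc/default/grub.d/99-gha.cfg
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT amd_pstate=active"

# gha-tuning.service -- reserve 245760 x 2MB hugepages (480 GiB) at boot
[Unit]
Description=Reserve hugepages for GHA runner VMs
Before=libvirtd.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 245760 > /proc/sys/vm/nr_hugepages'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target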

By the time I’d worked through all of these, the VM was hitting 5-6 minutes per job, matching GitHub-hosted. The architecture was good enough to ship.

Except I noticed something else.

The cost the benchmarks didn’t show

Every change to the runner image needed a packer rebuild, then a redeploy across the pool. The first time you do that, it’s a Tuesday afternoon. By the fifth or sixth time, it’s a meter you’re trying to ignore. Job duration is what we measure because dashboards show it. Image-iteration time is what we feel, every time a runner image needs a small tweak. The benchmarks had decided nothing yet.

The pivot to Docker

A few weeks of tuning in, with the VM at 5-6 minutes and not much obvious left to optimize, I sketched the alternative one evening. Drop the VM tier entirely. Run each runner as a container directly on the host, each in its own fs/net namespace, all sharing the host’s dockerd. If a container can give us per-runner isolation good enough for our threat model, the entire VM layer is something we’re paying for and don’t need.
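
To make "one runner = one container on the shared dockerd" concrete, here is roughly what launching a single runner looks like through the Docker SDK. The image name, environment variables, and the socket mount are illustrative assumptions; the memory and CPU numbers mirror the first pool in the farm.yaml shown later:

# Sketch: one GitHub Actions runner as a plain container on the host's dockerd.
# Image name, env vars, and the socket mount are assumptions for illustration.
import docker

client = docker.from_env()

client.containers.run(
    "gha-runner:latest",                 # hypothetical runner image
    name="gha-runners-001-9g-03",
    hostname="gha-runners-001-9g-03",
    detach=True,
    mem_limit="9g",                      # first pool in farm.yaml: memory: 9, cpus: 2
    nano_cpus=2_000_000_000,
    environment={
        "RUNNER_LABELS": "self-hosted-ubuntu-24.04",
        "RUNNER_GROUP": "self-hosted",
    },
    volumes={
        # one way for jobs to reach the shared daemon; whether the real image
        # mounts the socket or reaches dockerd another way is an assumption
        "/var/run/docker.sock": {"bind": "/var/run/docker.sock", "mode": "rw"},
    },
    restart_policy={"Name": "unless-stopped"},
)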

I built a parallel reconciler called gha-farm2 next to the existing VM one (and gha-build-image2 for the image), so the two paths could run side by side. After a week of cohabitation, per-job wall time on Docker landed in the same 5-6 minute window as the tuned VM, with all variation comfortably inside job-startup noise.

That’s where the decision became real. Job duration didn’t pick a winner. What did pick a winner was the loop around it. Booting a fresh runner pool after an image change is seconds with Docker and minutes with libvirt. Iterating on the runner image is a Dockerfile push for one path and a packer rebuild for the other, and that gap compounds every time you touch the image.

Firecracker, briefly

Firecracker and similar microVM runtimes came up briefly during the pivot. They’d give us back VM-grade isolation at container-grade speed, and on paper they’re the right answer. We didn’t go that way because the isolation we’d be buying back is isolation our current workload doesn’t need: private repos only, no secrets in workflow context. If we ever opened the fleet to public-repo CI, the calculus flips and we’d pay the Firecracker setup cost then.

The security trade-off, explicit

Docker is less isolated than a VM. The Docker pool only runs against non-public repositories. Enforcement is at GitHub’s runner-group layer: the Docker pools are scoped to private repos through runner groups, and nothing in a public repo’s workflow can route a job onto them.

Keeping CI secrets out of workflows that target these runners is a separate constraint, and it’s not something runner groups enforce. That part is convention: secret-handling jobs are structured to run elsewhere. The runner-group scoping handles the repo-isolation boundary automatically; the secrets discipline is on us.

Why not Kubernetes or autoscaling?

There’s a fork in the road for any team running its own CI. The dynamic option is something like actions-runner-controller (ARC): a Kubernetes operator that scales GitHub Actions runners up and down based on workflow demand. GitHub’s own Larger Runners sit in the same family — you pay per minute, but the system spins up capacity on demand. With either, queue time stays near zero. The bill comes out wherever it lands.

The static option is the opposite. A fixed number of always-on runners, like ours. The bill is flat. But when 226 jobs arrive at once, the 226th one waits its turn.

You get to bound one of those variables. The other will float.

quadrantChart
    title What you bound vs what you let float
    x-axis Bill floats --> Bill bounded
    y-axis Queue time floats --> Queue time bounded
    quadrant-1 Hybrid
    quadrant-2 ARC / autoscale
    quadrant-3 Capped GH-hosted
    quadrant-4 Our static fleet

We bound the bill. The fleet caps cost, and the dogfooded CI Insights chart tells us when queue time has drifted out of what we’re willing to tolerate, at which point we add a host. ARC would have given us the opposite default, and we’d have spent the same evenings fighting bill spikes that we now spend nudging pool counts.

On top of that, we don’t operate a Kubernetes cluster. ARC’s complexity isn’t worth standing one up for. A few hundred lines of Python and a farm.yaml reconciler beat a cluster operator when “a cluster” is the line item you’re trying to avoid.

The fleet shape

pyinfra/inventory.py lists three hosts: gha-runners-001/002/003. The per-host layout in farm.yaml:

servers:
  gha-runners-001:
    pools:
      - memory: 9
        cpus: 2
        count: 60
        labels: self-hosted-ubuntu-24.04
        group: self-hosted
      - memory: 12
        cpus: 2
        count: 15
        labels: self-hosted-ubuntu-24.04-12g
        group: self-hosted

Three hosts × (60 + 15) = 225 concurrent runners. No autoscaling. We cap the fleet, watch queue time per pool, and adjust counts by hand when the picture shifts.

How we picked the 60/15 split: we ran the workload for a few days, watched the queue-time-per-pool chart in CI Insights, saw the 12G pool taking the heat, and shifted counts until both pools sat flat. No autoscaling, no per-job billing — a chart and a knob.

The dogfooding paid off here. Sizing a fixed fleet against your actual queue-time distribution is exactly the workload CI Insights was built for, and running it on our own farm shook out features we wouldn’t have prioritized otherwise.

What runs on each host

Two layers per host:

  1. pyinfra/deploy.py provisions everything: APT packages, /etc overrides (sysctl drops, udev rules, systemd units), GRUB kernel cmdline with a conditional reboot, Datadog agent with infra + docker metrics, the proxy/mirror stack, and the runner image.
  2. On-host scripts under /root/gha-runners/scripts/ (gha-farm, gha-build-image, gha-proxy) manage the runtime: containers, image, registries.

gha-farm is a declarative, busy-aware reconciler. It reads farm.yaml, asks GitHub which runners exist and which are busy, and produces a plan: create, recreate, destroy, drain. In-flight jobs are never killed.
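
The planning pass is small enough to sketch. The helper names (list_github_runners, list_host_containers) are hypothetical stand-ins for the real GitHub API and Docker calls, and the plan fields are illustrative:

# Sketch of gha-farm's planning step: diff farm.yaml against GitHub and the
# local dockerd, never touching a runner that GitHub still reports as busy.
import yaml

def plan(farm_file: str, host: str) -> dict:
    pools = yaml.safe_load(open(farm_file))["servers"][host]["pools"]
    github = {r["name"]: r for r in list_github_runners()}   # hypothetical helper
    running = set(list_host_containers(prefix=host))          # hypothetical helper

    wanted = set()
    for pool in pools:
        for i in range(pool["count"]):
            wanted.add(f"{host}-{pool['labels']}-{i:03d}")

    actions = {"create": [], "destroy": [], "drain": []}
    for name in sorted(wanted - running):
        actions["create"].append(name)
    for name in sorted(running - wanted):
        if github.get(name, {}).get("busy"):
            actions["drain"].append(name)     # strip labels, poll, drop later
        else:
            actions["destroy"].append(name)
    return actions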

gha-proxy runs a set of cached upstreams bound to 172.17.0.1 so only on-host containers can reach them:

  • Docker registry mirror on :5000
  • verdaccio on :4873
  • devpi on :3141

flowchart LR
    subgraph host["Bare-metal host"]
        c1["Runner container 1"]
        c2["Runner container 2"]
        cN["Runner container N"]
        proxy["gha-proxy<br/>172.17.0.1"]
        c1 --> proxy
        c2 --> proxy
        cN --> proxy
    end
    proxy -->|":5000 mirror"| dh["Docker Hub"]
    proxy -->|":4873 verdaccio"| npm["npm"]
    proxy -->|":3141 devpi"| pypi["PyPI"]

We didn’t plan to ship the proxy on day one. The fleet forced it. The second a job tried to pull semgrep/semgrep:1.159.0 on the new box, Docker Hub’s unauthenticated rate limit fired:

“Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit.”

The same workload had worked on the first host because we’d been pulling slowly. With 60 containers reaching for the same image at the same time, we hit the wall instantly.

The proxy stack is the piece of infrastructure I’d insist on for anyone running this kind of setup. Every package provider has a rate limit or quota you’ll hit the second you point a fleet at it directly. Docker Hub publishes its pull limits (usage docs); npm has had rolling rate limits since 2017 (npm blog); PyPI is the same story. Once the cache is in, you stop thinking about those limits. Latency drops as a bonus, and the first time PyPI has an outage the cache covers you for free.
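
Pointing the fleet at the cache is a handful of config lines. The paths and the devpi index name here are the stock defaults, so treat this as a sketch rather than our exact files:

# /etc/docker/daemon.json on each host -- every pull goes through the mirror first
{
  "registry-mirrors": ["http://172.17.0.1:5000"]
}

# .npmrc baked into the runner image
registry=http://172.17.0.1:4873

# pip.conf baked into the runner image (devpi's default PyPI mirror index)
[global]
index-url = http://172.17.0.1:3141/root/pypi/+simple/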

Where GitHub got in the way

There’s no event-driven “this runner is done” signal from GitHub, and the runner lifecycle API has rough edges around it. Everything in this section is the workaround stack we ended up with.

The GHA runner agent (Runner.Listener) doesn’t expose a “drain, then exit cleanly when the current job is done” mode. SIGTERM eventually kills it, but inside the 30-second graceful window the agent will still accept a new job. CI never goes idle long enough for the kill window to coincide with a quiet runner, so you can’t just wait for SIGTERM to do the right thing.

The closest workaround is to PATCH the runner through GitHub’s API and strip its labels. Once the labels don’t match what any workflow targets, the dispatcher stops sending it work. That part is clean enough. The dirty part comes next: GitHub gives you no callback when the in-flight job ends, so you reconcile three sources of truth by polling.

stateDiagram-v2
    [*] --> Active
    Active --> Draining: PATCH runner, strip labels
    Draining --> Polling: dispatcher stops sending work
    Polling --> Polling: GitHub busy=true
    Polling --> Polling: container process alive
    Polling --> SafeToDrop: all three sources agree idle
    Active --> Orphaned: host dies hard
    Orphaned --> SafeToDrop: GitHub releases (eventually)
    SafeToDrop --> [*]

The three sources are GitHub’s busy flag, the container list (is the runner process still alive?), and the host’s own reachability (did the host itself disappear?). There’s always a window where they disagree. A webhook or a single end-of-job event would collapse the whole thing into ten lines.
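
The poller reduces to something like this. The org-level runner endpoint and its busy field are real GitHub API surface; the container check, the identifiers, and the session setup are hypothetical:

# Sketch of the three-source idle check behind the state diagram above.
import subprocess
import requests

def github_busy(session: requests.Session, org: str, runner_id: int) -> bool:
    # session carries the Authorization header already;
    # GET /orgs/{org}/actions/runners/{runner_id} includes a "busy" flag
    resp = session.get(f"https://api.github.com/orgs/{org}/actions/runners/{runner_id}")
    resp.raise_for_status()
    return resp.json()["busy"]

def host_reachable(host: str) -> bool:
    return subprocess.run(["ping", "-c", "1", "-W", "2", host],
                          capture_output=True).returncode == 0

def poll_state(session, org, host, container, runner_id) -> str:
    if not host_reachable(host):
        return "orphaned"                       # wait for GitHub to release it
    if github_busy(session, org, runner_id):
        return "polling"                        # GitHub still says busy
    if job_process_alive(host, container):      # hypothetical container check
        return "polling"                        # a job process is still running
    return "safe-to-drop"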

We did look at --ephemeral, which auto-deregisters a runner after one job and would sidestep the drain problem. It didn’t fit: our fleet runs as a static control plane, with a fixed number of long-lived runners per pool. Switching to ephemeral turns it dynamic, with a new container after every job, which is more state to track and reconcile for no benefit our workload would notice.

And when a runner host dies hard, the picture gets worse. You hit DELETE /repos/.../actions/runners/{id} on a registration whose host is offline. GitHub returns 422 or 500 with no reason in the body. The runner sits in “registered, offline” state. GitHub eventually releases it, but the delay is long enough that you’ll have moved on to something else first.

The two API gaps I’d most like GitHub to close: an event-driven signal for “this runner has finished its in-flight work and is safe to drop”, and an explicit deprovision endpoint that respects state and returns a real error reason when it refuses.

Things that bit us along the way

A short list, in roughly the order we hit them.

  • Host swap death at ~90 containers per host. The cgroup memory limits weren’t enough to stop the host kernel from swapping when the working set crossed physical RAM. Had to reboot. Backed off to 75 containers per host.
  • overlay2 on tmpfs. We started with /var/lib/docker on tmpfs for speed. Docker’s overlay2 driver refused to mount on it (level=error msg="failed to mount overlay: invalid argument"). Switched to NVMe-backed storage with no measurable cost.
  • Docker daemon config conflict. --registry-mirror flag and registry-mirrors in daemon.json set at the same time: the following directives are specified both as a flag and in the configuration file. Pick one.
  • hostedtoolcache permissions. Sharing a hostedtoolcache mount across runners produced ./python: Permission denied from actions/setup-python. Fixed by giving each runner its own toolcache. Lost some cache hits, gained back the ability to run the suite.
  • Dashboard vitest OOM on the 12G pool. This is what the 12G tier was supposed to fix. It didn’t. Two changes landed in the monorepo: cap Node heap at 4G and switch vitest to a single fork (commit 47942b7e23), then shard the front-test into a 2-shard vitest --shard matrix (commit a054d0214e, sketched below the list). Even an oversized runner is not a substitute for parallelism in the test config.
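
The shard matrix from that last item, sketched as workflow YAML. The job and step names are illustrative; the flags are the ones the two commits describe, and the single-fork switch lives in the vitest config rather than here:

front-test:
  runs-on: self-hosted-ubuntu-24.04-12g
  strategy:
    matrix:
      shard: [1, 2]
  env:
    NODE_OPTIONS: --max-old-space-size=4096   # cap the Node heap at 4G
  steps:
    - uses: actions/checkout@v4
    - run: npx vitest run --shard=${{ matrix.shard }}/2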

What we got

225 concurrent runners across three hosts. Fixed cost. Job duration where we wanted it, on hardware we control.

---
config:
    xyChart:
        width: 760
        height: 420
        xAxis:
            labelPadding: 12
---
xychart-beta
    title "Test suite duration by configuration (minutes)"
    x-axis ["GitHub", "Baremetal", "VM untuned", "VM tuned", "Docker"]
    y-axis "Minutes" 0 --> 14
    bar [6, 3, 13, 6, 6]

Docker matches the tuned VM at the job level, and boots in seconds and re-images in seconds. Build cost (the reconciler, the proxy stack, the pyinfra deploy, the tuning service) was a couple of days with AI assistance. Sizing decisions come straight off the CI Insights queue-time-per-pool chart.

On the bill side: three servers at €599 each is €1,797/month flat. Our previous GitHub-hosted bill ran around $400/day on working days, roughly $8,800/month. The two curves cross at about 250k runner-minutes per month, which is roughly 35 minutes per runner per day at our fleet size. Past that point, the marginal CI minute is free.
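
The crossover arithmetic, for anyone re-running it with their own rates; the per-minute figure is read off the slope of the cost chart below:

# Back-of-the-envelope crossover between the flat fleet and per-minute billing.
fleet_cost_eur = 3 * 599                    # EUR 1,797/month, flat
per_minute_eur = 1850 / 250_000             # ~EUR 0.0074/runner-minute (chart slope)

crossover = fleet_cost_eur / per_minute_eur          # ~243,000 runner-minutes/month
per_runner_per_day = crossover / 225 / 30            # ~36 minutes/runner/day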

---
config:
    xyChart:
        width: 800
        height: 500
        xAxis:
            labelPadding: 8
            titlePadding: 8
        yAxis:
            labelPadding: 8
            titlePadding: 8
---
xychart-beta
    title "Monthly cost: GitHub-hosted (rising) vs self-hosted fleet (flat)"
    x-axis "Minutes/month (k)" 0 --> 1500
    y-axis "Cost (€/mo)" 0 --> 12000
    line [0, 1850, 3700, 5550, 7400, 9250, 11100]
    line [1797, 1797, 1797, 1797, 1797, 1797, 1797]

What we’d insist on next time

If you’re sizing a self-hosted CI fleet for your own team, the variable to track isn’t job duration. It’s image-refresh time, because that’s the one you’ll pay every time you change the runner. Boot time matters for the same reason. Job duration shows up in dashboards; the other two show up in your week.

Build the proxy stack on day one. Docker Hub will rate-limit a fleet of any size the second you point it at the public registry, and the same is true of npm and PyPI. Once the cache is in, you stop thinking about it, and you get incident-tolerance against upstream outages along with the rate-limit fix.

Run Docker-per-job only when the isolation trade-off is one you’re comfortable defending. Our threat model is internal teams on private repos with no secrets in the workflow; that’s why runner-group scoping is enough for us. If your fleet ever has to serve public-repo PRs, the calculus flips and you should look at Firecracker before Docker.

Size against queue time, not CPU, once the bill is fixed. For us, RAM is the bottleneck. That’s the reason for two pools and a container count capped under the swap threshold. The host provider’s default CPU throttling was the only CPU issue we ever fought, and amd_pstate=active made it go away.

The single biggest improvement we’d ask GitHub for is an event-driven “this runner has finished its in-flight work” signal. Until then, the polling state machine stays in, and the 422 with no body keeps being a feature request waiting to happen.

