Swapping a Primary Key on a 4M-Row Table: Why I Took 10 Minutes of Downtime
Dropping two columns from a 4-million-row table meant swapping the primary key on the busiest table in our system. Why we skipped the zero-downtime dance and took ten minutes of downtime on purpose.
Our CI Insights job-metrics query kept timing out. The fix was to drop two columns almost nobody used, except those two columns were part of the primary key on a table with four million rows that gets a write on every CI job we ingest. To drop them, I had to swap the primary key on the busiest table in the system.
The problem
ci_job_metrics is an aggregation table. A trigger fires on every span_job insert and folds the new job into an hourly bucket, and the dashboard reads those buckets to show a job list.
The primary key carried two dimensions that turned out to be expensive: runner_group_name and runner_labels. They sat on top of the columns that actually identify a job, so the same job appeared once per runner-group-and-label combination it had ever run on. The job list filled with duplicate rows, and the aggregator had to recompute one bucket for every (job × runner-group × label-set). A month earlier we’d shipped a dedicated runner-analytics view. That’s where those numbers belong, so carrying the same two dimensions on the Jobs view was redundant. The extra cardinality was pure cost, and the aggregation query had started timing out in production.
Removing the two columns helped the query and the product at once. Both columns were part of the primary key, though, so dropping them meant changing the primary key itself.
Deciding against the zero-downtime dance
I made the call early: swap the key in place instead of building a new table.
The standard zero-downtime recipe is a shadow table. You create a new table with the right key, backfill it, keep it in sync with triggers, then rename it into place. It works, and it never locks the original. It’s also a lot of moving parts for a table that, in our case, didn’t need to stay online.
The job-metrics view is not business critical. We could turn it off in the frontend for half an hour and no customer would notice. So I picked the boring path: take the lock and change the key, downtime and all.
Before running anything, we made the table quiet. The frontend view was disabled, so nothing was reading, and we stopped the aggregation service. The trigger that fans span_job inserts into this table is the tricky one to stop, because CI keeps sending events. So the exclusive lock did that job for us: it queued any in-flight trigger writes until the swap was done, and the migration dropped and rebuilt the trigger to match the new key. Four million rows sat still while we operated on them.
The swap itself
flowchart TD
A[Disable frontend view, no readers] --> B[Stop aggregation service]
B --> C[Migration runs in one transaction]
C --> D[Take ACCESS EXCLUSIVE lock, queue writes]
D --> E[Consolidation DELETE keeps one row per new key]
E --> F[Drop the two runner indexes]
F --> G[Swap primary key via DROP then ADD CONSTRAINT]
G --> H[Drop the two runner columns]
H --> I[Rebuild trigger for new key]
I --> J[Release lock about 10 minutes later, writes drain]
J --> K[Restart aggregation service, re-enable view]
K --> L[Admin command re-aggregates from span_job]
The migration holds an ACCESS EXCLUSIVE lock the whole time and does its work in order.
First, collapse the rows that would collide once the runner columns leave the key. Under the narrower key, many old rows map to the same tuple, so keep one per group and delete the rest:
WITH ranked AS MATERIALIZED (
SELECT ctid,
ROW_NUMBER() OVER (
PARTITION BY start_time, end_time, github_repository_id,
branch_name, pipeline_name, job_name, ci_provider
ORDER BY runner_group_name, runner_labels
) AS rn
FROM ci_job_metrics
)
DELETE FROM ci_job_metrics
WHERE ctid IN (SELECT ctid FROM ranked WHERE rn > 1);
The survivor is arbitrary (whichever runner combo happens to sort first), so its numbers are wrong for the moment. More on that below.
Then the indexes. Two of them pointed straight at the runner columns: a GIN on runner_labels and a btree on runner_group_name, both left from completion endpoints that now read the dedicated runner table. Postgres would have dropped those on its own the moment the columns went, since an index loses its whole definition when a column it depends on disappears. I dropped them by hand so the migration says what it does instead of leaning on a cascade. The partial index that drives re-aggregation, ci_job_metrics_needs_update, never touched the runner columns, so it wasn’t part of this problem. The thing that actually forced the ceremony was the primary key. Both runner columns were in it, so a plain DROP COLUMN would have taken the key down with them, and I wanted a narrower key in its place.
Now the key swap, the part where I expected to be clever and wasn’t:
ALTER TABLE ci_job_metrics DROP CONSTRAINT ci_job_metrics_pkey;
ALTER TABLE ci_job_metrics ADD CONSTRAINT ci_job_metrics_pkey PRIMARY KEY (...);
The trick people reach for is CREATE UNIQUE INDEX CONCURRENTLY, then ALTER TABLE ... ADD CONSTRAINT ... USING INDEX to promote the prebuilt index without a long lock. That path can’t run inside a transaction, and Alembic wraps each migration in one by default. I could have flipped this migration to non-transactional mode and built the index concurrently. I didn’t, because I’d already decided the table could go offline, so the concurrent dance would have bought me nothing. The plain DROP CONSTRAINT and ADD CONSTRAINT under the lock was the honest version of the choice I’d already made.
After that, drop the two columns. The last step replaces the trigger with a version whose ON CONFLICT targets the new seven-column key.
What didn’t work: stuffing the re-aggregation into the migration
Back to that arbitrary survivor. Its metric values reflect one runner combo, not the total across all of them. To fix it, we bump a counter (update_counter) on each row, which tells the aggregator to recompute that bucket from the raw span_job data.
My first version did the bump inside the migration: a full-table UPDATE on four million rows, in the same migration that holds the exclusive lock. That’s a long time to keep the busiest table in the system frozen, so I split the bump into its own migration to shrink the lock window.
That version merged, then timed out on the test database, and I pulled it back out before it ever ran in production. Four million rows is enough to blow past the statement timeout on its own, and the deploy pipeline wasn’t going to sit and wait for it.
The consolidation DELETE in the schema migration ran under the same timeout and was fine, which looks backwards until you count the writes. That DELETE only removed the few rows that collapsed under the narrower key. The bump was an unqualified UPDATE over every row, around 1.9 GB of rewritten heap plus the WAL to match, so it was the heavier statement even though it read like the simpler one.
So I cut it from the SaaS migration. Once the trigger is back on, an active bucket recomputes itself the next time a job lands in it. Thirty-day retention clears out the idle ones before anyone looks at them. For rows that needed an immediate fix, we ran the re-aggregation from an admin command after the migration, where it could go at its own pace and be retried without risk. On-premise installs, where no one is watching a console, got a separate batched migration that walks one hourly bucket per commit.
There’s an honest cost to this. When the view came back on, older idle buckets showed approximate numbers until the admin command reached them, or until thirty-day retention deleted them. For a view whose runner numbers had already moved to a dedicated screen, that was a trade I was happy to make.
Moving the rewrite out of the locked migration is the only reason the schema change could stay fast. The swap itself stayed simple. Getting the data correct again did not, and that is where most of the work actually went.
Results
The schema migration ran clean in a single shot. The table was write-blocked for about ten minutes while the lock was held, well inside the thirty-minute window we had given ourselves to keep the view dark. No one hit a broken page, because the view was dark the whole time. The re-aggregation finished afterward through the admin command, and the query that had been timing out came back fast.
The wider change this key swap belonged to deleted more than five thousand lines of code. Cutting a dimension out of an aggregation pipeline is one of the cheapest performance wins around, as long as you’re sure no one needs the dimension.
Takeaways
Separate the schema change from the data backfill. The structural part should take its lock and release it fast; the part that rewrites four million rows belongs online and afterward, where a timeout costs you a retry instead of a failed deploy.
Zero-downtime migration is a technique you reach for when the table genuinely cannot go offline. Plenty of tables can. If you can hide a non-critical view for a few minutes, the in-place key swap is simpler and gets the fix to customers sooner.
The one condition is that you control the read path. We could flip off a single dashboard view and know nothing else touched the table. On our self-hosted installs, where we do not own the deploy or the console, the same change had to avoid locking the table at all. Taking downtime is a luxury of owning the whole stack.