Two counters instead of a materialized view
Materialized views recompute from source data, but we delete ours within 30 days. Here's the two-writer, two-counter rollup we built on plain Postgres, and the race a boolean dirty flag would have quietly lost.
We keep hourly metrics for CI jobs, test runs, and self-hosted runner fleets. The raw spans those metrics are built from are gone within 30 days, and the highest-volume ones within 24 hours. So the rollup is the only thing that remembers. That single fact rules out the obvious tool for the job.
The obvious tool that doesn’t fit
A materialized view is the first thing most people reach for when they need a rollup. You write the aggregation once, Postgres caches the result, and you refresh it on a schedule. For a lot of cases that’s the right answer, and I’ll come back to where I still use one.
It falls apart for us on two points.
The first is data lifetime. A materialized view recomputes from its source every time it refreshes. We don’t keep the source around. Mergify CI Insights ingests a lot of spans, and to keep storage sane we retain only a subset and only for a short window. By the time a dashboard asks “what was the p95 queue time three weeks ago,” the spans that answer it have been deleted. There’s nothing left to refresh from. The metric has to be computed while the data still exists and then survive on its own.
The second is refresh cost. The one materialized view we do run, default_branch_tests, refreshes with REFRESH MATERIALIZED VIEW CONCURRENTLY on roughly a ten-minute cron, and even that takes real time. Running something like that over a write-hot table often enough to feel fresh isn’t a trade I want to make.
So we built our own rollups. The shape is the same across every metrics table we have, and it comes down to two writers and two counters.
The obvious question is why not reach for something that already does incremental rollups, like TimescaleDB continuous aggregates. We didn’t, mostly because we wanted to start with the simplest thing and stay on native Postgres for as long as it holds. We also ship to self-hosted customers, so every extension we depend on is one more thing that has to exist and stay in sync on installs we don’t control. If you’re already running Timescale, lean on continuous aggregates and skip most of this.
Two writers per table
Every metrics table has a synchronous writer and an asynchronous one.
The synchronous writer fires on every source row. For runner_metrics it’s an application-level upsert called inline from the span processing path. The first span to land in an (owner, runner, labels, hour) bucket seeds the entire row, then every later span in that bucket only bumps a counter:
postgresql.insert(RunnerMetrics)
.values(
total_runs=1,
median_duration_seconds=span.duration,
p95_duration_seconds=span.duration,
p99_duration_seconds=span.duration,
# ... every stat is just this one span's value ...
update_counter=1,
computed_counter=1,
)
.on_conflict_do_update(
index_elements=RunnerMetrics.__table__.primary_key,
set_={"update_counter": RunnerMetrics.update_counter + 1},
)
The detail I like here is that a one-span bucket is already correct. The median, the p95 and the p99 of a single value are all just that value, so a bucket that only ever sees one span is finished the moment it’s inserted, with update_counter and computed_counter both at 1. No background work will ever touch it. That matters more than it looks, because a large fraction of buckets only ever see one or two spans.
The asynchronous writer is a periodic service that wakes up every 60 seconds, finds the buckets that have changed since they were last computed, and recomputes them in a batch:
select(*primary_key_columns, RunnerMetrics.update_counter)
.where(RunnerMetrics.update_counter > RunnerMetrics.computed_counter)
.limit(batch_size)
For each dirty bucket it re-aggregates the surviving spans and writes the real distribution back, using percentile_cont for the percentiles. Recomputing on a schedule instead of on every insert is the whole point. Under our ingest rate, updating a hot bucket row on every single span would be a lot of wasted writes and lock contention on rows that are about to change again anyway. Batching collapses a hundred inserts into one recompute.
The risk that buys is backlog. If ingestion outruns the service, dirty buckets pile up and the numbers a dashboard shows lag the truth by however long the queue takes to drain. We bound the work per pass with a batch size and run the service every 60 seconds, so the backlog stays short. Still, this is an eventually-consistent system, and under load the percentiles can sit a little stale. For an engineer looking at how their fleet did yesterday, a minute of lag is nothing. If you needed this to drive a live alarm, you’d want something else.
Why two counters and not a boolean
The interesting part is the dirty flag. It would be tempting to use a needs_recompute boolean: set it on write, clear it after recompute. That has a race.
Picture a bucket marked dirty. The async service starts recomputing it, reading the spans as they are right now. While that recompute is in flight, a new span lands in the same bucket. If the service clears a boolean when it finishes, it just erased the dirty mark for a write it never saw. The bucket now looks clean but is missing a span, and nothing will fix it until the next span happens to arrive, which for a quiet bucket might be never.
Two counters close the race. The writer increments update_counter. The service reads the current update_counter value when it selects the bucket, recomputes, and then sets computed_counter to that snapshot, not to “whatever it is now.” If a span landed mid-recompute, it pushed update_counter past the snapshot, so update_counter > computed_counter is still true and the bucket gets picked up again on the next pass. It’s the same trick a version number or an optimistic lock gives you: compare snapshots, never trust “now.”
A partial index on the dirty condition keeps the selection cheap, so the service isn’t scanning the whole table every minute looking for the handful of rows that changed.
One more thing the counters quietly rely on: the service runs as a single instance, sharded by org, so two workers never recompute the same bucket at once. The only race left is a write landing during a recompute, which is exactly what comparing snapshots handles.
Percentiles are the expensive part
It took me a while to internalize this. Most of these stats don’t need a background recompute at all.
Count, sum, min, max: all cheap to merge. You can maintain them incrementally as each span arrives, because folding a new value into a running aggregate is trivial. new_total = old_total + duration. new_max = greatest(old_max, duration). A trigger does that inline and the column is correct on every insert. Incremental min and max have one weakness, deletion, since you can’t walk a max backwards when its row leaves the table. We sidestep that because spans are only deleted at retention, long after the bucket has stopped changing, so nothing ever recomputes a max against rows that are gone.
Percentiles don’t fold like that. You cannot merge two p95s into a combined p95 without the underlying values. There’s no running accumulator for percentile_cont. To know the p95 of a bucket you have to look at every span in it. We recompute them exactly rather than reaching for an approximate estimator like a t-digest, because we keep the raw spans for the retention window anyway, so the real values are right there to aggregate.
Which field gets which treatment comes down to the cost of updating it. The simplest version, and where several of our tables still sit, lets the batched service recompute the whole row, cheap fields and percentiles together. The refinement is to move the cheap additive fields into the trigger, where they update instantly and stay correct on every insert, and leave only the expensive fields for the batched pass. We’ve started rolling that out table by table. On ci_job_metrics the trigger now keeps the counts, sums and min/max current inline, and the service recomputes just the six percentile columns. The other tables still do a full recompute, which is fine until their write volume makes the repeated work worth removing.
That refined ci_job_metrics flow, cheap fields inline and percentiles batched, looks like this:
flowchart TD
S["span_job insert"] --> T{"ci_job_metrics bucket exists?"}
T -- "no" --> Seed["seed the bucket from this one span<br/>update_counter = computed_counter = 1"]
T -- "yes" --> Inc["trigger updates cheap fields inline<br/>(total_runs += 1, min/max via LEAST/GREATEST)<br/>update_counter += 1"]
Seed --> Row[("metrics row")]
Inc --> Row
Row -. "every 60s" .-> Svc["batched service picks buckets where<br/>update_counter > computed_counter"]
Svc --> Re["recompute percentiles with percentile_cont<br/>over the surviving spans"]
Re --> Wb["write percentiles back<br/>computed_counter = snapshot"]
Wb --> Row
When a table has no expensive fields at all, the batched half disappears completely. test_metrics_summary computes everything in its trigger and has no service and no computed_counter. It’s fully synchronous because nothing it stores requires looking at the whole bucket.
The same shape, with deliberate variations
Once you see the pattern it shows up everywhere in our schema, and the variations are the interesting part:
| Table | Source | Sync writer | Async recompute |
|---|---|---|---|
ci_job_metrics | span_job | DB trigger, maintains all cheap aggregates | percentiles only |
runner_metrics | span_job | app-level upsert, seed + bump | full recompute |
ci_pipeline_metrics | span_pipeline | DB trigger, seed + bump | full recompute |
test_metrics | span_test via a staging table | trigger on staging, seed + bump | full recompute |
test_metrics_summary | span_test via a staging table | trigger, computes everything | none |
Two choices in there are worth calling out.
runner_metrics uses an application-level upsert instead of a database trigger, because its spans arrive from several processing paths (OpenTelemetry traces, Buildkite, GitHub jobs) and it was cleaner to ingest from one place in code than to attach the same logic to every insert site. The other job and pipeline tables sit behind a single insert path, so a trigger fits them better.
Tests route through a span_test_rollup_staging table before they reach their metrics, while jobs write to their metrics directly. The reason is pure volume: there are far more tests than jobs, so tests get a short-lived staging rollup to absorb the firehose, and that staging table carries the aggressive 24-hour retention. The short retention isn’t an accident. It exists to give the metrics service enough time to compute the rollups without paying to store every raw test span for 30 days.
There’s also a cardinality guard worth a mention. Runner labels can contain garbage (per-run IDs people inject into labels), so the aggregation filters labels down to an allowed set, and a per-owner cap trigger limits how many distinct label-sets a single owner can create. Without that, one customer’s weird labeling scheme could blow up the size of everyone’s rollup table.
What bit me
The label-set is part of the primary key of these buckets, and for a while it was computed in two different languages. Python built the key at ingestion with sorted(). SQL rebuilt it at recompute time with ARRAY_AGG. Those two don’t order strings the same way, so the keys disagreed, and the recompute ended up looking for a bucket that ingestion had stored under a slightly different key.
The fix was to force the SQL side to match Python’s byte ordering with ARRAY_AGG(... COLLATE "C"). It works, but it’s a band-aid over the real mistake, which was computing the same logical key in two places at all. Any time the same value is derived twice in different languages, you are one collation rule away from a silent mismatch. There’s a follow-up in flight to compute the key only in SQL and delete the Python version. Build the key once.
Where I’d still use a materialized view
I want to be honest about this, because the post would be dishonest as pure materialized-view bashing. We have exactly one materialized view left, default_branch_tests, and there’s a stack of PRs in flight to remove it and replace it with the synchronous test_metrics_summary table. So in practice I’m deleting my last one, not defending it.
But the reason I’d reach for a view is simple, and it’s the same rule that decides the whole design: frequency of writes on the source. A view refreshed on a cron is fine when its source changes slowly and a few minutes of staleness is acceptable. The two-writer pattern earns its complexity only when writes are frequent enough that recomputing on every one of them hurts. If your source is quiet, keep it simple: compute everything synchronously, or refresh a view on a timer, and don’t build any of this.
When to build this
So the decision comes down to one number: how often the source gets written. A high write rate pushes you toward deferring the expensive aggregation and batching it; a quiet source lets you stay synchronous or lean on a materialized view.
And when you do split it, split it along the right seam. The cheap aggregates can be maintained inline and stay correct on every insert. Only the things you can’t merge incrementally, percentiles in our case, need to run in the background. Two counters instead of a boolean keep that background pass honest under concurrent writes, which is the part that’s easy to get subtly wrong and never notice until a quiet bucket shows the wrong number.