ON CONFLICT DO UPDATE Is Rewriting Rows You Never Changed
A bare INSERT ... ON CONFLICT DO UPDATE rewrites the whole row even when nothing changed. Production numbers showing what that costs, the one-line WHERE clause that stops it, and the 21 call sites where we had to leave the write alone.
A bare INSERT ... ON CONFLICT DO UPDATE rewrites the entire row even when every column already holds the value you’re writing. We learned how expensive that is the hard way, during an incident years ago. Then last month, a one-line pull request reminded me.
The SQL fix is old news to anyone who has lived in Postgres for a while. The part worth your time is the audit it kicked off: 32 upserts in our codebase, and the judgment call about which 21 of them I had to leave alone.
The review that started it
A colleague’s PR flipped an ON CONFLICT DO NOTHING into an ON CONFLICT DO UPDATE. Reasonable change in isolation: the old code dropped updates on the floor, the new code would keep the row fresh. Except this particular insert sits on one of the hottest paths we have. Every time we cache a GitHub resource, we touch the owning account:
flowchart LR
E["every GitHub resource we cache<br/>(PRs, check runs, repos, ...)"] --> U["INSERT ... ON CONFLICT<br/>upsert of the owner"]
U --> A[("github_account<br/>221k rows that<br/>almost never change")]
So flipping that one statement to DO UPDATE meant every cache write across the system would now issue an UPDATE on github_account, a table whose rows almost never actually change.
I’d seen this movie before.
The incident
A few years back we had a DO UPDATE with no guard on a hot table, and it caused a real outage. The failure mode wasn’t disk or replication lag, it was lock contention. Every UPDATE takes a row lock, so when the same hot rows get rewritten constantly, writers serialize behind whoever holds the lock:
sequenceDiagram
participant W1 as writer 1
participant W2 as writer 2
participant W3 as writer 3
participant Row as hot row
W1->>Row: UPDATE (takes row lock)
W2--xRow: UPDATE (blocked)
W3--xRow: UPDATE (blocked)
Note over W2,W3: waiting on the lock,<br/>holding their own locks meanwhile
W1->>Row: COMMIT (releases lock)
W2->>Row: UPDATE (takes row lock)
That backpressure spread to every other insert and update sharing those rows. The whole write path slowed to a crawl.
The fix at the time was blunt: downgrade the statement to DO NOTHING. That killed the contention, but DO NOTHING skips the write even when the row really changed, so we leaned on a separate reconciliation service to catch real updates later. It worked, but it traded one problem for another.
So we’ve lived through both ends of this: bare DO UPDATE rewrites the row on every touch, DO NOTHING never writes the change at all. The version I actually wanted writes only when a column moved, and Postgres has had it the whole time. It wouldn’t have fixed that outage on its own, since the lock is taken either way, but it kills the write amplification that made every collision so expensive.
A no-op update is not free
The mental trap is thinking an UPDATE that sets a column to the value it already holds is cheap, or even a no-op. It isn’t. Postgres is MVCC, so an UPDATE never edits in place. It writes a brand-new tuple, marks the old one dead, and leaves the dead version for autovacuum to clean up later. Every one of those rewrites costs:
- a new tuple plus a dead tuple, so autovacuum has more to chase
- a WAL record, which also ships down your replication stream
- a fresh entry in every index on the table, unless the update qualifies as HOT (heap-only tuple), which needs free space on the same page and no indexed column changing value
- a row lock on that row, held until the transaction commits, so anyone else writing the same row serializes behind it
The first two costs are unconditional: tuple churn and WAL (and the autovacuum and replication that follow) land on every no-op write, every time. The lock cost is the conditional one. It only bites when writers actually collide on the same hot row, and locks release at commit, not at statement end, so how much it hurts depends on your concurrency and your transaction length. For data that changes on every write, you pay all of this and you get something back. For data that almost never changes, you pay it for nothing.
And autovacuum cleaning up the dead tuple doesn’t hand the disk back. It marks the space as reusable inside the table, but the file on disk essentially stays as big as the bloat ever made it. Returning that space to the OS means rewriting the whole table: VACUUM FULL, which takes an exclusive lock you can’t afford on a hot table, or pg_repack to do it online. So every no-op write quietly raises the floor on your table size and pulls your next pg_repack run forward.
Postgres can be told to skip the write when nothing changed, with a guard on the DO UPDATE (that’s the one-line fix, coming below). Here is what the two paths cost when the incoming row is identical to the one already stored:
flowchart TB
A["INSERT ... ON CONFLICT DO UPDATE<br/>incoming row == stored row"] --> B{"guard on the<br/>DO UPDATE?"}
B -->|"bare DO UPDATE"| C["UPDATE runs anyway"]
C --> C1["new tuple written, old marked dead"]
C1 --> C2["WAL record + replication bytes"]
C2 --> C3["every index re-pointed (non-HOT)"]
C3 --> C4["row lock held until commit"]
C4 --> CX(["paid in full, nothing changed"])
B -->|"guarded DO UPDATE"| D["guard sees nothing<br/>changed, skips the write"]
D --> E(["no new tuple, no index writes,<br/>only a tiny row-lock WAL record<br/>(row still locked until commit)"])
The numbers
I pulled pg_stat_user_tables from our production engine database. These counters are cumulative since the last stats reset in December 2024, so they’re about 18 months of activity, almost all of it before this fix shipped. A couple of tables stand out.
| Table | Live rows | Updates | Updates per insert | HOT updates | Table size (heap / indexes) |
|---|---|---|---|---|---|
github_check_run | 206M | 427,048,883 | 1.0Ă— | 38.7% | 395 GB (243 GB / 113 GB) |
github_pull_request | 1.23M | 214,554,123 | 28Ă— | 0.0% | 7.3 GB (2.4 GB / 4.9 GB) |
github_repository | 466k | 46,792,486 | 100Ă— | 0.0% | 266 MB (60 MB / 206 MB) |
github_account | 221k | 1,718,269 | 8.3Ă— | 0.02% | 45 MB (23 MB / 21 MB) |
The ratio to look at is updates per existing row:
xychart-beta
title "Cumulative UPDATEs per existing row (~18 months)"
x-axis ["check_run", "pull_request", "repository", "account"]
y-axis "UPDATEs per row" 0 --> 180
bar [2, 174, 100, 8]
pull_request and check_run change for real as a PR moves through its life, so high churn there is expected. repository and account do not. A repo getting rewritten 100 times over is the part that should make you wince.
Look at github_repository. There are 466k repositories in the table, and we ran 46.8 million updates against them. That’s roughly 100 updates for every row that exists, on data (a repo’s name, owner, default branch) that barely moves. None of those updates were HOT, so every one of them rewrote index entries on top of the row itself. Of that 266 MB table, 206 MB is indexes, and we were churning them for writes that mostly changed nothing.
github_check_run is the heaviest table we have at 395 GB, with 113 GB of indexes and 427 million updates. I can’t claim those were no-ops (check runs change status as CI progresses), so I’m not using it as proof of waste. The scale is the point: even at 38.7% HOT, about 262 million of those updates were non-HOT, each one rewriting index entries across 113 GB of indexes. When a single table is 395 GB, every redundant rewrite you avoid is real disk and real replication traffic you don’t spend. github_account, the table from the code review, took 1.7 million updates with 396 HOT updates total, on rows that “almost never change.”
You might be tempted to reach for fillfactor here. Lowering it leaves slack on each page so more updates can stay HOT and skip the index churn. It helps, but it doesn’t make a no-op write free: you still allocate a new tuple, write a WAL record, and ship it down replication, all to land on the value the row already held. Fillfactor trades index churn for page bloat. Not writing at all trades nothing for nothing. (All these tables run the default fillfactor of 100, for what it’s worth.)
One honest caveat: pg_stat_user_tables counts updates that executed. It can’t tell me what fraction were no-ops, because once you suppress them they simply stop being counted. So the claim isn’t “we measured X% waste.” The claim is narrower and, I think, harder to argue with: a table this static has no business absorbing update volume this high. And to be straight about it, this whole analysis comes from counters and Postgres mechanics. I never ran a before/after latency benchmark. I’m confident in the mechanism, but I won’t quote you a saved-milliseconds figure I don’t have.
The one-line fix
Postgres lets you put a WHERE clause on the DO UPDATE action. If the predicate is false, the row is left exactly as it is: no new tuple, no index churn, and nothing in the WAL beyond the tiny record the row lock writes. The predicate you want is “at least one column actually differs.”
One thing the guard does not save you: the lock. The docs are explicit that “all rows will be locked when the ON CONFLICT DO UPDATE action is taken,” so Postgres still takes the row lock even when the WHERE skips the write. The guard cuts write amplification. It does nothing for the lock itself. On a near-static table the lock is brief and barely contended, so the wasted writes were the cost worth killing. If your actual bottleneck is hot-row lock contention, the guard alone won’t save you.
I wrapped our SQLAlchemy upsert helper so this is the default behavior instead of something you have to remember:
def on_conflict_do_update_if_changed(
insert_stmt: postgresql.Insert,
*,
set_: dict[str, typing.Any],
**kwargs: typing.Any,
) -> postgresql.Insert:
changed = sqlalchemy.or_(
*(
insert_stmt.table.c[column].is_distinct_from(
insert_stmt.excluded[column],
)
for column in set_
),
)
return insert_stmt.on_conflict_do_update(set_=set_, where=changed, **kwargs)
It ORs <col> IS DISTINCT FROM EXCLUDED.<col> over every column in the set_. If they all match, the OR is false and Postgres skips the write. One assumption baked in: set_ maps each column to its EXCLUDED value, the usual pass-through upsert. Computed updates like counter increments don’t belong in this wrapper anyway.
Note the IS DISTINCT FROM and not <>. A plain <> comparison returns NULL when either side is NULL, which Postgres treats as not-true, so a transition from NULL to a real value (or back) would be seen as “no change” and silently dropped. IS DISTINCT FROM is the NULL-safe version: it treats NULL as just another value and returns a clean boolean. This is the kind of detail that’s easy to get wrong and never notice until a column stops updating. I had an AI agent flag the choice while I was writing this and ask which semantics I wanted. I wanted the correct one.
The hard part: where you must not do this
The wrapper took ten minutes. The actual work was auditing all 32 places we issue an ON CONFLICT DO UPDATE and deciding, one by one, whether suppressing the no-op write was safe. I applied the guard to 11 of them. The other 21 I left bare, on purpose.
This is the question I ran each call site through:
flowchart TD
S["An ON CONFLICT DO UPDATE call site"] --> Q1{"new value changes the row<br/>by construction?<br/>(counter ++, array concat)"}
Q1 -->|Yes| B1["leave bare<br/>guard would never fire"]
Q1 -->|No| Q2{"a BEFORE UPDATE trigger bumps<br/>a timestamp that retention or<br/>a refresh job reads?"}
Q2 -->|Yes| B2["leave bare<br/>suppressing it breaks retention"]
Q2 -->|No| Q3{"caller reads RETURNING<br/>from the write?"}
Q3 -->|Yes| B3["leave bare<br/>suppressed UPDATE returns no row"]
Q3 -->|No| Q4{"the write is a deliberate<br/>freshness / cached_at bump?"}
Q4 -->|Yes| B4["leave bare<br/>the freshness signal must move"]
Q4 -->|No| G["guard it<br/>IS DISTINCT FROM EXCLUDED"]
classDef bare fill:#fde2e2,stroke:#c0392b,color:#000;
classDef guard fill:#e2f0d9,stroke:#27ae60,color:#000;
class B1,B2,B3,B4 bare;
class G guard;
Four ways to land on “leave it bare,” one way to land on “guard it.” The bare bucket won, 21 sites to 11.
The update changes the row by construction. Counter increments (update_counter = update_counter + 1), array concatenations, anything where the new value is derived from the old. The guard would never fire here, so adding it is pure noise.
A trigger depends on the write happening. This is the trap. Several of our tables inherit a base model with a BEFORE UPDATE trigger that bumps mergify_data_updated_at on every UPDATE. If you suppress the UPDATE, you suppress that timestamp bump, and two systems downstream stop seeing the row:
flowchart LR
U["UPDATE executes"] --> T["BEFORE UPDATE trigger bumps<br/>mergify_data_updated_at"]
T --> R1["background refresh job<br/>picks rows by timestamp"]
T --> R2["retention sweep<br/>deletes rows with<br/>old timestamps"]
G["guard suppresses<br/>the no-op UPDATE"] -.->|"trigger never fires"| T
Skip the no-op write and you quietly break retention, with no error to tell you. Not every table on that base class is affected, though. github_pull_request and github_commit_status inherit the same trigger, but nothing reads their mergify_data_updated_at, so they keep the optimization. The exception is per-table, not per-base-class, which means you have to actually trace each one.
Something reads .returning(...). A suppressed UPDATE returns no row. If the calling code consumes RETURNING to tell whether it just created or updated, or to emit a diff event, then suppressing the write turns a real update into an empty result, and the caller’s logic falls apart. Pause/resume on queues, a few config tables, the product-activation API: all left bare.
The write is a deliberate freshness bump. Cache tables whose whole point is a cached_at timestamp that must move on every refresh, even when the cached payload is byte-identical. Suppressing those defeats the cache’s freshness signal.
I found the trigger dependency the way I find a lot of this kind of coupling now: I asked an AI agent to produce a report of every table with the no-op-upsert pattern and every code path that reads mergify_data_updated_at, then made the call myself per site. An audit of invisible coupling across 32 call sites is exactly the kind of enumeration an agent is good at. The judgment about what to do with each one stayed with me.
The principle
Match the guard to how the data behaves.
The guard pays where update volume is high and real mutation is low. That’s github_repository and github_account: touched constantly, changed almost never. For effectively-immutable data like that, a bare DO UPDATE on every write is the wrong default. A guarded update, or DO NOTHING plus reconciliation for the rare real change, is right.
The guard has a cost, though it’s a small one. Postgres evaluates the WHERE last, after the conflicting row is already located and locked, so it’s a per-column comparison on a single row, not a scan. The one place it stops being cheap is wide TOASTed columns: IS DISTINCT FROM on a big JSONB or text value has to detoast and compare bytes. And on an insert-heavy table like github_commit_status (374 million inserts against 16 million updates, 61% of them HOT), there’s almost no no-op churn to skip in the first place. We guarded it anyway since the comparison costs next to nothing, but the static tables are where the guard earns its keep.
And on the 21 sites I left bare, the reason was never performance, it was a side effect something else depends on. The shape of the write should follow the shape of the data, and sometimes the data needs the write to happen.
What to check in your own database
If you run high-volume upserts on Postgres, go look at n_tup_upd versus n_tup_ins in pg_stat_user_tables, and at your HOT-update ratio. A table with far more updates than inserts and a low HOT percentage, on data you know is fairly static, is burning tuples, WAL, index writes, and replication bytes for nothing. The fix is a WHERE clause with IS DISTINCT FROM EXCLUDED.col, and the real work is being honest about the handful of places where the write has a side effect you need.
We’ve still got one cleanup left: those exceptions tied to mergify_data_updated_at are coupling we’d rather not have. Once we stop leaning on that trigger, those tables can take the guard too. Those 21 exceptions are a map of the coupling we still have to pay down.