Julien Danjou

May 7, 2025

3 min read

CI Failures Don’t Just Break Builds: They Break Focus


CI always breaks at the worst time — slowing fixes, killing flow, and leaving teams rerunning jobs by ritual. We built CI Insights to show which jobs flake, where time is lost, and why your pipeline hurts. Real visibility, zero YAML edits.

A few weeks ago, something broke in production.

No big deal — we've all been there. I did what I've done hundreds of times before:

Checked logs, wrote a patch, pushed to a new branch, and waited for CI.

Except CI failed. Not because of my change, but because someone had merged a PR 20 minutes earlier that broke main.

Now I'm stuck. The fix is ready, but I can't ship it without bypassing the tests or waiting for someone else to fix what they broke. From a clean engineering process point of view, this is a disaster.

So I face the same choice many of us do in this situation:

Force-merge the fix and cross my fingers, or hold off on fixing prod.

I clicked merge. Crossed my fingers. And hoped that while I fixed one problem, I didn't just create two more.

CI doesn't just fail. It fails at the worst possible moment.

You can build a beautiful, automated pipeline. You can configure your cache keys, lint your YAML, and badge your build.

But none of that helps you when:

  • A flaky test randomly fails a PR at 6:30 p.m.

  • GitHub Actions hits a Docker pull rate limit again

  • A CI step you don't control starts failing globally

And here's the kicker: most of us rerun the job, get a green check, and move on. No root cause. No visibility. No long-term fix.

We're debugging by ritual.

The real cost isn't money — it's momentum.

When CI breaks, you don't lose dollars.

You lose flow.

You're in the middle of a fix. Or a feature. Or just trying to finish the last ticket of the sprint. CI breaks, and now you're digging through logs from a job you didn't write for a failure you didn't cause.

You rerun the job. Still red.

Rerun again. It's green. You merge, slightly less confident than before.

Your 20-minute bugfix becomes a 90-minute support fire. And you're left wondering:

Did I actually fix the bug? Or did I just push the pain to Future Me?

Why do we tolerate this?

Because CI is treated like plumbing.

If it works, you don't think about it. If it doesn't, you patch, rerun, and carry on.

Nobody "owns" CI quality. Nobody tracks flake rates in a dashboard.

Nobody budgets time for fixing transient failures. And so it creeps.

What was a one-off rerun last week becomes standard practice. What was a reliable build becomes a minefield of red crosses and Slack pings.

So we started building CI Insights.

We didn't want to replace CI. But we wanted answers:

  • What jobs are failing the most?

  • Are they flaky? Or actually broken?

  • Which tests slow us down the most?

  • What's our actual lead time from PR to prod?

  • Why are reruns our default fix?

So we built something that watches your CI without changing it.

No YAML edits. No instrumentation. Real-time observability and answers.

CI Insights tells us:

  • Which jobs flake

  • Which jobs got slower this month

  • How much time our team spends rerunning things

  • How much merge delay CI is actually causing

  • What our deployment frequency looks like (DORA-style)

It's like going from staring at raw logs… to having a dashboard that tells you why your team is grumpy.

What you can do today

Even without CI Insights, you can start spotting CI drift:

  • Track retry rates. If your reruns are increasing, something’s decaying.

  • Monitor merge delay. If jobs block PRs more than they fail, you have friction.

  • Surface flakes. Build a script that scans past failed-then-passed jobs.

  • Watch the slow creep. Job duration going up over time? That’s silent tech debt.
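The "surface flakes" item is the easiest to start with. Here's a minimal sketch of what such a script could look like against the GitHub REST API, using the `run_attempt` and `conclusion` fields it returns for workflow runs: a run that only went green on a second (or later) attempt is a likely flake. The repo name is a placeholder, and the token is assumed to live in a `GITHUB_TOKEN` environment variable.

```python
# Sketch: flag workflow runs that failed, were rerun, and then passed.
# Assumes a GitHub token in GITHUB_TOKEN; "your-org/your-repo" is a placeholder.
import json
import os
import urllib.request


def find_suspected_flakes(runs):
    """Runs that only succeeded after a rerun are likely flaky.

    GitHub's workflow-run objects carry `run_attempt` (1 for the first try)
    and `conclusion` ("success", "failure", ...). A successful run with
    run_attempt > 1 failed at least once before going green.
    """
    return [
        r for r in runs
        if r.get("conclusion") == "success" and r.get("run_attempt", 1) > 1
    ]


def fetch_recent_runs(repo, token, per_page=100):
    """Fetch the most recent workflow runs for a repository."""
    url = f"https://api.github.com/repos/{repo}/actions/runs?per_page={per_page}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["workflow_runs"]


if __name__ == "__main__":
    runs = fetch_recent_runs("your-org/your-repo", os.environ["GITHUB_TOKEN"])
    for run in find_suspected_flakes(runs):
        print(f"{run['name']}: attempt {run['run_attempt']} finally passed")
```

Run it on a schedule and graph the count over time: a rising line is your retry rate decaying, before anyone has to feel it.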

CI isn't a tool. It's a mirror.

And when it breaks, it reflects the messiest part of our engineering process: the stuff we patch instead of fix.

Want to see your own CI like this?

We're opening up the CI Insights beta for GitHub Actions users. It's free while in beta and built for engineers like us who just want their pipelines to work — or at least make sense when they don't.

We'll show you:

  • The flakiest jobs in your org

  • Job cost and rerun trends

  • Real-time CI status across repos

  • Slack alerts and auto-retries for the noisy stuff

Try it out!



Curious where your CI is slowing you down?

Try CI Insights — observability for CI teams.
