CI Auto-Retry

Auto-retry exists because CI is noisy. Network blips, runner pool churn, a test that hits a race condition once in a hundred runs. Retrying absorbs the noise so engineers do not chase it. Done well, it is a productivity multiplier. Done bluntly, it is the reason your suite has been quietly broken for six months.

By Julien Danjou, Co-founder & CEO of Mergify

In one paragraph

Auto-retry re-runs a failed CI job, step, or test so transient noise does not look like a real failure. The trade-off is that it can also mask a real flake or a slow regression. The fix is to retry conditionally (on known-transient signals), bound the retry budget (one or two attempts), and track the retry rate as a metric, not as exhaust.

Where auto-retry lives

Retries can happen at three levels, each catching a different class of failure:

  • Test level: inside the test framework. A flaky test that hits a race condition gets re-run by pytest-retry, jest.retryTimes, or the rspec-retry gem. Useful when the flake is in the test code itself.
  • Job level: the CI runner re-executes the whole job after a failure. Useful when the failure was the runner dying mid-suite, an image-pull failure, or a network timeout in setup.
  • Workflow level: the higher orchestration layer reruns a whole pipeline. Useful when something failed before tests even started (clone failed, secret rotation glitch, registry hiccup).

The three layers protect against different failure modes. Framework retries cannot rerun a job that died. Job retries cannot rerun a single failing test inside an otherwise-passing job. Use whichever level matches the failure.
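As a rough illustration of what a test-level retry does, here is a minimal Python sketch. The `retry_test` decorator and its `retries` and `only_on` parameters are hypothetical names made up for this example; they are not the API of pytest-retry, Jest, or rspec-retry, which each have their own configuration.

```python
import functools


def retry_test(retries: int = 1, only_on: tuple = (TimeoutError,)):
    """Hypothetical helper: re-run a test up to `retries` extra times,
    but only when it raises one of the exception types in `only_on`."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return test_fn(*args, **kwargs)
                except only_on:
                    if attempt == retries:
                        raise  # budget spent: report the failure
        return wrapper
    return decorator
```

Because the wrapper lives inside the test process, it can only help when the process itself survives. If the runner dies mid-suite, nothing here runs, which is the gap job-level retry covers.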

Conditional retry beats blanket retry

The biggest lever in any auto-retry policy is conditionality. A blanket retry-on-any-failure rule reruns real failures along with the transient ones, and a test that was about to fail for a real reason gets a second chance to pass on luck. The right rule is narrower:

  • Retry when the exit code matches a known-transient set (network errors, runner errors).
  • Retry when the log contains a specific known-flaky pattern ("image pull backoff", "context deadline exceeded").
  • Retry tests explicitly marked as flaky in a quarantine list, not the whole suite.
  • Cap the retry budget to one or two attempts per job.
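The job-level rules above could be sketched in Python roughly as follows. This is an illustration, not Mergify's rule engine: `JobResult`, `should_retry`, the exit codes, and the budget value are all assumptions chosen for the example, and the real set of transient signals should come from your own CI logs.

```python
import re
from dataclasses import dataclass

# Placeholder values (assumptions): replace with signals seen in your own CI.
TRANSIENT_EXIT_CODES = {124, 137}  # e.g. a timeout(1) expiry, a SIGKILLed process
TRANSIENT_LOG_PATTERNS = [
    re.compile(r"image pull backoff", re.IGNORECASE),
    re.compile(r"context deadline exceeded", re.IGNORECASE),
]
MAX_RETRIES = 2  # retry budget per job


@dataclass
class JobResult:
    exit_code: int
    log: str
    retries_so_far: int = 0


def should_retry(result: JobResult) -> bool:
    """Return True only for failures that match a known-transient signal."""
    if result.exit_code == 0:
        return False  # job passed, nothing to retry
    if result.retries_so_far >= MAX_RETRIES:
        return False  # budget exhausted: surface the failure
    if result.exit_code in TRANSIENT_EXIT_CODES:
        return True
    return any(p.search(result.log) for p in TRANSIENT_LOG_PATTERNS)
```

The default answer is "do not retry": a failure with an unrecognized exit code and no matching log pattern is reported as is, so a genuine assertion failure never gets a second chance to pass on luck.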

Mergify CI Insights ships rule-based auto-retry that fits this shape: conditions written in the same YAML rule engine as the rest of the queue, a full event log so the retry history is debuggable, and per-job retry stats so the policy can be tuned over time.

Track retries as a metric

The single most important thing you can do with auto-retry is measure it. Retry rate is a leading indicator of suite health. If 2% of jobs are passing on retry, the suite is fine. If 30% of jobs are passing on retry, the suite is broken and the retry is hiding it.

Watch three numbers in particular:

  • Percent of jobs that needed a retry to pass.
  • Top tests by retry count, week over week.
  • Retries per merged PR (and the trend).

When a specific test shows up at the top of the retry list, it is no longer a transient issue. It is a flaky test, and it belongs in quarantine or in the fix pile.
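A minimal sketch of how those three numbers might be computed from job records, assuming you can export one record per job run. The `JobRun` fields and the `retry_metrics` function are hypothetical names for this example; no specific CI provider's export format is implied.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class JobRun:
    pr: int                  # PR number the job ran for
    passed: bool             # final outcome after any retries
    retries: int = 0         # number of re-runs the job needed
    retried_tests: list = field(default_factory=list)  # tests that triggered a retry


def retry_metrics(runs: list, merged_prs: set, top_n: int = 3) -> dict:
    """Compute the three retry numbers for one time window of job runs."""
    passed_on_retry = sum(1 for r in runs if r.passed and r.retries > 0)
    test_counts = Counter(t for r in runs for t in r.retried_tests)
    retries_on_merged = sum(r.retries for r in runs if r.pr in merged_prs)
    return {
        "pct_jobs_passed_on_retry": 100 * passed_on_retry / len(runs) if runs else 0.0,
        "top_retried_tests": test_counts.most_common(top_n),
        "retries_per_merged_pr": retries_on_merged / len(merged_prs) if merged_prs else 0.0,
    }
```

Running this once per week over that week's runs and comparing the results gives the week-over-week trend; a test that keeps appearing in `top_retried_tests` is the candidate for quarantine.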

FAQ

What is CI auto-retry?

Auto-retry is the practice of automatically re-running a failed CI job or test without engineer intervention. The retry can be at the workflow level (the whole pipeline runs again), the job level (a single job is re-executed), or the test level (a single test reruns). The goal is to absorb transient failures (network blips, runner pool issues, flaky tests) so engineers spend their time on real bugs.

Is auto-retry the same as test retry?

Test retry is one form of it. A failing test gets re-run, often up to two or three times, before being marked as failing for real. Job retry is the broader version: the whole job is re-executed when a transient infrastructure error caused the failure. Both are auto-retry. They solve different parts of the same problem.

What is the risk of auto-retry?

It masks signal. A test that flakes 20% of the time will pass under a three-retry policy about 99.8% of the time (assuming each attempt fails independently), which makes it look fine even though it is still flaky. The result is a slow accumulation of unreliable tests that you only notice once a retry budget runs out or a real regression sneaks through. The mitigation is to track retry rates and treat them as a metric, not as exhaust.
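The arithmetic behind the 20% example can be checked with a few lines of Python. The sketch assumes each attempt fails independently with the same probability, which real flakes do not always obey; `pass_probability` is a name chosen for this example.

```python
def pass_probability(flake_rate: float, retries: int) -> float:
    """Chance a flaky test ends up green, assuming independent attempts."""
    attempts = retries + 1  # the original run plus each retry
    return 1 - flake_rate ** attempts


for retries in range(4):
    print(f"{retries} retries: {pass_probability(0.2, retries):.2%}")
```

Under that assumption the test is green 80.00% of the time with no retries, 96.00% with one, 99.20% with two, and 99.84% with three. The flake rate has not changed; only its visibility has.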

How many retries are safe?

Most teams settle on one or two retries for test-level flakes and one retry for job-level infrastructure errors. Past that, the retry is no longer absorbing transient noise. It is hiding a problem.

Should I retry on every failure or only specific ones?

Only specific ones. A blanket retry policy retries real failures along with the transient ones. The right pattern is to retry on signals you can identify: a specific exit code, a known error message (network timeout, image pull failed, runner lost), or a known-flaky test marked as such. Mergify CI Insights uses rule-based retry so the retry policy can be conditional rather than blanket.

How is auto-retry different from a retry plugin in my test framework?

Framework-level retry (pytest-retry, jest.retryTimes, the rspec-retry gem) runs inside the test process. CI-level auto-retry runs above it: it can re-run the entire job after a runner died, or after a Docker pull failed, neither of which the test framework can see. Both layers are useful, and they protect against different failure modes.

Can auto-retry slow down CI?

Yes, when retries replace fixes. A suite that depends on retry to be green costs 1.2 to 1.5 times its base runtime in steady state and is more variable in latency. Track retry rate over time; if it climbs, the suite is degrading.
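One simple way to see where a multiplier of that order can come from is to count the expected number of attempts per job. The sketch below is a simplified model, not a measurement: it assumes every attempt costs the full base runtime and fails independently with the same probability, and `expected_attempts` is a name chosen for this example.

```python
def expected_attempts(fail_rate: float, max_retries: int) -> float:
    """Expected runs per job when each failed attempt is retried, up to max_retries."""
    # Attempt k+1 only happens if the first k attempts all failed,
    # which has probability fail_rate ** k under the independence assumption.
    return sum(fail_rate ** k for k in range(max_retries + 1))
```

In this model, a job that fails 20% of the time with one retry allowed costs 1.2 times its base runtime on average, and a job that fails 30% of the time with two retries allowed costs about 1.39 times.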

Retry the noise. See the signal.

Mergify CI Insights ships rule-based auto-retry, a full event log, and per-job retry analytics, so retries become a managed policy instead of a quiet failure-masking habit.