Hypothesis is not flaky. Your code under test is, and Hypothesis is the messenger.

Why a property test that passed for months suddenly fails on a CI run with an example you cannot reproduce locally, the @example pattern that pins the failing case, and why the right response is never to delete the test.

A pytest test using Hypothesis has been green for six months. Today’s CI run fails with AssertionError: assert 9223372036854775807 == -9223372036854775808. The number on the right is -2^63, the smallest signed 64-bit integer. The test never used that number before. Tomorrow’s run is green again. Your first instinct is to retry until it passes. Resist it.

We see this pattern often enough on Mergify Test Insights that it earned its own slot in our flaky pytest catalog. The cause is Hypothesis discovering an edge case your code never handled. The fix is to add the case to the test, not to tame Hypothesis.

What you see

from hypothesis import given, strategies as st

@given(st.integers())
def test_round_trip(n):
    assert from_str(to_str(n)) == n

Hypothesis generates a sample of integers each run. Most runs pick easy cases — small positives, zero, small negatives. The test passes. Once every few hundred runs, Hypothesis explores deeper and picks -2^63, where to_str produces a string that from_str parses as 2^63 - 1 due to integer overflow in the parser. The assertion fails.

The frustrating part is the test passing on rerun. The next run picks a different sample, hits no boundary cases, returns green. CI marks it as a flake. The team ignores it. Next week the same pattern fires, the team retries again, and the underlying bug ships to production six weeks later when a customer happens to send the same value.
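The article does not show to_str or from_str, so the pair below is purely hypothetical: a sketch of one way a parser could produce this symptom, assuming it saturates at the signed 64-bit maximum and returns before applying the sign. Only the names to_str and from_str come from the test above; INT64_MAX and the parsing logic are illustrative assumptions.

```python
INT64_MAX = 2**63 - 1  # largest signed 64-bit value (assumed limit of the parser)


def to_str(n: int) -> str:
    """Serialize an integer as decimal text."""
    return str(n)


def from_str(s: str) -> int:
    """Hypothetical buggy parser: saturates on overflow and drops the sign."""
    negative = s.startswith("-")
    digits = s[1:] if negative else s
    value = 0
    for ch in digits:
        value = value * 10 + int(ch)
        if value > INT64_MAX:
            # Bug: returns early, before the sign is applied.
            return INT64_MAX
    return -value if negative else value


# Small values round-trip, which is why most runs are green.
for n in (-1000, -1, 0, 1, 1000):
    assert from_str(to_str(n)) == n

# The boundary value does not: the parser hands back 2^63 - 1.
assert from_str(to_str(-(2**63))) == 2**63 - 1
```

With this pair, the round trip only breaks once the magnitude exceeds INT64_MAX, so a run has to draw a boundary-sized integer before the assertion fails.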

Why the test is right and the code is wrong

Hypothesis is doing exactly what it advertises: generating examples that violate your assertion. The test asserted from_str(to_str(n)) == n for any integer. The function pair does not satisfy that for n = -2^63. The test discovered the bug. The bug existed before the test ran, and it will exist after.

The only thing the failing run gives you that the passing runs do not is the failing example. That is the data the bug fix needs. If you mark the test as flaky and stop looking, you throw away the only diagnostic.

The naive fix and why it is wrong

@given(st.integers(min_value=-1000, max_value=1000))
def test_round_trip(n):
    assert from_str(to_str(n)) == n

Restrict the strategy to numbers your code happens to handle correctly. The test goes green. The bug stays. The next user who passes a number outside [-1000, 1000] hits the bug in production.

import pytest

@pytest.mark.flaky(reruns=3)
@given(st.integers())
def test_round_trip(n):
    assert from_str(to_str(n)) == n

pytest-rerunfailures reruns the failing test. The next sample picks easier examples. The build goes green. The bug stays. This is the worst possible outcome: you have a test that proves the bug exists and a CI policy that suppresses the proof.

The fix that holds

Two steps. First, capture the failing example so you do not lose it. Hypothesis prints it in the failure output:

Falsifying example: test_round_trip(n=-9223372036854775808)

Add it to the test as an explicit @example:

from hypothesis import example, given, strategies as st

@given(st.integers())
@example(-(2**63))
def test_round_trip(n):
    assert from_str(to_str(n)) == n

The test now always runs the failing case in addition to the random sample. As long as from_str(to_str(n)) is broken for that input, the test fails on every run, not occasionally. The flakiness goes away because the failure is no longer probabilistic.

Second, fix the production code so the assertion holds. The test should pass on every run, including the explicit example, before you merge.
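As a hedged illustration of the second step, here is what a repaired parser could look like, assuming a hypothetical from_str whose bug was saturating at the 64-bit limit. The real fix depends on your own to_str and from_str, which the article does not show. The loop at the end plays the role of pinned regression cases, in the same spirit as @example.

```python
def to_str(n: int) -> str:
    """Serialize an integer as decimal text."""
    return str(n)


def from_str(s: str) -> int:
    """Hypothetical repaired parser: no saturation, sign applied last."""
    negative = s.startswith("-")
    digits = s[1:] if negative else s
    if not digits.isdigit():
        raise ValueError(f"not a decimal integer: {s!r}")
    value = 0
    for ch in digits:
        # Python integers are arbitrary precision, so this cannot overflow.
        value = value * 10 + int(ch)
    return -value if negative else value


# Boundary values worth pinning alongside the original failure.
PINNED = [-(2**63), -(2**63) - 1, 2**63 - 1, 2**63, 0, -1]
for n in PINNED:
    assert from_str(to_str(n)) == n
```

Pinning the neighbors of the failing value is a judgment call: they cost almost nothing to run and guard against a fix that only special-cases -2^63.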

Pin the seed in CI when you need to bisect

Hypothesis decides which examples to run based on a per-run seed. When you need the same examples on every run, for example while bisecting a regression, turn on derandomized mode so the examples are derived from the test itself instead of a fresh random seed:

from hypothesis import given, settings, strategies as st

@settings(derandomize=True)
@given(st.integers())
def test_round_trip(n):
    assert from_str(to_str(n)) == n

derandomize=True makes Hypothesis pick the same sample every run, deterministically. Useful in CI when you want runs to be reproducible. Use it sparingly — random sampling is what makes Hypothesis find new bugs.
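One way to see what derandomize=True does is to record the generated values twice and compare them. This is a sketch for exploration, not something to keep in a test suite. The helper names collect_examples and record are hypothetical, and database=None is set only so that no previously saved examples are replayed.

```python
from hypothesis import given, settings, strategies as st


def collect_examples():
    """Run one derandomized property and return the integers it received."""
    seen = []

    @settings(derandomize=True, max_examples=25, database=None)
    @given(st.integers())
    def record(n):
        seen.append(n)

    record()  # a @given-wrapped function can be called with no arguments
    return seen


first = collect_examples()
second = collect_examples()

# With derandomize=True, both runs should receive the same sequence of values.
assert first == second
```

Removing derandomize=True from the settings should make the two lists differ on most runs, which is the behavior that lets Hypothesis find new cases over time.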

To switch settings from the command line without editing the test, select a settings profile through an environment variable:

HYPOTHESIS_PROFILE=ci pytest tests/test_round_trip.py

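Register the ci profile in conftest.py. Registering a profile does not activate it, so conftest.py also has to load the profile that the variable names:

import os
from hypothesis import settings
settings.register_profile("ci", derandomize=True, max_examples=200)
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "default"))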

Now CI runs use a fixed seed and 200 examples per test. Local runs use the default profile (random seed, default example count) so engineers still find new bugs while iterating.

When the property is the wrong assertion

Sometimes Hypothesis finds an example your code legitimately should not handle. A function that takes a non-empty list is allowed to fail on []. The fix is to scope the strategy:

@given(st.lists(st.integers(), min_size=1))
def test_average(xs):
    assert min(xs) <= average(xs) <= max(xs)

This is different from restricting the strategy to hide a bug. Here the contract of the function under test is “non-empty list,” and the test now matches the contract. The previous failing example ([]) was not a bug — it was a test that asked the function to do something it never promised.

The judgment call: is the failing example a real input your code might see, or is it outside the function’s contract? If it is real, fix the code. If it is outside the contract, scope the strategy and document why.
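To make the contract concrete, here is a hypothetical average that states the non-empty requirement in code. The article does not show its own average, so the implementation, the ValueError, and the use of Fraction (chosen here so the bounds check is exact and not subject to float rounding) are all assumptions.

```python
from fractions import Fraction


def average(xs: list[int]) -> Fraction:
    """Return the exact mean of a non-empty list of integers.

    Contract: xs must be non-empty; an empty list raises ValueError.
    """
    if not xs:
        raise ValueError("average() requires a non-empty list")
    return Fraction(sum(xs), len(xs))


# Inside the contract, the property from the test holds.
for xs in ([5], [1, 2, 3], [-7, 7], [2**53 + 1]):
    assert min(xs) <= average(xs) <= max(xs)

# Outside the contract, failing loudly is the documented behavior.
try:
    average([])
except ValueError as exc:
    print(f"rejected as documented: {exc}")
```

Writing the precondition into the function makes the scoped strategy (min_size=1) a reflection of documented behavior, not a workaround.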

How Mergify catches this before you ship

Hypothesis failures are among the easiest to misclassify as flakes. The test passes 99% of the time. The failing 1% is a real bug. Without instrumentation, the team’s pattern-matching (“intermittent, retry it”) is exactly wrong here.

Test Insights treats Hypothesis-style failures as real bugs by default: a single failure on the default branch is enough to flag the test, and the dashboard surfaces the failing example from the failure output. You see “test_round_trip failed with n=-9223372036854775808” in the alert, not “test_round_trip flaky.”

Quarantine is reserved for tests where the cost of fixing exceeds the cost of the noise — for Hypothesis failures, that almost never applies, because the cost of fixing is small and the cost of ignoring is shipping the bug.

Property-test failures are signal, not noise. Point Mergify at your suite so they get treated that way. Native plugin: pytest-mergify. One pip install and you’re set.

More patterns like this

Hypothesis seed non-determinism is one of the eight patterns in the flaky-tests-in-pytest guide. The others are variants of the same theme: tests whose results depend on something the test author did not consciously control. Fixture teardown order, xdist worker scheduling, monkeypatch leakage, async event-loop scope. Cause and symptom usually live in different files.

Hypothesis is the rare one where the failures are honest. The test is telling you the truth. Your only job is to listen.