Flaky tests in JUnit.
Named, fixed, and quarantined.

Flaky JUnit suites are not random. They follow patterns: static state leaking across forks, Spring context cache evictions, Mockito mock bleed, real-clock assertions, leaked MockWebServers. Name them, fix them, quarantine what is left.
Your CI stays green.

By Rémy Duthu, Software Engineer, CI Insights · Published April 2026

Why JUnit is uniquely flaky

JUnit 5 is the assembly language of Java testing: a small core, a large extension surface, and a default model that runs every test in a single JVM unless you tell it otherwise. The single-JVM default is the appeal (cold start matters when you have ten thousand tests) and the source of most flakes. Static fields, the Spring context, the system clock, the system class loader, and any singleton that lives in static state survive across every test in the same fork.

Layer on the build tooling. Maven Surefire and Gradle's test runner add forking and parallelism on top of JUnit, and a suite that worked sequentially can race the moment two tests in the same fork touch the same static cache. @MockBean in a Spring Boot test changes the context cache key, so Spring builds another context, which can push the shared one out of the cache and break adjacent tests that still expect it.

The patterns are finite. We've seen the same eight on Mergify Test Insights across hundreds of JUnit suites: method-order assumptions without @TestMethodOrder, static-field state leaking across tests, Surefire forkCount and parallel resource races, Mockito mock leakage under @TestInstance(PER_CLASS), @SpringBootTest context-cache surprises, time-based assertions without an injected Clock, MockWebServer instances that never get shut down, and Surefire rerunFailingTestsCount hiding real bugs. Each has a clean fix once you can name it.

The 8 patterns behind most flaky suites

Pattern 1

Method-order assumptions without @TestMethodOrder

Symptom. A test class that has been green for years suddenly fails on a CI runner with a different JDK build, with one test asserting on state another test creates.

Root cause. JUnit 5 does not guarantee test method execution order by default. The platform picks an order at runtime that is deterministic but not promised across JDK versions, JVM startup options, or platform updates. A test class that implicitly depends on test ordering (test A creates a record, test B reads it) keeps passing until the order changes underneath you.

class UserServiceTest {
    private static User created;

    @Test
    void createsUser() {
        created = userService.create("Rémy");
        assertNotNull(created);
    }

    @Test
    void readsCreatedUser() {
        // assumes createsUser ran first; brittle
        User found = userService.findById(created.getId());
        assertEquals("Rémy", found.getName());
    }
}

Fix. Make every test independent. Build the prerequisite state in @BeforeEach or inside the test itself. When a test really must run after another (database migration smoke tests, end-to-end happy paths), declare it explicitly with @TestMethodOrder(OrderAnnotation.class) and give every method an @Order(n).

class UserServiceTest {

    @Test
    void createsUser() {
        User created = userService.create("Rémy");
        assertNotNull(created);
    }

    @Test
    void readsCreatedUser() {
        User created = userService.create("Rémy");
        User found = userService.findById(created.getId());
        assertEquals("Rémy", found.getName());
    }
}

With Mergify. Test Insights records the JVM and JDK version of every CI run and groups failures by their environment fingerprint. When a test class only fails after a JDK upgrade and the failures all sit in the same class, the dashboard surfaces the order-dependence pattern at PR time.

Pattern 2

Static-field state leaking across tests

Symptom. A test that mutates a static counter or a feature-flag map passes alone and fails when its class shares a Surefire fork with another test class.

Root cause. JUnit creates a fresh instance of the test class for each test method by default, but static fields live for the JVM. A static cache, a static "current user" set in test A, or a flag toggled in @BeforeAll survives every test that runs after, including in unrelated test classes that share the fork.

class FeatureFlagTest {
    static Map<String, Boolean> FLAGS = new HashMap<>();

    @BeforeAll
    static void setUp() {
        FLAGS.put("NEW_BILLING", true);
    }

    @Test
    void newBillingPathFires() {
        assertTrue(FLAGS.get("NEW_BILLING"));
    }
}

// elsewhere in the same fork
class LegacyBillingTest {
    @Test
    void legacyChargeReturnsCorrectAmount() {
        // expected: NEW_BILLING == false (default). Actual: true, leaked.
        assertEquals(100, billingService.charge(...));
    }
}

Fix. Move shared state into instance fields and reset it in @AfterEach, or use @TestInstance(PER_CLASS) with explicit cleanup. For genuine global state (system properties, time zones), use a JUnit extension that snapshots and restores around each test.

class FeatureFlagTest {
    private FeatureFlags flags;

    @BeforeEach
    void setUp() {
        flags = new FeatureFlags();
        flags.set("NEW_BILLING", true);
    }

    @Test
    void newBillingPathFires() {
        assertTrue(flags.get("NEW_BILLING"));
    }
}
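
For the genuinely global state mentioned in the fix, a snapshot-and-restore helper is one way to keep a mutation from outliving the test that made it. The sketch below is a minimal illustration, not a library API: `GlobalStateSnapshot` is a hypothetical name, it covers only system properties and the default time zone, and it sticks to plain JDK calls so it runs on its own. In a JUnit 5 suite you would call `capture()` from a `BeforeEachCallback` and `close()` from the matching `AfterEachCallback`.

```java
import java.util.Properties;
import java.util.TimeZone;

// Hypothetical helper: captures JVM-wide state and puts it back on close().
final class GlobalStateSnapshot implements AutoCloseable {
    private final Properties savedProperties;
    private final TimeZone savedTimeZone;

    private GlobalStateSnapshot() {
        // clone() copies the current entries, so later mutations do not touch the snapshot
        this.savedProperties = (Properties) System.getProperties().clone();
        this.savedTimeZone = TimeZone.getDefault();
    }

    static GlobalStateSnapshot capture() {
        return new GlobalStateSnapshot();
    }

    @Override
    public void close() {
        // Restore everything the test may have changed, including keys it added
        System.setProperties(savedProperties);
        TimeZone.setDefault(savedTimeZone);
    }
}
```

Wrapped in try-with-resources, whatever the body sets is gone once the block exits, whether it ends normally or with an exception.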

With Mergify. Test Insights detects the cross-class signature: test class B fails consistently when class A ran earlier in the same fork, and never when it runs alone. The dashboard groups failures by their predecessor so the leaking-static pattern is the obvious lead.

Pattern 3

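Surefire forkCount and parallel resource races

Symptom. A test suite that runs green sequentially fails once tests run in parallel (four threads in one JVM, or several Surefire forks side by side) on assertions involving counters, ports, or temp directory paths.

Root cause. Parallelism comes in two layers. In-JVM parallel execution runs several tests concurrently on threads that share every static field; Surefire's forkCount runs several JVMs side by side that share the machine. Anything those tests share (a static counter, an in-memory database with a fixed name, a hardcoded port) becomes a race the moment parallelism kicks in. The test does not change; the build configuration does. One caveat, as far as the Surefire documentation goes: its own parallel and threadCount settings target the JUnit 4.7+ and TestNG providers, while JUnit 5 suites get in-JVM parallelism from the junit.jupiter.execution.parallel.* parameters.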

<!-- pom.xml -->
<plugin>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <parallel>methods</parallel>
    <threadCount>4</threadCount>
    <forkCount>1</forkCount>
  </configuration>
</plugin>

// CounterTest.java
class CounterTest {
    static AtomicInteger counter = new AtomicInteger();

    @Test
    void incrementsToOne() {
        int v = counter.incrementAndGet();
        assertEquals(1, v); // fails when 4 threads share the same counter
    }
}

Fix. Pick a parallelism mode that matches the suite's isolation guarantees. JUnit 5's per-class concurrency with one thread per class avoids method-level interleaving while still parallelizing across classes. For shared external resources (ports, files, database schemas), use unique names per test (@TempDir, ephemeral ports, namespace by TestInfo).

# junit-platform.properties
junit.jupiter.execution.parallel.enabled = true
junit.jupiter.execution.parallel.mode.default = same_thread
junit.jupiter.execution.parallel.mode.classes.default = concurrent
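
One way to get the unique names the fix calls for, sketched with plain JDK calls. `IsolatedResources` and its method names are hypothetical; inside a JUnit 5 test, `@TempDir` would normally replace `uniqueWorkDir` (and clean up afterwards), and the test name could come from `TestInfo`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: hands each test its own port and its own directory.
final class IsolatedResources {

    // Port 0 asks the OS for a free ephemeral port. Keep the returned socket
    // open while the port is in use; closing it first reopens the race.
    static ServerSocket listenOnFreePort() {
        try {
            return new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a fresh directory whose name starts with the test name, so two
    // tests (or two runs of the same test) never write to the same path.
    static Path uniqueWorkDir(String testName) {
        try {
            return Files.createTempDirectory(testName + "-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading the port back with `getLocalPort()` and passing it to the code under test replaces any hardcoded port, which is the part that breaks under parallel runs.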

With Mergify. Test Insights tags failures that appear only under parallel runs, and never under sequential reruns, as parallelism-sensitive. The dashboard surfaces the parallel-only signature so you know the test is not broken on its own.

Pattern 4

Mockito mock leakage under PER_CLASS lifecycle

Symptom. A spy or mock you set up in one test starts returning canned values in a sibling test that never declared it, and `@BeforeEach` resets do not help.

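Root cause. JUnit 5's default lifecycle creates a fresh test instance per method, so @Mock fields are recreated and stubbing does not leak. Switch to @TestInstance(Lifecycle.PER_CLASS) (often added when a class needs a non-static @BeforeAll or to amortize setup) and the same instance is shared across every method. Anything that instance built once carries over into the next test: the @InjectMocks subject, mocks created by hand in a field initializer or @BeforeAll, and any stubbing set on them. Exactly what leaks depends on the Mockito version and on how the subject receives its collaborators, so confirm the behavior against your own setup.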

@ExtendWith(MockitoExtension.class)
@TestInstance(Lifecycle.PER_CLASS) // one instance for the whole class
class UserServiceTest {
    @Mock UserRepository repo;
    @InjectMocks UserService service;

    @Test
    void findsUserById() {
        when(repo.findById(1L)).thenReturn(Optional.of(new User("Rémy")));
        assertEquals("Rémy", service.find(1L).getName());
    }

    @Test
    void returnsEmptyWhenNotFound() {
        // repo.findById(1L) still returns the stub from the previous test
        // because the instance (and its mocks) is shared under PER_CLASS
        assertTrue(service.find(2L).isEmpty());
    }
}

Fix. Default to JUnit 5's per-method lifecycle so mocks are recreated per test. When you need @TestInstance(PER_CLASS) for non-static @BeforeAll hooks, reset mocks explicitly in @AfterEach with Mockito.reset(...) so stubbing does not carry across tests.

@ExtendWith(MockitoExtension.class)
@TestInstance(Lifecycle.PER_CLASS)
class UserServiceTest {
    @Mock UserRepository repo;
    @InjectMocks UserService service;

    @AfterEach
    void resetMocks() {
        Mockito.reset(repo); // clear stubbing between tests
    }

    @Test
    void findsUserById() {
        when(repo.findById(1L)).thenReturn(Optional.of(new User("Rémy")));
        assertEquals("Rémy", service.find(1L).getName());
    }

    @Test
    void returnsEmptyWhenNotFound() {
        assertTrue(service.find(2L).isEmpty());
    }
}

With Mergify. Test Insights groups failures whose only signature is `UnnecessaryStubbingException` or assertion-on-stubbed-default into a single mock-leakage bucket. The dashboard surfaces the offending test method so the missing reset is one fix instead of N.

Pattern 5

@SpringBootTest context-cache surprises

Symptom. A Spring Boot integration test passes alone, fails when the suite runs end to end with `Failed to load ApplicationContext`, and a rerun of the failing class passes.

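Root cause. Spring caches ApplicationContext instances, keyed by their configuration. @MockBean, @DynamicPropertySource, and @TestPropertySource all change that key, so every distinct combination builds another context instead of reusing the cached one. The cache is bounded (32 contexts by default, tunable through spring.test.context.cache.maxSize); once it is full, the least recently used context is closed. Anything still holding a bean from that closed context then fails, often with a BeanCreationException, and the contexts that stay alive compete for the same ports, connection pools, and memory.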

@SpringBootTest
class CheckoutTest {
    @MockBean PaymentClient payments; // forces a fresh context

    @Test
    void chargesCard() { ... }
}

@SpringBootTest
class InventoryTest {
    @Autowired InventoryService inventory; // expected to be from the cached context

    @Test
    void decrementsStock() {
        // enough distinct configurations (CheckoutTest's @MockBean among them)
        // pushed the shared context out of the cache; inventory was wired
        // against the now-closed context → fails
    }
}

Fix. Group tests that share configuration into a base class so they reuse one context. When you genuinely need a different bean shape, accept the reload cost rather than scattering @MockBean across many test classes. Spring logs context cache statistics (enable DEBUG for the org.springframework.test.context.cache logging category) that show how many contexts your suite builds.

// Shared base class
@SpringBootTest
abstract class IntegrationTestBase {
    @MockBean PaymentClient payments; // declared once
}

class CheckoutTest extends IntegrationTestBase { ... }
class InventoryTest extends IntegrationTestBase { ... }

With Mergify. Test Insights catches the order-dependent signature: a Spring test only fails when run after a specific other test class that shifts the context configuration. The dashboard tags the dependency so the @MockBean blast radius is visible.

Pattern 6

Time-based assertions without an injected Clock

Symptom. A test that asserts an event happens "within the last second" passes locally and fails on the slower CI runner with a timestamp 1.2 seconds old.

Root cause. Calling Instant.now() or LocalDateTime.now() directly inside production code reads the system clock. A test that asserts on a freshly created timestamp races against scheduler jitter. Locally the assertion fires within a millisecond; on CI the same code runs after a longer pause and the assertion's tolerance window is too tight.

class AuditLogger {
    void log(String event) {
        store.put(event, Instant.now()); // direct clock read
    }
}

@Test
void auditTimestampIsRecent() {
    auditLogger.log("checkout");
    Instant logged = store.get("checkout");
    assertTrue(Instant.now().isBefore(logged.plusSeconds(1))); // racy
}

Fix. Inject a java.time.Clock into the production class so tests can swap in Clock.fixed(...). The production wiring uses Clock.systemUTC(); the tests use a frozen clock and assert exactly.

class AuditLogger {
    private final Clock clock;
    AuditLogger(Clock clock) { this.clock = clock; }

    void log(String event) {
        store.put(event, clock.instant());
    }
}

@Test
void auditTimestampIsRecent() {
    Clock fixed = Clock.fixed(Instant.parse("2026-01-01T00:00:00Z"), ZoneOffset.UTC);
    AuditLogger logger = new AuditLogger(fixed);
    logger.log("checkout");
    assertEquals(fixed.instant(), store.get("checkout"));
}
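
A frozen clock covers exact assertions. For logic that depends on elapsed time (the "within the last second" check from the symptom), a clock the test advances by hand makes the boundary deterministic. This is a minimal sketch, assuming hypothetical `MutableClock` and `Recency` helpers that are not part of the original example; it is not thread-safe and is meant for single-threaded tests.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;

// Hypothetical test clock: time moves only when the test calls advance().
final class MutableClock extends Clock {
    private Instant now;
    private final ZoneId zone;

    MutableClock(Instant start, ZoneId zone) {
        this.now = start;
        this.zone = zone;
    }

    @Override
    public ZoneId getZone() {
        return zone;
    }

    @Override
    public Clock withZone(ZoneId newZone) {
        return new MutableClock(now, newZone);
    }

    @Override
    public Instant instant() {
        return now;
    }

    void advance(Duration step) {
        now = now.plus(step);
    }
}

// Hypothetical production-side check that reads time only through the Clock.
final class Recency {
    static boolean isRecent(Instant logged, Clock clock, Duration window) {
        Instant cutoff = clock.instant().minus(window);
        return !logged.isBefore(cutoff); // true while logged is inside the window
    }
}
```

Advancing the clock by 900 ms and then by another 300 ms walks the check across the one-second boundary without a single `Thread.sleep`, so the result no longer depends on how busy the runner is.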

With Mergify. Test Insights links timing failures to their CI runner type. When a test only fails on the slower runner pool and the failure is always within a small tolerance window, the dashboard surfaces the real-clock dependency.

Pattern 7

MockWebServer instances that never get shut down

Symptom. A long suite run eventually fails with `BindException: Address already in use` or `Too many open files`, and a rerun of the failing class passes.

Root cause. MockWebServer binds an ephemeral port and starts a background dispatcher thread. Forgetting to call server.shutdown() in @AfterEach leaks both. After enough tests the JVM hits the file-descriptor limit and the next test fails on socket creation.

class HttpClientTest {
    @Test
    void postsJson() throws Exception {
        MockWebServer server = new MockWebServer();
        server.start();
        server.enqueue(new MockResponse().setBody("{}"));
        // no shutdown
        client.post(server.url("/").toString(), "{}");
    }
}

Fix. Lift the server into a field with @BeforeEach + @AfterEach, or register it as a JUnit 5 extension so cleanup is guaranteed even when the test fails.

class HttpClientTest {
    private MockWebServer server;

    @BeforeEach
    void setUp() throws IOException {
        server = new MockWebServer();
        server.start();
    }

    @AfterEach
    void tearDown() throws IOException {
        server.shutdown();
    }

    @Test
    void postsJson() throws Exception {
        server.enqueue(new MockResponse().setBody("{}"));
        client.post(server.url("/").toString(), "{}");
    }
}
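
When a server is needed by a single test only, try-with-resources gives the same shutdown guarantee as @AfterEach without a field. The sketch below uses the JDK's built-in `com.sun.net.httpserver.HttpServer` as a stand-in for MockWebServer so that it runs without extra dependencies; `StubHttpServer` and `fetchOnce` are hypothetical names, and the stub assumes a non-empty response body. Check that your MockWebServer version implements `Closeable` before applying the same shape to it.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for MockWebServer, built on the JDK's HttpServer.
final class StubHttpServer implements AutoCloseable {
    private final HttpServer server;

    StubHttpServer(String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        // Port 0 asks the OS for a free ephemeral port.
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
    }

    URI url(String path) {
        return URI.create("http://127.0.0.1:" + server.getAddress().getPort() + path);
    }

    @Override
    public void close() {
        server.stop(0); // releases the port and stops the server's dispatcher thread
    }

    // Starts a server, fetches "/" once, and shuts the server down even if the
    // request throws. Resources close in reverse order: stream first, then server.
    static String fetchOnce(String body) {
        try (StubHttpServer stub = new StubHttpServer(body);
             InputStream in = stub.url("/").toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The design point is that cleanup is tied to scope rather than to remembering a teardown call, which is the same property a JUnit 5 extension would give you across a whole class.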

With Mergify. Test Insights groups failures whose only signature is `BindException` or `Too many open files` into an fd-exhaustion bucket and surfaces the test that first triggered the leak rather than the random victim that ran when the JVM ran out.

Pattern 8

Surefire rerunFailingTestsCount hiding real bugs

Symptom. Your build is green. A user hits a bug your tests were supposed to catch.

Root cause. surefire.rerunFailingTestsCount reruns each failing test up to N times and passes the build as soon as one attempt succeeds. A real race that loses on attempt 1 and wins on attempt 2 shows up only as a "Flakes" count in the Surefire summary, under a green build. The bug is still there. The build has decided not to look at it.

<plugin>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <rerunFailingTestsCount>3</rerunFailingTestsCount>
  </configuration>
</plugin>

Fix. Do not retry at the build level. When a test is genuinely flaky, fix it. When the fix takes longer than a session, quarantine it instead. That keeps the signal visible without blocking the merge queue.

With Mergify. Test Insights reruns at the CI level with attempt-level result tracking. You see that a test passed on attempt 2 of 3, which is exactly the information a green `rerunFailingTestsCount` build lets everyone ignore. Quarantine kicks in once the pattern is clear.
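
To make the attempt-level idea concrete, here is a small sketch of how retry outcomes could be classified instead of collapsed into a single pass or fail. It is an illustration only, not how Test Insights scores tests; `Verdict` and `AttemptClassifier` are hypothetical names.

```java
import java.util.List;

enum Verdict { STABLE, FLAKY, BROKEN }

// Hypothetical classifier: keeps every attempt instead of only the last one.
final class AttemptClassifier {

    // attempts holds one entry per run of the same test on the same commit,
    // true for a pass and false for a failure.
    static Verdict classify(List<Boolean> attempts) {
        if (attempts.isEmpty()) {
            throw new IllegalArgumentException("no attempts recorded");
        }
        long passes = countPasses(attempts);
        if (passes == attempts.size()) {
            return Verdict.STABLE; // passed every time
        }
        if (passes == 0) {
            return Verdict.BROKEN; // failed every time: a real, reproducible failure
        }
        return Verdict.FLAKY;      // mixed results on identical code
    }

    static double passRate(List<Boolean> attempts) {
        if (attempts.isEmpty()) {
            throw new IllegalArgumentException("no attempts recorded");
        }
        return (double) countPasses(attempts) / attempts.size();
    }

    private static long countPasses(List<Boolean> attempts) {
        return attempts.stream().filter(Boolean::booleanValue).count();
    }
}
```

A build-level retry would report `[false, true]` as a pass; keeping the whole list is what lets the same data be read as flaky.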

Detection

Catch every JUnit flake in CI

Surefire and Gradle both emit JUnit XML out of the box. Point Mergify at the XML output of every CI run and Test Insights builds a confidence score for every test on your default branch. PR runs are compared against that baseline. Anything inconsistent gets flagged in a PR comment before the author merges.

mergify ci

# 1. Surefire/Gradle already emit JUnit XML by default. Locate the file:
# Maven: target/surefire-reports/TEST-*.xml
# Gradle: build/test-results/test/TEST-*.xml

# 2. Upload the result (once, in CI)
curl -sSL https://get.mergify.com/ci | sh
mergify ci junit upload target/surefire-reports/TEST-*.xml

Prevention

Block flaky JUnit tests at PR time

On every PR, Mergify reruns the tests whose confidence is below threshold, so you do not need `rerunFailingTestsCount` in your Surefire config. The PR gets a comment naming the unreliable tests, their confidence history, and whether the failure on this PR is new or historical noise. Authors fix the real bugs before merge instead of re-running CI until it passes.

[Image: Mergify Test Insights Prevention view showing caught flaky JUnit tests per PR]

Quarantine

Quarantine without skipping

Once a JUnit test is confirmed flaky, Test Insights quarantines it. The test still runs in the suite, no `@Disabled` rewrite required, but its result no longer blocks merges or marks the pipeline red. When the pass rate on main recovers, quarantine lifts automatically and the test goes back to being load-bearing.

Want to see which JUnit tests in your repo are already flaky?

Works with Surefire, Gradle, or any tool that emits JUnit XML. No extra plugins required. Setup takes under five minutes.


Frequently asked questions

Why are my JUnit tests flaky in CI but pass locally?

Your laptop and your CI runner differ in CPU count, parallelism, JDK build, and the order JUnit happens to pick for test methods. Tests that race on static state, hold references to a Spring context that another test evicts, or read the system clock surface those issues only under CI's tighter resource budget. Reproduce locally with `mvn test -T 1C -Dsurefire.rerunFailingTestsCount=0` (or the same Surefire config CI uses) to expose the failure, then fix the underlying coupling before pushing.

How do I detect flaky JUnit tests?

JUnit alone cannot tell flaky from broken since each run gives one data point per test. You need to run the same commit multiple times and compare results. Mergify Test Insights does that on every PR and on the default branch, scores each test, and surfaces the tests whose pass rate drops below a confidence threshold.

Does Surefire's `rerunFailingTestsCount` fix flaky tests?

No, it hides them. A test that fails on attempt 1 and passes on attempt 2 is still broken; you have only decided not to look at the failure. Use Surefire's `rerunFailingTestsCount` as a temporary bandage for a test you are actively fixing, never as a permanent policy. For visibility without blocking the merge queue, quarantine instead of retry.

What causes flaky tests in JUnit?

Eight patterns cover most of what we see: method-order assumptions without @TestMethodOrder, static-field state leaking across tests, Surefire forkCount and parallel resource races, Mockito mock leakage under @TestInstance(PER_CLASS), @SpringBootTest context-cache surprises, time-based assertions without an injected Clock, MockWebServer instances that never get shut down, and Surefire rerunFailingTestsCount hiding real bugs. Each is covered above with a minimal reproducer.

How do I quarantine a flaky JUnit test without deleting it?

Mergify Test Insights quarantines the test automatically once its confidence score drops. The test still runs in the suite, but a failing result no longer blocks merges and its noise no longer drowns out real signal. When the test stabilizes on main, quarantine lifts automatically. No `@Disabled`, no commented-out tests, no orphaned files.

How do I make JUnit 5 run tests in a deterministic order?

Annotate the class with `@TestMethodOrder(MethodOrderer.OrderAnnotation.class)` and add `@Order(N)` to each method. But ordering is almost always a smell: a test class that needs ordering is a chain of dependent tests pretending to be independent. Refactor the prerequisite into `@BeforeEach` so each test stands alone, and reserve `@TestMethodOrder` for the rare case (a smoke test that walks an end-to-end happy path) where ordering is the assertion.

Ship your JUnit suite green.

2k+ organizations use Mergify to merge 75k+ pull requests a month without breaking main.

