Flaky Tests In CI: Detection And Prevention

So, you've pushed a green PR. You merge it. Ten minutes later the deploy pipeline fails on a test that has nothing to do with your change. You re-run the job. It passes. You go back to whatever you were doing, and a quiet part of your brain marks that test as "the one that does that sometimes."

That's a flaky test. And once a codebase has one, it has dozens, because the moment a team learns that "just hit re-run" is a valid response to a CI failure, every test on the boundary of stability gets quietly granted the same allowance. The signal value of red collapses. People stop reading failures. Real bugs slip through because everyone's eyes glaze over at the eighth failure of the week. The culture of "green means safe to merge" is the first casualty.

Flaky tests are not a hygiene problem. They're a feedback-system problem. When the test suite can't tell you the difference between "your change broke something" and "the universe was in a mood," the suite has stopped being a test suite and become a coin flip with a CI badge attached.

Let's get specific about why this happens and what to do about it. Not at the "add retries and quarantine the flakes" level, that's the cargo cult answer. We're going to look at what actually causes flakes (it's not what most retry policies assume), how to detect them statistically instead of by gut feel, and the isolation, data, and parallelism patterns that prevent new ones from being born in the first place.

What "Flaky" Actually Means

A flaky test is one that produces different outcomes, pass or fail, on the same code, in the same environment, without the test itself or the code under test changing. That's the definition. It sounds obvious. The reason it matters is that "same environment" hides almost every interesting failure mode.

Two test runs are never truly in the same environment. They share the source tree, but they don't share:

The exact moment they ran (timezones, clock skew, daylight-saving boundaries, leap seconds).
The state of any shared resource, a database row, a Redis key, a temp file, a port number.
The order other tests ran before them in the same process.
The amount of memory and CPU the runner happened to have free.
The version of any service the test reaches out to over the network.
The random seed used by the test framework or the code under test.

A "flaky" test is one whose pass/fail depends on at least one of these uncontrolled inputs. The test isn't lying, it's accurately reporting that your code's behavior depends on something you didn't realize it depended on. Which means most flakes are real bugs in disguise, just bugs about state and timing and ordering rather than logic.

That reframe matters because it changes what you're going to do about it. If a flake is "a bad test," you fix it by adding retries or weakening the assertion. If a flake is "a real bug about a hidden input," you fix it by finding the hidden input and either controlling for it or making the test loud about it. These two responses produce very different codebases six months out.

The Four Kinds Of Flake

Almost every flaky test in real CI falls into one of four buckets. They look identical from the outside, a red bar that turns green when you re-run, but they fail for completely different reasons, and the fix for one is the wrong fix for another. If you don't separate them, you'll end up adding retries to a test that needed proper teardown, or quarantining a test that was correctly reporting a race condition in production code.

Four kinds of flake: a 2x2 taxonomy comparing timing/order, shared state, external dependency, and nondeterminism

Bucket one, timing and order. A test passes when step A finishes before step B and fails when it doesn't. The classic example is a UI test that asserts on an element being visible without waiting for it to render. On a fast CI runner the element appears in 20ms and the test catches it; on a slow runner under load it appears in 200ms and the test fails. Same code, same test, different timings.

Order-dependent flakes are the same family, a test passes when it runs after another test that warmed up some cache or created some row, and fails when the runner randomizes the order. These flakes are usually exposed when someone enables --randomize on the test runner for the first time and three hundred tests light up.

Bucket two, shared state. A test reads or writes something another test also reads or writes, and the order of access changes the outcome. Classic offenders: a global database without transactional isolation, a singleton in-memory cache, a temp file with a hardcoded path, a localStorage value, an environment variable mutated mid-test, a port number that two tests both try to bind. The first test leaves the shared resource in a state the second test didn't expect.

This is the flake that explodes when you turn on parallel execution. Tests that were quietly sharing state when they ran sequentially suddenly stomp on each other's data, and what was a stable suite becomes a casino.

Bucket three, external dependency. The test reaches out to something it doesn't control, a third-party API, a DNS resolver, an OS package mirror, a container registry, a Selenium grid, a flaky internal microservice, and that thing fails or slows down. The test is correctly reporting that the dependency was unreliable. The test's only sin is having a real network in its call graph.

Bucket four, nondeterminism in the code or the data. The code under test or the test setup uses something genuinely random, Math.random(), time.Now(), uuid.uuid4(), an unordered map iteration, a hash set being compared element-by-element instead of as a set. The test's assertion sometimes happens to match the random output and sometimes doesn't.

Each bucket has a different fix, and using the wrong one makes things worse:

Timing/order → make the test wait on a condition, not a duration. Add proper polling. Eliminate sleep() from test code with extreme prejudice.
Shared state → isolate the resource. Per-test transactions, per-test schemas, per-test temp dirs, per-test ports.
External dependency → stub the dependency at the boundary, or move that test into a separate "integration" tier that's allowed to fail noisily without blocking deploys.
Nondeterminism → seed the randomness. Inject the clock. Compare unordered collections as sets. Treat "this could come back in any order" as a feature of the assertion, not a bug to retry your way past.

Get the bucket wrong and you patch over a symptom while the root cause keeps producing new flakes elsewhere.

Detection: Stop Trusting Your Gut

Most teams "detect" flakes by remembering. Someone mentions in standup that the auth tests have been weird this week. Someone else nods. Maybe a Slack message goes out. This is approximately as reliable as detecting outages by checking which engineer looks tired.

A flake is a statistical property of a test, not a binary one. The right tool is a rerun history. Every CI run records: which tests ran, which passed, which failed, and on a re-run, which changed result without a code change. Aggregate this over a week or a month and you get a per-test flake rate, the probability that a given test fails on a build whose code is otherwise green.

You don't need a fancy platform to start. A nightly job that scrapes the CI API and writes one row per test result into a small table gets you 80% of the way. Once you have the data, sort by flake rate descending, and you've got your fix-or-quarantine list.

Here's the shape of the detection logic in a few languages, all doing the same thing, pull recent test results and flag tests whose failure rate is above some threshold while their pass rate is also non-trivial (a test that always fails isn't flaky, it's broken).

type Result = { test: string; passed: boolean; ranAt: Date };

function flakeRate(results: Result[]) {
  const byTest = new Map<string, Result[]>();
  for (const r of results) {
    if (!byTest.has(r.test)) byTest.set(r.test, []);
    byTest.get(r.test)!.push(r);
  }

  const flakes: { test: string; runs: number; failRate: number }[] = [];
  for (const [test, runs] of byTest) {
    if (runs.length < 20) continue; // not enough signal
    const fails = runs.filter((r) => !r.passed).length;
    const failRate = fails / runs.length;
    // flaky window: fails sometimes but not always
    if (failRate >= 0.02 && failRate <= 0.5) {
      flakes.push({ test, runs: runs.length, failRate });
    }
  }
  return flakes.sort((a, b) => b.failRate - a.failRate);
}

from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Result:
    test: str
    passed: bool
    ran_at: datetime

def flake_rate(results: list[Result]) -> list[tuple[str, int, float]]:
    by_test: dict[str, list[Result]] = defaultdict(list)
    for r in results:
        by_test[r.test].append(r)

    flakes = []
    for test, runs in by_test.items():
        if len(runs) < 20:
            continue
        fails = sum(1 for r in runs if not r.passed)
        fail_rate = fails / len(runs)
        if 0.02 <= fail_rate <= 0.5:
            flakes.append((test, len(runs), fail_rate))

    return sorted(flakes, key=lambda x: x[2], reverse=True)

type Result struct {
    Test   string
    Passed bool
    RanAt  time.Time
}

type Flake struct {
    Test     string
    Runs     int
    FailRate float64
}

func FlakeRate(results []Result) []Flake {
    byTest := map[string][]Result{}
    for _, r := range results {
        byTest[r.Test] = append(byTest[r.Test], r)
    }

    var flakes []Flake
    for test, runs := range byTest {
        if len(runs) < 20 {
            continue
        }
        fails := 0
        for _, r := range runs {
            if !r.Passed {
                fails++
            }
        }
        rate := float64(fails) / float64(len(runs))
        if rate >= 0.02 && rate <= 0.5 {
            flakes = append(flakes, Flake{test, len(runs), rate})
        }
    }
    sort.Slice(flakes, func(i, j int) bool {
        return flakes[i].FailRate > flakes[j].FailRate
    })
    return flakes
}

The cutoffs matter less than having them at all. The point is to move flake conversations from "I think the auth suite has been weird" to "auth/login_spec.ts:should keep session across reload failed in 14 of 217 runs over the last 30 days; here's a link to every failure log." That second sentence is actionable. The first one is gossip.

A few common platforms have first-class support for this and you should use it before building your own:

GitHub Actions doesn't surface flake rates natively, but the test reporter actions can attach JUnit XML to a build, and aggregating that across runs is a Lambda function and a small dashboard away.
GitLab CI has Flaky test detection built into Premium tiers, it reads JUnit reports and flags tests whose status changed without code changes.
CircleCI has Flaky test detection in its Insights tab, same idea, same JUnit dependency.
Jenkins has the Flaky Test Handler plugin and the broader Test Results Analyzer plugin for trend visualization.
Buildkite has Test Analytics as a separate product that ingests results from any runner and computes per-test reliability.

The pattern under all of them is the same: emit structured test results (JUnit XML is the lingua franca), persist them across runs, and compute reliability over a sliding window. If your platform doesn't do it for free, a hundred lines of Python and a Postgres table will.

Retries: A Tool, Not A Policy

Test retries are the most-reached-for and most-misused flake response in the industry. They're not bad, they have a legitimate role, but the way most teams configure them turns the suite from a signal into a slot machine.

The honest framing: a retry is information loss. When you run a test once and it fails, you have one bit of data, "this failed." When you run it three times and accept the first pass, you have approximately zero bits of data, because you've thrown away the information that it failed at all. You've smoothed out the surface that was trying to tell you about a real problem.

That doesn't mean retries are wrong. It means you have to decide what you're using them for.

Three legitimate uses:

Decouple deploys from infrastructure noise. A git clone, an npm install, a docker pull, or a Selenium grid spin-up genuinely can fail for reasons unrelated to your code. Retrying setup steps, not test bodies, buys you reliability without hiding real signal. Most CI platforms support retries at the job level for exactly this; use that, not test-level retries.
Confirm a suspected flake. A test that failed on the first try and passed on the second is interesting. You can configure the runner to report both results and surface the discrepancy. Buildkite's auto-retry is configured this way by default, the test is reported as flaky, not passed, so the test was useful, you didn't lose the signal, and the build can still ship if your policy allows.
Smooth deploys while you fix the root cause. A test that you know is flaky, that you've filed a ticket for, and that you've given a deadline to fix can be retried to keep the team unblocked in the short term. The danger here is the ticket sits in the backlog for a year and the retry becomes permanent. If you do this, write the retry config so it expires, comment it with a date, or better, add an automated lint rule that fails the build if a retry annotation is older than 30 days.

The wrong use, which is by far the most common, is "retry everything three times so green is more common." Here's what that actually does:

A test with a 10% flake rate (fails one run in ten) now passes after retries with probability 1 - 0.1³ = 99.9%. Your suite looks reliable.
The real bug, a race condition that fires under load 10% of the time, also fires in production 10% of the time. Customers see it. You don't.
A regression that breaks the test 100% of the time still fails after three retries, so you still catch hard breaks. This is the part that makes retries feel safe, they don't hide every kind of failure.
They hide exactly the failures that matter most: the intermittent ones, which are usually about state and timing, which are usually production bugs.

The shape of a sane retry configuration is narrow. Retry only specific tests that you've tagged as known flakes, not the whole suite. Report retried tests as flaky in the build summary so they don't vanish. Keep the retry count low, two attempts is plenty; three is suspicious; five is a confession that you've given up.

Here's roughly what that looks like across a few common test runners.

For Jest, retries are per-test via jest.retryTimes(), ideally scoped:

JavaScript describe-block.test.js

describe("flaky integration: payment webhook", () => {
  // Tracked in TICKET-1421; remove this by 2026-07-01.
  jest.retryTimes(2, { logErrorsBeforeRetry: true });

  it("acks the webhook within 5s", async () => {
    // ...
  });
});

For pytest, the pytest-rerunfailures plugin lets you mark a single test:

Python test_webhook.py

import pytest

# Tracked in TICKET-1421; remove this by 2026-07-01.
@pytest.mark.flaky(reruns=2, reruns_delay=1)
def test_webhook_ack_within_5s():
    ...

For Go, the standard testing package doesn't have built-in retry, which is mostly a feature, the Go community pressure is to fix the test. If you must, wrap with testing.Run loops and an explicit attempt counter, and resist the temptation to bake it into a helper that everyone reaches for.

For Playwright, retries are configured per-project and respected for the whole suite, useful for E2E where infrastructure noise dominates, dangerous for unit tests:

TypeScript playwright.config.ts

import { defineConfig } from "@playwright/test";

export default defineConfig({
  // E2E only: account for browser launch + page load variability.
  // Failing tests are reported as "flaky", not "passed".
  retries: process.env.CI ? 2 : 0,
});

Isolation: The Boring Fix That Works

If you only have time to fix one category of flake, fix isolation. Most "shared state" flakes, bucket two from the taxonomy, are caused by tests treating expensive resources as if they were free. A test that creates a row in the users table and never cleans up. A test that writes to /tmp/cache.db and another test that reads from /tmp/cache.db. A test that mutates process.env and forgets to restore it.

The fix isn't subtle. Every test gets its own slice of every resource it touches, and the slice is destroyed when the test ends. The fixes are slightly different per resource:

Database, wrap each test in a transaction that rolls back. Almost every test framework has a pattern for this. The test starts a transaction, the test code does its inserts and updates, the test assertion runs, and the transaction is rolled back regardless of pass or fail. The next test starts with the same baseline.

import { db } from "../src/db";

beforeEach(async () => {
  await db.query("BEGIN");
});

afterEach(async () => {
  await db.query("ROLLBACK");
});

import pytest
from myapp.db import connection

@pytest.fixture(autouse=True)
def db_transaction():
    conn = connection()
    trans = conn.begin()
    try:
        yield conn
    finally:
        trans.rollback()

func withTx(t *testing.T, fn func(*sql.Tx)) {
    t.Helper()
    tx, err := db.Begin()
    if err != nil {
        t.Fatal(err)
    }
    defer tx.Rollback()
    fn(tx)
}

func TestCreateUser(t *testing.T) {
    withTx(t, func(tx *sql.Tx) {
        u, err := repo.CreateUserTx(tx, "alice@example.com")
        if err != nil {
            t.Fatal(err)
        }
        if u.Email != "alice@example.com" {
            t.Fatalf("got %q", u.Email)
        }
    })
}

The transactional approach has a known limitation: it can't test code that itself manages transactions, or code that depends on a row being visible to a parallel connection (some background-job patterns do this). For those cases, fall back to per-test schemas, each test gets a freshly-created schema with a randomized name, and the schema is dropped at the end. Slower, but it gives you true isolation.

Filesystem, give each test its own temp directory. Don't reach for /tmp/foo.txt. Use the framework's temp-dir fixture, which creates a fresh path per test and cleans it up afterward.

def test_writes_a_file(tmp_path):
    path = tmp_path / "out.txt"
    write_report(path)
    assert path.exists()

func TestWritesAFile(t *testing.T) {
    dir := t.TempDir() // auto-cleaned after the test
    path := filepath.Join(dir, "out.txt")
    if err := WriteReport(path); err != nil {
        t.Fatal(err)
    }
    if _, err := os.Stat(path); err != nil {
        t.Fatalf("expected file at %s", path)
    }
}

import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

let dir: string;
beforeEach(() => {
  dir = mkdtempSync(join(tmpdir(), "writes-"));
});
afterEach(() => {
  rmSync(dir, { recursive: true, force: true });
});

test("writes a file", () => {
  const path = join(dir, "out.txt");
  writeReport(path);
  expect(existsSync(path)).toBe(true);
});

Network ports, let the OS pick. A test that binds to port 8080 will pass on a clean runner and fail the moment a second test in the same job also wants port 8080, or a previous test didn't tear down its server cleanly. Bind to port 0, read back the assigned port, and pass that to clients.

Go server_test.go

func TestServerEcho(t *testing.T) {
    lis, err := net.Listen("tcp", "127.0.0.1:0") // 0 = pick any free port
    if err != nil {
        t.Fatal(err)
    }
    defer lis.Close()

    addr := lis.Addr().String()
    go startServer(lis)

    resp, err := http.Get("http://" + addr + "/echo?msg=hi")
    if err != nil {
        t.Fatal(err)
    }
    defer resp.Body.Close()
    // ... assert on resp ...
}

Globals and environment variables, restore in teardown. If a test mutates process.env.FEATURE_FLAG, save the prior value at the top and restore it in afterEach. The cleanest pattern is a small helper that takes a closure:

TypeScript env-helper.ts

function withEnv(vars: Record<string, string>, fn: () => Promise<void>) {
  const prior = new Map<string, string | undefined>();
  for (const key of Object.keys(vars)) {
    prior.set(key, process.env[key]);
    process.env[key] = vars[key];
  }
  return fn().finally(() => {
    for (const [key, value] of prior) {
      if (value === undefined) delete process.env[key];
      else process.env[key] = value;
    }
  });
}

test("respects FEATURE_X", async () => {
  await withEnv({ FEATURE_X: "1" }, async () => {
    expect(await loadConfig()).toMatchObject({ featureX: true });
  });
});

Caches and singletons, reset before each test. If your application has a module-level cache, expose a reset() for tests and call it in beforeEach. If your DI container holds singletons, build a fresh container per test. Singletons are the most-frequent source of "this test only fails when it runs after X", they outlive the test that put data in them.

The mental model: every test starts from a known-good state, ends without leaving a trace, and could be run a million times in any order without different outcomes. If a test can't survive that, the test is telling you about a bug, usually in the production code, sometimes in itself.

Test Data: Don't Build A House On Sand

Half of all flakes I've seen in real codebases trace back to one of three test-data sins:

The test uses a hardcoded date or ID that was valid when the test was written but no longer is. "User #42 should have 3 orders." Then someone runs a data migration and User #42 has 4 orders. The test was never independent of the seed data, it was bolted on top of it.
The test depends on order, count, or stable iteration that the production code doesn't guarantee. "The first item in the list should be Alice." Then the underlying query swaps to a different index, the natural ordering changes, and Alice is now second. The production behavior wasn't wrong, the test was making an assumption beyond what the contract promised.
The test creates data that another test reads, intentionally or not. "First run the setup that creates the admin user, then run the suite that depends on them." This is a global setup dressed up as a fixture, and the moment someone runs the suite in a different order or in parallel, the whole thing collapses.

The patterns that prevent these:

Build data inside the test. Don't lean on a "fixture file" or a "seed script" that was loaded once and is assumed to still match. Every test that needs a user creates a user, every test that needs three orders creates three orders. It's more code per test. It also means the test is complete, reading it tells you exactly what state the system was in when the assertion ran.

A factory pattern (think factory_bot in Ruby, factory-bot or fishery in TS, factory_boy in Python) is the right abstraction. Each factory call returns a fresh object with sensible defaults and accepts overrides for the fields that matter to this test:

import { Factory } from "fishery";
import type { User, Order } from "../src/types";

export const userFactory = Factory.define<User>(({ sequence }) => ({
  id: sequence,
  email: `user${sequence}@example.com`,
  createdAt: new Date(),
}));

export const orderFactory = Factory.define<Order>(({ sequence, params }) => ({
  id: sequence,
  userId: params.userId ?? userFactory.build().id,
  totalCents: 1000,
  status: "pending",
}));

// in a test
const user = userFactory.build({ email: "alice@example.com" });
const order = orderFactory.build({ userId: user.id, totalCents: 5000 });

import factory
from myapp.models import User, Order

class UserFactory(factory.Factory):
    class Meta:
        model = User
    email = factory.Sequence(lambda n: f"user{n}@example.com")

class OrderFactory(factory.Factory):
    class Meta:
        model = Order
    user = factory.SubFactory(UserFactory)
    total_cents = 1000
    status = "pending"

# in a test
user = UserFactory(email="alice@example.com")
order = OrderFactory(user=user, total_cents=5000)

Pin time and randomness. Tests that touch "now" should freeze it. Tests that consume randomness should seed it. Both let you write assertions that are deterministic without weakening them.

For time: pass a clock into the production code instead of letting it call Date.now() directly. The test injects a fake clock that returns whatever timestamp it wants. For randomness: do the same for the RNG source. Both patterns also pay off in production, code that takes a clock as a dependency is easier to reason about under daylight-saving boundaries, leap seconds, and timezone migration.

If you can't refactor the production code, the second-best option is a framework-level shim. Jest has jest.useFakeTimers(). Sinon has useFakeTimers. Go has the third-party github.com/benbjohnson/clock package. Python has freezegun. Use them.

Assert on what the contract promises, not what the implementation happens to do. If your code returns a list and the contract is "the items, in any order," compare as a set, not as an ordered list. If your code returns a timestamp with millisecond precision and you only care about the second, round. If your code returns a UUID and you only care that it's a valid UUID, match a pattern instead of an exact string.

test("returns all admins", async () => {
  const result = await listAdmins();
  // contract: admins are returned in no particular order
  expect(new Set(result.map((u) => u.email))).toEqual(
    new Set(["alice@example.com", "bob@example.com"])
  );
});

def test_returns_all_admins():
    result = list_admins()
    # contract: admins are returned in no particular order
    assert {u.email for u in result} == {"alice@example.com", "bob@example.com"}

func TestReturnsAllAdmins(t *testing.T) {
    result, err := ListAdmins()
    if err != nil {
        t.Fatal(err)
    }
    got := make(map[string]bool)
    for _, u := range result {
        got[u.Email] = true
    }
    want := map[string]bool{
        "alice@example.com": true,
        "bob@example.com":   true,
    }
    if !reflect.DeepEqual(got, want) {
        t.Errorf("got %v, want %v", got, want)
    }
}

Over-specifying is the most common own-goal in test data. Every extra field in your assertion is a chance for an unrelated change to break the test. "Did this method return the right user?" is usually answered fine by checking id and email; you don't need to assert that createdAt is exactly 2024-01-01T00:00:00.000Z.

Turning on parallel test execution is the single most efficient way to discover every shared-state bug your suite has been quietly tolerating. It's also the single most efficient way to make a previously-stable suite go red for a week while you fix them.

The mechanics matter. Different runners parallelize at different granularities:

Jest runs each test file in its own worker process by default. Tests inside the same file run serially.
Vitest matches Jest's behavior by default; per-test parallelism is opt-in via test.concurrent.
pytest runs serially by default; pytest-xdist distributes test files across workers.
Go runs tests in different packages in parallel by default; tests in the same package run serially unless they explicitly call t.Parallel().
PHPUnit runs serially; paratest distributes test classes across workers.
Playwright runs each file in parallel by default and tests within a file in series unless you opt in.

Notice the pattern: nearly every runner parallelizes at the file or class boundary, not the individual test boundary. This is deliberate, the framework authors learned the hard way that random test-level parallelism breaks too many existing suites. But it also means that two tests in the same file can quietly assume sequential ordering and shared state, and the framework will let them.

The flakes that emerge when you parallelize are almost always one of these:

Two tests bind the same port; one wins, the other fails.
Two tests insert a user with the same email; one inserts cleanly, the other gets a uniqueness violation.
Two tests write to the same temp file; the contents depend on which one wrote last.
A test reads from a singleton cache that another worker mutated.
A test asserts on the count of rows in a table that a parallel worker just doubled.

The fixes are the isolation patterns from earlier: per-test resources, no globals, no fixed identifiers. The trick is finding the broken tests once you've turned parallelism on. The honest advice is to roll it out gradually:

Start with file-level parallelism only. Most suites can handle this, files are usually already self-contained, more or less.
Watch for which files newly flake. These are the ones that were sharing state with another file (usually via a shared database row or a shared temp file).
Fix the shared-state offenders one at a time. Don't try to turn on the whole thing and then debug everything red at once.
Only then consider intra-file parallelism. This is where you'll find the gnarliest bugs, globals, singletons, cached connections that two tests in the same file expected to have to themselves.

A practical trick when adding parallelism: run the suite with the runner's randomize-order flag enabled at the same time. Random order surfaces test-order dependencies that pure parallelism might miss (because parallelism shuffles files but tests within a file still run in declared order). Together they're the fastest way to flush out a suite's hidden assumptions.

Bash pytest randomize and parallelize

# pytest-randomly + pytest-xdist
pytest -p randomly --randomly-seed=last -n auto

Bash jest with random order and serial worker

# Jest doesn't have a true random-order flag for tests within a file,
# but you can shuffle file order with a custom sequencer and pin the seed.
jest --testSequencer=./tests/random-sequencer.js --maxWorkers=4

Bash go random test order

# Go 1.17+: -shuffle randomizes test execution order, with a printed seed
# so a failure can be reproduced exactly.
go test -shuffle=on ./...

The -shuffle=on output looks like this and the seed is the part you grep for when a parallel-only flake fires:

Text

-test.shuffle 1718053123459728000
ok      github.com/example/pkg/users    0.043s

Re-run with -shuffle=1718053123459728000 to reproduce the same order. This is one of the more underrated tools in the Go test toolbox.

Quarantine: The Honest Middle Ground

Sometimes you find a flake, you can't fix it today, and the test is still worth running for its non-flaky signal. The right move isn't to delete it and it isn't to retry it into silence, it's to quarantine it: keep running it, keep tracking its results, but stop letting it block the build.

The quarantine pattern is small and useful. Tag the test as flaky. The CI pipeline runs flaky tests separately, reports their results to a dashboard, but doesn't fail the build when they fail. A weekly job ranks quarantined tests by how often they flake and by how long they've been quarantined, and someone picks the worst offender to actually fix.

The danger is the same as with retries: quarantine becomes a graveyard. Tests that "we'll fix next sprint" sit in quarantine for two years, regressions land that turn them from flaky to permanently broken, and nobody notices because nobody reads the quarantine report anymore. The cure is the same as for retries, make quarantine expire. A test that's been quarantined for more than 60 days either gets fixed or gets deleted. No third option. If a test isn't worth fixing in 60 days, it isn't worth running.

The simplest implementation is a test annotation plus a CI step that splits the run:

TypeScript auth.test.ts

import { test } from "@playwright/test";

// QUARANTINED 2026-04-10 — TICKET-1421 — expires 2026-06-10
test.fixme("logs in via SSO and persists session across reload", async ({ page }) => {
  // ... existing test body ...
});

Python test_auth.py

import pytest

# QUARANTINED 2026-04-10 — TICKET-1421 — expires 2026-06-10
@pytest.mark.quarantined
def test_logs_in_via_sso_and_persists_session():
    ...

A small CI step parses the source for QUARANTINED YYYY-MM-DD markers and fails the build if any are older than 60 days. The team can't ignore the quarantine because the build itself starts pointing at it.

What Good Looks Like

After a couple of months of treating flakes seriously, the difference is visible in three places. The CI failure rate on main drops, not because all the tests are passing but because the tests that fail are reporting real bugs instead of phantom ones. The team starts reading test failures again, because failures mean something. Re-running a failed build becomes rare, then surprising, then mildly suspicious.

The hardest part is the cultural shift, not the technical one. The technical fixes are mostly small and boring, wrap a test in a transaction, replace a sleep() with a waitFor(), freeze the clock, give each test its own port. The cultural fix is convincing a team that "hit re-run" is a smell, not a workflow, and that every minute spent investigating a flake is a minute spent making the next deploy safer.

Tests don't have to be flaky. The default isn't "some flakiness is inevitable in a real-world suite", that's a story we tell ourselves to avoid the work. The default is "every test reports something true about the code, every time, in any order, in parallel." That's achievable. It's not even that expensive, once you stop paying the daily tax of pretending the flakes will fix themselves.

Flaky Tests In CI: Detection And Prevention

What "Flaky" Actually Means

The Four Kinds Of Flake

Detection: Stop Trusting Your Gut

Retries: A Tool, Not A Policy

Isolation: The Boring Fix That Works

Test Data: Don't Build A House On Sand

Quarantine: The Honest Middle Ground

What Good Looks Like

Let’s make something great together

Links

Contacts

What "Flaky" Actually Means

The Four Kinds Of Flake

Detection: Stop Trusting Your Gut

Retries: A Tool, Not A Policy

Isolation: The Boring Fix That Works

Test Data: Don't Build A House On Sand

Parallel Execution: Where Hidden Sharing Surfaces

Quarantine: The Honest Middle Ground

What Good Looks Like

You might also like

Designing CI/CD Pipelines That Developers Trust

AI-Generated Code Needs Better Tests, Not Less Review

Building AI Guardrails Into Development Workflows

Let’s make something great together