Debugging Flaky Browser Tests

Flaky browser tests are expensive because they steal trust. Once a team believes failures are random, people stop investigating them carefully. The build turns red, someone reruns it, and the real product issue may disappear into the noise.

The way out is not to add blind retries everywhere. Retries can be useful, but only after the test captures enough evidence to explain the failure. A good debugging workflow turns “it failed sometimes” into a specific category: timing, locator drift, test data collision, environment instability, or a real product bug.

Capture Evidence Before Changing Code

When a test flakes, resist the urge to increase a timeout right away. First, collect what happened. You need the failed step, a screenshot, the page URL, the visible text near the target area, console errors, failed network requests, and timing information around waits and navigation.

Without that evidence, every fix is a guess. With it, patterns appear. Maybe the button never rendered. Maybe the API returned a 500. Maybe the page navigated twice. Maybe an animation covered the target for 300 milliseconds. Each cause needs a different fix.
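As a rough sketch of how captured evidence can drive a first guess at the cause, the snippet below models the evidence as a plain object and sorts a failure into a category. The field names and the `classifyFailure` heuristics are hypothetical and not part of any tool's API; real rules would come from your own failure history.

```javascript
// Hypothetical evidence record and classifier; all names are illustrative.
function classifyFailure(evidence) {
  // A 5xx response points at the backend or environment, not at test timing.
  if (evidence.failedRequests.some((request) => request.status >= 500)) {
    return "environment-or-product";
  }
  // More than one navigation for a single step suggests a redirect or double load.
  if (evidence.navigations.length > 1) {
    return "unexpected-navigation";
  }
  // The target never appeared: likely locator drift or a render failure.
  if (!evidence.targetRendered) {
    return "locator-or-render";
  }
  // The target existed but was covered, for example by an animation.
  if (evidence.targetCovered) {
    return "timing";
  }
  return "unknown";
}

const evidence = {
  failedStep: "click 'Create report'",
  url: "https://app.example.test/reports",
  consoleErrors: [],
  failedRequests: [{ url: "/api/reports", status: 500 }],
  navigations: ["https://app.example.test/reports"],
  targetRendered: true,
  targetCovered: false,
};

console.log(classifyFailure(evidence)); // environment-or-product
```

The order of the checks is a design choice: server errors are tested first because they make every later symptom unreliable.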

Separate Slow From Incorrect

A common mistake is treating every intermittent failure as a timeout problem. Sometimes the app is simply slow and the test needs a smarter wait. But sometimes the app entered the wrong state and no amount of waiting should make the test pass.

Use waits that describe readiness, not arbitrary time. Wait for a user-visible label, a stable URL, a completed network response, or a loading indicator to disappear. Avoid sleeping for a fixed duration unless you are isolating a bug during investigation.

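// Trigger the action that starts report generation.
await orbit.click('Create report');
// Wait for the user-visible readiness label, with a generous upper bound.
await orbit.waitForText('Report ready', 30000);
// Assert only after the page has reached the expected state.
expect(await orbit.hasText('Download CSV')).toBe(true);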

This is stronger than waiting five seconds and hoping the report finished. It describes the state the user needs before the next action makes sense.
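To separate "slow" from "incorrect" in the wait itself, one option is a polling helper that also watches for a state that waiting cannot fix. The following is a minimal sketch, not an OrbitTest API: `waitForReady`, `isReady`, and `isBroken` are hypothetical names, and the two callbacks stand in for whatever checks your framework provides.

```javascript
// Poll until `isReady` returns true, but stop early if `isBroken` reports
// a state that waiting cannot fix (for example, an error banner).
async function waitForReady({ isReady, isBroken, timeoutMs = 5000, intervalMs = 50 }) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await isBroken()) {
      throw new Error("App entered an error state; waiting longer will not help");
    }
    if (await isReady()) {
      return;
    }
    // Pause briefly between polls instead of spinning.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out after ${timeoutMs} ms waiting for readiness`);
}

// Usage with a simulated app that becomes ready after about 100 ms.
const startedAt = Date.now();
waitForReady({
  isReady: async () => Date.now() - startedAt > 100,
  isBroken: async () => false,
}).then(() => console.log("ready"));
```

The two error messages matter as much as the wait: a timeout says the app was slow, while the early exit says the app was wrong.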

Check Locator Quality

Flakes often come from locators that accidentally match the wrong element. A selector such as .primary-button may match three buttons. A positional XPath may change when a banner appears. A test ID may be reused across repeated rows.

Intent-first locators reduce this risk because they start with user-facing labels and roles. When text is ambiguous, add context instead of falling back to a fragile selector. If the page has three “Edit” buttons, find the card or row first, then click the edit action inside that context.
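The "context first, then action" idea can be illustrated without a browser. The sketch below uses a simplified in-memory stand-in for a page and a hypothetical `findUnique` helper that refuses to proceed unless exactly one element matches; it is not a real locator API.

```javascript
// Return the single item matching `predicate`, or throw if zero or several match.
function findUnique(items, predicate, description) {
  const matches = items.filter(predicate);
  if (matches.length !== 1) {
    throw new Error(`Expected exactly 1 match for ${description}, found ${matches.length}`);
  }
  return matches[0];
}

// Simplified stand-in for a rendered page: three rows, each with an "Edit" button.
const rows = [
  { title: "Invoice A", buttons: [{ label: "Edit", id: "edit-a" }, { label: "Delete", id: "delete-a" }] },
  { title: "Invoice B", buttons: [{ label: "Edit", id: "edit-b" }, { label: "Delete", id: "delete-b" }] },
  { title: "Invoice C", buttons: [{ label: "Edit", id: "edit-c" }, { label: "Delete", id: "delete-c" }] },
];

// Context first: locate the row by its user-facing title, then the action inside it.
const row = findUnique(rows, (r) => r.title === "Invoice B", 'row "Invoice B"');
const editButton = findUnique(row.buttons, (b) => b.label === "Edit", '"Edit" inside the row');

console.log(editButton.id); // edit-b
```

Searching for "Edit" across all rows at once would throw here because three buttons match, which is the failure you want to see during test authoring and not as a flake in CI.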

Stabilize Test Data

Many flakes are not browser problems at all. They are data problems. Two tests use the same account. A previous run leaves a project behind. A nightly cleanup job removes records while CI is still running. A test assumes today’s date but runs around midnight in another timezone.

Good browser tests create unique data when they can, clean up after themselves when practical, and avoid depending on shared mutable records. When shared state is unavoidable, name it clearly and protect it from parallel runs.
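One way to avoid collisions is to give every record a name that is unique per run and per test. This is a minimal sketch for Node.js; the `uniqueName` helper and the `CI_RUN_ID` environment variable are assumptions for illustration, so substitute whatever run identifier your CI system actually exposes.

```javascript
const { randomUUID } = require("node:crypto");

// Build a name that is unique per run and per call, and still readable in
// the UI when someone needs to trace a leftover record back to its run.
// CI_RUN_ID is a hypothetical variable name, not a standard one.
function uniqueName(prefix, runId = process.env.CI_RUN_ID || "local") {
  const suffix = randomUUID().slice(0, 8);
  return `${prefix}-${runId}-${suffix}`;
}

const projectName = uniqueName("project");
const accountEmail = `${uniqueName("user")}@example.test`;

console.log(projectName, accountEmail);
```

Keeping the run identifier in the name also makes cleanup simpler, because a later job can delete everything that carries the prefix of a finished run.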

Use Retries as a Diagnostic Tool

Retries are not evil. They are useful when infrastructure occasionally fails, a cloud browser disconnects, or a test depends on an external system with known transient behavior. But retries should be visible in reports. A test that passes only on retry is still telling you something.
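If your runner does not already report retries, a thin wrapper can make them visible. The sketch below is a hypothetical `runWithRetries` helper, not a feature of any particular framework; it records how many attempts were needed and keeps the error from every failed attempt.

```javascript
// Run `testFn` up to `maxAttempts` times and report what happened on the way.
async function runWithRetries(name, testFn, maxAttempts = 2) {
  const errors = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await testFn();
      // A pass on attempt 2 or later is flagged so reports can surface it.
      return { name, passed: true, attempts: attempt, passedOnRetry: attempt > 1, errors };
    } catch (error) {
      // Keep every failure message; the first one is often the most useful.
      errors.push(String((error && error.message) || error));
    }
  }
  return { name, passed: false, attempts: maxAttempts, passedOnRetry: false, errors };
}

// Usage: a simulated test that fails once, then passes.
let calls = 0;
runWithRetries("export report", async () => {
  calls += 1;
  if (calls === 1) throw new Error("browser disconnected");
}).then((result) => console.log(result));
```

The `passedOnRetry` flag is the important part: it keeps a green build from hiding the first failure.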

Track retry rate by test. If one test retries often, it deserves investigation. If the whole suite retries often, the environment may be overloaded or the app may not expose reliable readiness signals.
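Retry rate per test is simple to compute once each run records its attempt count. The sketch below assumes a hypothetical list of run records with `name` and `attempts` fields; the data is made up for illustration.

```javascript
// Compute the share of runs that needed at least one retry, per test.
function retryRateByTest(runs) {
  const stats = new Map();
  for (const run of runs) {
    const entry = stats.get(run.name) || { total: 0, retried: 0 };
    entry.total += 1;
    if (run.attempts > 1) entry.retried += 1;
    stats.set(run.name, entry);
  }
  // Sort so the tests that retry most often come first.
  return [...stats]
    .map(([name, s]) => ({ name, runs: s.total, retryRate: s.retried / s.total }))
    .sort((a, b) => b.retryRate - a.retryRate);
}

// Illustrative history: "checkout" retried in 2 of 4 runs, "login" never.
const history = [
  { name: "checkout", attempts: 1 },
  { name: "checkout", attempts: 2 },
  { name: "login", attempts: 1 },
  { name: "checkout", attempts: 3 },
  { name: "checkout", attempts: 1 },
  { name: "login", attempts: 1 },
];

console.log(retryRateByTest(history));
```

With this sample data, `checkout` comes first with a retry rate of 0.5 and `login` follows with 0.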

Look at CI Conditions

A browser test that never fails locally but fails in CI may be exposing resource pressure. CI machines often have fewer CPU cores, slower disk, different fonts, headless rendering, tighter network rules, or parallel jobs competing for memory. Capture machine details such as CPU count and available memory, and compare how long the same test takes locally and in CI.
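A small snapshot of the machine, logged at the start of every run, makes local and CI results comparable. This sketch uses the Node.js `os` module; the `machineSnapshot` name is hypothetical, and treating a `CI` environment variable as the CI signal is an assumption that holds for many providers but should be checked for yours.

```javascript
const os = require("node:os");

// Collect basic machine facts to attach to every test report.
function machineSnapshot() {
  const bytesPerMb = 1024 * 1024;
  return {
    // Assumption: the CI provider sets a CI environment variable.
    ci: Boolean(process.env.CI),
    platform: os.platform(),
    arch: os.arch(),
    cpuCount: os.cpus().length,
    totalMemoryMb: Math.round(os.totalmem() / bytesPerMb),
    freeMemoryMb: Math.round(os.freemem() / bytesPerMb),
    nodeVersion: process.version,
  };
}

console.log(JSON.stringify(machineSnapshot(), null, 2));
```

Logging the same fields locally and in CI lets you see quickly whether a flaky run had fewer cores or less free memory than a passing one.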

Run the flaky test in isolation, then under parallel load. If it fails only under load, the product may have race conditions, or the test may rely on timing that only works on a fast machine.
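The isolation-versus-load comparison can be scripted. The sketch below is a simplified illustration: `countFailures` and `makeSimulatedTest` are hypothetical helpers, and the simulated test fails whenever another copy of itself is running, which stands in for a shared-resource race.

```javascript
// Run `testFn` `runs` times, either one at a time or all at once,
// and return how many runs failed.
async function countFailures(testFn, runs, { parallel }) {
  const attempt = () => testFn().then(() => 0, () => 1);
  if (parallel) {
    const results = await Promise.all(Array.from({ length: runs }, () => attempt()));
    return results.reduce((sum, failed) => sum + failed, 0);
  }
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    failures += await attempt();
  }
  return failures;
}

// Simulated test with a race: it fails if another copy is still in flight.
function makeSimulatedTest() {
  let inFlight = 0;
  return async function simulatedTest() {
    inFlight += 1;
    try {
      await new Promise((resolve) => setTimeout(resolve, 5));
      if (inFlight > 1) throw new Error("shared resource was busy");
    } finally {
      inFlight -= 1;
    }
  };
}

(async () => {
  const isolated = await countFailures(makeSimulatedTest(), 5, { parallel: false });
  const underLoad = await countFailures(makeSimulatedTest(), 5, { parallel: true });
  console.log({ isolated, underLoad });
})();
```

A test that is clean in isolation but fails under load, as the simulated one does, points at shared state or timing assumptions, not at the locator or the assertion.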

Turn Fixes Into Rules

Every flaky test fix should improve future tests. If the cause was a weak locator, add a team rule for locator design. If the cause was missing readiness, add a helper that waits for the app shell. If the cause was shared data, update the fixture pattern.

The best flake work compounds. Over time, the suite becomes quieter because each fix removes an entire class of future mistakes.

The Standard to Aim For

A healthy browser suite can still fail. The difference is that failures are explainable. When a build is red, the report should make it obvious whether the product broke, the environment broke, or the test is no longer describing the right user behavior.

That is why OrbitTest focuses on readable actions, screenshots, traces, smart reports, and CI-friendly output. Flaky tests are part of real automation work. Guessing does not have to be.