4 March 2025
08 Min. Read
How can engineering teams identify and fix flaky tests?
We recently worked with a bunch of beta partners at Trunk to tackle this problem too. While we were building CI + Merge Queue tooling, nearly all of the CI instability and headaches we saw traced back to flaky tests in one way or another.
Basically, tests were flaky because:
1. The test code is buggy
2. The infrastructure code is buggy
3. The production code is buggy
➡️ Problem 1 is the easiest to fix, and most teams that beta our tool end up fixing the common culprits: bad await logic, improper cleanup between tests, etc.
➡️ But problem 2 makes it nearly impossible for most product engineers to fix flaky tests on their own, and problem 3 makes ignoring flaky tests a terrible idea.

That’s just one of many incidents shared on forums like Reddit and Quora. Flaky tests can be caused by a number of things, and you may not be able to reproduce the actual failure locally.
Because chasing every flaky failure is expensive.
It becomes really important that your team spends its time identifying the tests that actually flake frequently and fixing those, rather than trying to chase every flaky event that has ever occurred.
Before we move ahead, let’s get some fundamentals clear, and then discuss the solution we’ve built that can fix your flaky tests for real.
The Impact on Business
A flaky test is one that generates inconsistent results, failing or passing unpredictably, without any modification to the code under test. Unlike reliable tests, which yield the same results consistently, flaky tests create uncertainty.
Flaky tests cost the average engineering organization over $4.3M annually in lost productivity and delayed releases.
| Impact Area | Key Metrics | Industry Average | High-Performing Teams |
| --- | --- | --- | --- |
| Developer Productivity | Weekly hours spent investigating false failures | 6.5 hours/engineer | <2 hours/engineer |
| CI/CD Pipeline | Pipeline reliability percentage | 62% | >90% |
| Release Frequency | Deployment cadence | Every 2-3 weeks | Daily/on-demand |
| Engineering Morale | Team satisfaction with test process (survey) | 53% | >85% |
Causes of Flaky Tests, Especially in the Backend
Flaky tests are a nuisance because they fail intermittently and unpredictably, often under different circumstances or environments. The inability to rely on consistent test outcomes can mask real issues, leading to bugs slipping into production.

Concurrency Issues: These occur when tests are not thread-safe, which is common in environments where tests interact with shared resources like databases or when they modify shared state in memory.
Time Dependency: Tests fail because they assume a specific execution speed or rely on timing intervals (e.g., sleep calls) to coordinate between threads or network calls (see the sketch after this list).
External Dependencies: Relying on third-party services or systems with varying availability or differing responses can introduce unpredictability into test results.
Resource Leaks: Unreleased file handles or network connections from one test can affect subsequent tests.
Database State: Flakiness arises if tests do not reset the database state completely, leading to different outcomes depending on the order in which tests are run.
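To make the time-dependency case concrete, here is a minimal sketch of a flaky test and a deterministic rewrite. The job_queue fixture and job names are hypothetical, purely for illustration.

import time

def test_job_completes_flaky(job_queue):
    job_queue.submit("resize-image")
    time.sleep(2)  # assumes the worker always finishes within 2 seconds
    assert job_queue.status("resize-image") == "done"  # fails on a slow CI runner

def test_job_completes_deterministic(job_queue):
    job_queue.submit("resize-image")
    deadline = time.monotonic() + 30  # poll up to a generous upper bound instead
    while time.monotonic() < deadline:
        if job_queue.status("resize-image") == "done":
            break
        time.sleep(0.1)
    assert job_queue.status("resize-image") == "done"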
Strategies for Identifying Flaky Tests
1️⃣ Automated Test Quarantine: Implement an automated system to detect flaky tests. Any test that fails intermittently should automatically be moved to a quarantine suite and run independently from the main test suite.
# Example of a Python function to detect flaky tests.
# `run_tests` is assumed to return a {test: success_rate} mapping
# collected over repeated runs of the suite.
def quarantine_flaky_tests(test_suite, quarantine_suite, flaky_threshold=0.1):
    results = run_tests(test_suite)
    for test, success_rate in results.items():
        # Anything that does not pass consistently gets quarantined.
        if success_rate < (1 - flaky_threshold):
            quarantine_suite.add_test(test)
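Assuming run_tests repeats the suite enough times to produce a stable per-test success rate, a hypothetical invocation could look like this:

# Hypothetical usage: quarantine anything that passes less than 95% of the time.
quarantine_flaky_tests(main_suite, quarantine_suite, flaky_threshold=0.05)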
2️⃣ Logging and Monitoring: Enhance logging within tests to capture detailed information about the test environment and execution context. This data can be crucial for diagnosing flaky tests (a logging sketch follows the table below).
| Data | Description |
| --- | --- |
| Timestamp | When the test was run |
| Environment | Details about the test environment |
| Test Outcome | Pass/Fail |
| Error Logs | Stack trace and error messages |
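As a sketch of what capturing this data can look like in practice, assuming a pytest setup, a conftest.py hook can emit one structured record per test with exactly the fields above:

# conftest.py -- a minimal sketch of per-test context logging (assumes pytest)
import json
import logging
import platform
import time

import pytest

logger = logging.getLogger("flaky-audit")

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    if report.when != "call":
        return
    logger.info(json.dumps({
        "timestamp": time.time(),                    # when the test was run
        "environment": platform.platform(),          # details about the test environment
        "test": item.nodeid,
        "outcome": report.outcome,                   # passed / failed / skipped
        "error": report.longreprtext if report.failed else None,  # stack trace
    }))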
Debug complex flows without digging into logs: Get full context on every test run. See inputs, outputs, and every step in between. Track async flows, ORM queries, and external calls with deep visibility. With end-to-end traces, you debug issues with complete context before they reach production.

3️⃣ Consistent Environment: Use Docker or another container technology to standardize the testing environment. This consistency helps minimize the "works on my machine" syndrome.
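One way to get there, sketched below under the assumption that your tests hit a real database, is to spin the dependency up in a disposable container per test session (using the testcontainers-python package here; swap in whatever your stack needs):

# A minimal sketch: every run gets an identical, disposable Postgres instance.
import pytest
from testcontainers.postgres import PostgresContainer

@pytest.fixture(scope="session")
def database_url():
    # The container is created fresh for the session and removed afterwards,
    # so no state survives between CI runs or developer machines.
    with PostgresContainer("postgres:16") as pg:
        yield pg.get_connection_url()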
Eliminating the Flakiness
Before attempting fixes, make sure the monitoring described above is in place; then work through the following:
✅ Isolate and Reproduce: Once identified, attempt to isolate and reproduce the flaky behavior in a controlled environment. This might involve running the test repeatedly or under varying conditions to understand what triggers the flakiness.
✅ Remove External Dependencies: Where possible, mock or stub out external services to reduce unpredictability (a minimal sketch follows this list).
Invest in mocks that actually work: our approach automatically mocks every dependency, builds mocks from real user flows, and even auto-updates them as dependencies change their behavior. More about the approach here
✅ Refactor Tests: Avoid tests that rely on real time or shared state. Ensure each test is self-contained and deterministic.
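As referenced above, here is a minimal mocking sketch, assuming the service under test calls an external payment gateway through a module-level client. All names here are hypothetical, not a real library API.

from unittest.mock import patch

from myapp.payments import charge_card  # hypothetical module under test

def test_charge_card_succeeds_without_network():
    # Stub the external gateway so the test never depends on its availability.
    with patch("myapp.payments.gateway.post") as fake_post:
        fake_post.return_value = {"status": "accepted", "id": "txn_123"}
        result = charge_card(amount_cents=500, token="tok_test")
        assert result["status"] == "accepted"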
The HyperTest Advantage for Backend Tests
This is where HyperTest transforms the equation. Unlike traditional approaches that merely identify flaky tests, HyperTest provides a comprehensive solution for backend test stability:
Real API Traffic Recording: Capturing real interactions to ensure test scenarios closely mimic actual use cases, thus reducing discrepancies that can cause flakiness.

Controlled Test Environments: By replaying and mocking external dependencies during testing, HyperTest ensures consistent environments, avoiding failures due to external variability.
Integrated System Testing: Flakiness is often exposed when systems integrate. HyperTest’s holistic approach tests these interactions, catching issues that may not appear in isolation.
Detailed Debugging Traces: Provides granular insights into each step of a test, allowing quicker identification and resolution of the root causes of flakiness.
Proactive Flakiness Prevention: HyperTest maps service dependencies and alerts teams about potential downstream impacts, preventing flaky tests before they occur.
Enhanced Coverage Insight: Offers metrics on tested code areas and highlights parts lacking coverage, encouraging targeted testing that reduces gaps where flakiness could hide.
Shopify's Journey to 99.7% Test Reliability

Key Strategies:
Introduced quarantine workflow
Built custom flakiness detector
Implemented "Fix Flaky Fridays"
Developed targeted libraries for common issues
Results:
Reduced flaky tests from 15% to 0.3%
Cut developer interruptions by 82%
Increased deployment frequency from 50/week to 200+/week
Conclusion: The Competitive Advantage of Test Reliability
Engineering teams that master test reliability gain a significant competitive advantage:
30-40% faster time-to-market for new features
15-20% higher engineer satisfaction scores
50-60% reduction in production incidents
Test flakiness isn't just a technical debt issue—it's a strategic imperative that impacts your entire business. By applying this framework, engineering leaders can transform test suites from liability to asset.
Want to discuss your team's specific flakiness challenges? Schedule a consultation →