Flakewatch: Automated Categorization and Detection of Flaky Tests At Scale

Authors

  • Nate Levin, Department of Computer Science, George Mason University, Fairfax, VA
  • August Shi, Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX
  • Wing Lam, Department of Computer Science, George Mason University, Fairfax, VA

Abstract

Automated testing is essential for developing stable and effective software, as it enables developers to make changes confidently while minimizing the risk of regressions. However, automated testing suffers from flaky tests, i.e., tests that both pass and fail nondeterministically on the same version of code. Flaky tests can be categorized by their source of nondeterminism, including dependence on test order, false assumptions about the execution environment, and other sources of differing behavior between executions. While multiple techniques exist for detecting specific types of flaky tests, no existing tool combines these detectors to perform categorization. Additionally, while large datasets of flaky tests exist, the tests in them are often outdated, making their failures difficult to reproduce. We present Flakewatch, an automated tool that creates a live dataset of categorized flaky tests based on the output of existing detection tools. By combining the output of multiple detection tools, Flakewatch helps avoid false negatives, where a truly flaky test is incorrectly marked as not flaky. Given a project, Flakewatch continuously pulls new commits and runs detectors on tests that were added or modified. Flakewatch then updates a live database of active flaky tests, including any that failed in continuous integration. Flakewatch furthers the development of automated flaky-test detection tools and may help improve understanding of the root causes of, and solutions to, flaky tests.
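The abstract's workflow (pull new commits, find added or modified tests, run detectors on them, and update a live database) can be illustrated with a minimal sketch. The detector commands, repository path, database schema, and polling interval below are illustrative assumptions, not part of Flakewatch itself.

```python
# Minimal sketch of the commit-polling workflow described in the abstract.
# All names (detector commands, repository path, database schema) are
# illustrative assumptions, not Flakewatch's actual implementation.
import sqlite3
import subprocess
import time

REPO_DIR = "project"                      # hypothetical local clone of the watched project
DB_PATH = "flaky_tests.db"                # hypothetical live database of categorized flaky tests
DETECTORS = ["detector-a", "detector-b"]  # placeholders for existing detection tools
POLL_INTERVAL_SECONDS = 300


def git(*args: str) -> str:
    """Run a git command in the watched repository and return its output."""
    return subprocess.run(["git", "-C", REPO_DIR, *args],
                          capture_output=True, text=True, check=True).stdout.strip()


def changed_test_files(old_rev: str, new_rev: str) -> list[str]:
    """List test files added or modified between two commits."""
    out = git("diff", "--name-only", f"{old_rev}..{new_rev}")
    return [f for f in out.splitlines() if "test" in f.lower()]


def run_detectors(test_file: str) -> list[str]:
    """Run each detector on a test file; collect the categories they report."""
    categories = []
    for detector in DETECTORS:
        result = subprocess.run([detector, test_file], capture_output=True, text=True)
        if result.returncode != 0:  # assume a nonzero exit means "flakiness detected"
            categories.append(f"{detector}:{result.stdout.strip()}")
    return categories


def record(db: sqlite3.Connection, test_file: str, commit: str, categories: list[str]) -> None:
    """Insert rows into the live flaky-test database for each reported category."""
    for category in categories:
        db.execute(
            "INSERT INTO flaky_tests (test_file, commit_sha, category) VALUES (?, ?, ?)",
            (test_file, commit, category),
        )
    db.commit()


def main() -> None:
    db = sqlite3.connect(DB_PATH)
    db.execute("CREATE TABLE IF NOT EXISTS flaky_tests "
               "(test_file TEXT, commit_sha TEXT, category TEXT)")
    last_rev = git("rev-parse", "HEAD")
    while True:
        git("pull", "--quiet")
        new_rev = git("rev-parse", "HEAD")
        if new_rev != last_rev:
            for test_file in changed_test_files(last_rev, new_rev):
                categories = run_detectors(test_file)
                if categories:
                    record(db, test_file, new_rev, categories)
            last_rev = new_rev
        time.sleep(POLL_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```

The key design point the abstract emphasizes is combining multiple detectors: in the sketch, a test is recorded if any detector flags it, which is what reduces false negatives relative to relying on a single detector.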

Published

2024-10-13

Section

College of Engineering and Computing: Department of Computer Science