Summary
Research on automated program debugging often assumes that the fix location(s) of a bug indicate its root cause and that the root causes of faults lie within single code elements (statements). It is also often assumed that the number of statements a developer must inspect before reaching the first faulty statement reflects debugging effort. Although intuitive, these three assumptions are typically applied (55% of experiments in the surveyed publications make at least one of them) without any consideration of their effect on a debugger's measured effectiveness or their potential impact on developers in practice. To address this issue, we perform controlled experimentation, specifically split testing, using 352 bugs from 46 open-source C programs, 19 Automated Fault Localization (AFL) techniques (18 statistical debugging formulas and dynamic slicing), two state-of-the-art automated program repair (APR) techniques (GenProg and Angelix), and 76 professional developers. Our results show that these assumptions conceal the difficulty of debugging: they make AFL techniques appear (up to 38%) more effective and APR tools appear (2X) less effective. We also find that most developers (83%) consider these assumptions unsuitable for debuggers and, perhaps worse, that they may inhibit development productivity. The majority (66%) of developers prefer debugging diagnoses derived without these assumptions twice as much as those derived with them. Our findings motivate the need to assess debuggers conservatively, i.e., without these assumptions.
FAQ
Short Abstract
Main Objective
A major challenge in automated debugging research is the practical evaluation of debuggers, e.g., automated fault localization (AFL) methods. When evaluating debuggers in the lab, researchers make several experimental assumptions about debugging practice. These include assumptions about the debugging setting, e.g., the bugs, programs, and developers. The most common experimental assumptions are (a) perfect bug understanding (PBU), (b) treating fix locations as the root-cause diagnosis of a bug, and (c) assuming a single fault location. These assumptions may impact the measured effectiveness of debuggers. Moreover, they often do not align with debugging practice (e.g., developers' expectations). Consequently, these assumptions often lead to a mismatch between debugging evaluations in the lab and software practice.
To address these concerns, we conduct a large empirical study that uses controlled experimentation to evaluate the impact of the three aforementioned assumptions on debugging practice.
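To make these assumptions concrete, below is a minimal sketch (in Python, using hypothetical statement identifiers and a hypothetical AFL ranking, not data from our study) of how a rank-based effort metric changes depending on whether fix locations or the actual root-cause locations serve as ground truth, and on whether only the first or all faulty locations must be found:

```python
# Minimal sketch (hypothetical data): how the choice of ground truth and the
# single-fault-location assumption change a rank-based debugging effort metric.

def wasted_effort(ranking, ground_truth):
    """Statements inspected before the first ground-truth statement is reached
    (assumes the developer recognizes the fault on sight, i.e., PBU)."""
    for inspected, stmt in enumerate(ranking):
        if stmt in ground_truth:
            return inspected
    return len(ranking)

def effort_all_faults(ranking, ground_truth):
    """Statements inspected until *all* ground-truth statements are found,
    i.e., without the single-fault-location assumption."""
    remaining = set(ground_truth)
    for inspected, stmt in enumerate(ranking, start=1):
        remaining.discard(stmt)
        if not remaining:
            return inspected
    return len(ranking)

# Hypothetical AFL output: statements ranked by decreasing suspiciousness.
ranking = ["f.c:42", "f.c:10", "g.c:7", "f.c:43", "h.c:3"]

fix_locations = {"f.c:42"}           # where the developer's patch was applied
root_cause    = {"g.c:7", "f.c:43"}  # where the fault actually originates

print(wasted_effort(ranking, fix_locations))    # 0 -> the technique looks perfect
print(wasted_effort(ranking, root_cause))       # 2 -> debugging is harder than it appeared
print(effort_all_faults(ranking, root_cause))   # 4 -> harder still once all faulty
                                                #      locations must be found
```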
Experimental Approach
The main research method employed in this work is controlled experimentation (also known as split or A/B testing). We conduct controlled experiments with both debuggers and developers. Controlled experimentation is widely used by software companies (e.g., Netflix and Google) to test new features and to guide product development and data-driven decisions. To determine the prevalence of each assumption (RQ1), we perform data analysis and an in-depth manual study of the literature and bug datasets.
Here is a workflow of our experimental approach:
(Figure: Workflow of our approach)
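At a high level, the split testing compares each debugger's effectiveness on the same set of bugs under two treatments: with the assumptions and without them. The following is a minimal sketch of such a paired comparison; the effort values are hypothetical and a full evaluation is more involved than this:

```python
# Sketch of an A/B (split-testing) comparison: the same AFL technique is scored
# on the same bugs twice -- once with the assumptions (treatment A) and once
# without them (treatment B). All numbers below are hypothetical.
from statistics import mean, median

effort_with_assumptions    = [0, 3, 1, 5, 2, 0, 4]  # A: fix locations, PBU, single fault
effort_without_assumptions = [2, 7, 1, 9, 6, 3, 4]  # B: root causes, all faulty locations

inflation = [b - a for a, b in zip(effort_with_assumptions,
                                   effort_without_assumptions)]

print("mean effort with/without assumptions:",
      mean(effort_with_assumptions), mean(effort_without_assumptions))
print("median per-bug effort inflation:", median(inflation))
```

Over many bugs and techniques, one would additionally apply a paired statistical test to check whether the difference between the two treatments is significant.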
Prevalence analysis
To evaluate the prevalence of the three assumptions in literature, we surveyed a large number of publications related to the topics of automated fault localization and automated program repair.
(Figure: Venues we examined and collected literature from)
Prevalence in Debugging Literature: We found those assumptions to be highly prevalent in the existing literature: 55% of experiments in the surveyed publications make at least one of the three assumptions.
(Figure: Prevalence in Literature)
Prevalence in Bug Datasets: Similarly, about half (49%) of the bugs in the bug datasets are impacted by at least one of the three assumptions.
(Figure: Prevalence in Bug Datasets)
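To illustrate the kind of dataset analysis this involves: a bug whose fix spans multiple statements already conflicts with the single-fault-location assumption. The sketch below flags such bugs from their fix metadata; the dictionary schema and bug identifiers are hypothetical and do not reflect the artifact's actual format:

```python
# Hypothetical sketch: flag bugs in a dataset that are affected by the
# single-fault-location assumption (i.e., fixes spanning multiple statements).
bugs = [
    {"id": "core-1", "fix_locations": ["util.c:120"]},
    {"id": "core-2", "fix_locations": ["parse.c:88", "parse.c:91", "lex.c:14"]},
    {"id": "sir-7",  "fix_locations": ["sched.c:45", "sched.c:46"]},
]

multi_location = [b["id"] for b in bugs if len(b["fix_locations"]) > 1]
prevalence = len(multi_location) / len(bugs)

print(multi_location)  # ['core-2', 'sir-7']
print(f"{prevalence:.0%} of these bugs violate the single-fault-location assumption")
```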
AFL and APR Experiments
In a controlled experiment, we analyzed the effectiveness of 19 fault localization techniques in the presence and absence of these assumptions, and we analyzed the impact of the assumptions on two popular automated program repair (APR) tools. We employed four bug datasets (CoreBench, SIR, IntroClass and Codeflaws) that cover real, seeded, and mutated bugs in programs of varying complexity and maturity.
(Figure: Benchmarks that were used in the AFL and APR experiments)
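For context, the statistical debugging techniques among the 19 AFL methods rank statements by a suspiciousness score computed from passing and failing test coverage. The sketch below uses the well-known Ochiai formula as one representative example of such a formula (not necessarily one of the 18 we evaluate); the coverage counts are hypothetical:

```python
import math

# Ochiai suspiciousness: failed(s) / sqrt(total_failed * (failed(s) + passed(s)))
def ochiai(failed_cov, passed_cov, total_failed):
    if failed_cov == 0 or total_failed == 0:
        return 0.0
    return failed_cov / math.sqrt(total_failed * (failed_cov + passed_cov))

# Hypothetical coverage: statement -> (failing tests covering it, passing tests covering it)
coverage = {
    "f.c:42": (3, 1),
    "f.c:43": (3, 9),
    "g.c:7":  (1, 0),
    "h.c:3":  (0, 8),
}
TOTAL_FAILED = 3  # total number of failing tests in the hypothetical suite

ranking = sorted(coverage, key=lambda s: ochiai(*coverage[s], TOTAL_FAILED), reverse=True)
for stmt in ranking:
    print(stmt, round(ochiai(*coverage[stmt], TOTAL_FAILED), 3))
# -> f.c:42 0.866, g.c:7 0.577, f.c:43 0.5, h.c:3 0.0
```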
User Study
We conducted a user study with 76 developers to measure the soundness, severity and utility of these assumptions in practice.
In our artifact, we provide the user study questionnaire containing the questions posed to developers about debugging eight (8) buggy programs and 16 debugging diagnoses. We also provide the responses of participants from the user study as well as our analysis of developers’ responses.
Setup and Infrastructure
See how to set up and run the artifact (Artifact README).
- Download the artifact and datasets (MIT licensed)
- Read the full paper (ICSE 2023)
How to cite?
Cite the Paper
@inproceedings{debug-assumptions,
  author    = {Soremekun, Ezekiel and Kirschner, Lukas and B\"{o}hme, Marcel and Papadakis, Mike},
  title     = {Evaluating the Impact of Experimental Assumptions in Automated Fault Localization},
  booktitle = {Proceedings of the ACM/IEEE 45th International Conference on Software Engineering},
  series    = {ICSE 2023},
  pages     = {1-13},
  year      = {2023},
}
Cite the Artifact
@article{Soremekun2023,
  author = "Ezekiel Soremekun and Lukas Kirschner and Marcel Böhme and Mike Papadakis",
  title  = "{Artifact for Evaluating the Impact of Experimental Assumptions in Automated Fault Localization}",
  year   = "2023",
  month  = "1",
  url    = "https://figshare.com/articles/conference_contribution/Debugging_Assumptions_Artifact/21786743",
  doi    = "10.6084/m9.figshare.21786743.v6"
}
Who are we?
- Ezekiel Soremekun, Royal Holloway, University of London (RHUL), Egham, United Kingdom (UK)
- Lukas Kirschner, Saarland University, Saarbrücken, Germany
- Marcel Böhme, Max Planck Institute for Security and Privacy (MPI-SP), Bochum, Germany
- Mike Papadakis, Interdisciplinary Centre for Security, Reliability and Trust (SnT), Luxembourg
Links
The ICSE 2023 presentation slides, the artifact, and the artifact abstract are publicly available.
Previous related works on Automated Debugging: