Abstract: Fuzzing is a widely used automated testing technique that feeds random inputs to a program to provoke crashes, which indicate potential security vulnerabilities. A difficult but important question is when to stop a fuzzing campaign. Usually, a campaign is terminated when the number of crashes and/or covered code elements has not increased for a certain amount of time. To avoid premature termination in cases where a ramp-up time is needed before vulnerabilities are reached, code coverage is often preferred over crash count as the stopping criterion in practice. However, both code coverage and crash count tend to overestimate fuzzing effectiveness: a campaign may increase coverage only on non-security-critical code, or trigger the same crashes over and over again. This can lead to an unnecessarily lengthy and thus cost-intensive testing process. In this paper, we explore the trade-off between saved fuzzing time and missed bugs when stopping campaigns based on the saturation of covered, potentially vulnerable functions rather than triggered crashes or regular function coverage. In a large-scale empirical evaluation comprising 30 open-source C programs with a total of 240 security bugs and 1,280 fuzzing campaigns, we first show that training binary classification models on software with known vulnerabilities (CVEs), using lightweight ML features derived from static application security testing (SAST) tool findings and proven software metrics, enables highly reliable prediction of (potentially) vulnerable functions. Second, we show that our proposed stopping criterion, compared to the saturation of crashes and regular function coverage, allows terminating 24-hour fuzzing campaigns 6-12 hours earlier while missing fewer than 0.5 bugs on average.
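The saturation idea behind the proposed stopping criterion can be illustrated with a minimal sketch (not the authors' implementation; the function name, sampling scheme, and `patience` parameter are assumptions for illustration): the campaign stops once the cumulative count of covered, potentially vulnerable functions has not grown for a fixed number of consecutive observations.

```python
# Illustrative sketch of a saturation-based stopping check, assuming the
# fuzzer periodically reports the cumulative number of covered,
# potentially vulnerable functions. Not the paper's actual implementation.

def should_stop(coverage_samples, patience):
    """Return True if coverage has saturated.

    coverage_samples: cumulative counts of covered (potentially
    vulnerable) functions, sampled at regular time intervals.
    patience: number of consecutive samples without growth that
    counts as saturation (a hypothetical tuning parameter).
    """
    if len(coverage_samples) <= patience:
        return False  # not enough observations yet
    # Saturated if the latest sample shows no gain over the value
    # observed `patience` samples earlier.
    return coverage_samples[-1] <= coverage_samples[-1 - patience]
```

For example, `should_stop([1, 3, 5, 5, 5, 5], 3)` reports saturation, while `should_stop([1, 3, 5, 6], 3)` does not, since coverage is still growing. The same check could be applied to crash counts or regular function coverage for the baseline criteria compared in the paper.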