Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection

Niklas Risse, Marcel Böhme

PDF | Code | arXiv

TL;DR: Recent literature on machine learning for vulnerability detection (ML4VD) consistently frames the task as binary classification: given a function in isolation, decide whether it contains a security vulnerability. Our analysis shows that this framing overlooks necessary context: whether a function is vulnerable often depends on how it is called, and that calling context is not part of the input. As a result, the misleadingly high accuracies reported on current datasets can be achieved by exploiting spurious correlations rather than by detecting genuine vulnerabilities. We argue that the ML4VD problem statement should be redefined and that better benchmarks are needed to measure whether techniques actually identify real security flaws.
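
To make the context problem concrete, here is a minimal, hypothetical C sketch; the function names and buffer size are invented for this illustration and are not taken from the paper or its datasets:

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64

/* Copies `len` bytes of `src` into a fixed-size stack buffer.
 * Judged in isolation, this function has no definite label:
 * the memcpy overflows `buf` if and only if some caller can
 * pass len > BUF_SIZE. */
void copy_message(const char *src, size_t len) {
    char buf[BUF_SIZE];
    memcpy(buf, src, len);
    /* ... use buf ... */
}

/* Caller A: bounds-checks len, so copy_message is safe in this context. */
void handle_checked(const char *msg, size_t len) {
    if (len <= BUF_SIZE)
        copy_message(msg, len);
}

/* Caller B: forwards attacker-controlled len unchecked,
 * so the very same function is exploitable in this context. */
void handle_unchecked(const char *msg, size_t len) {
    copy_message(msg, len);
}
```

A dataset that assigns `copy_message` a single label without its callers therefore cannot tell whether a high-scoring classifier detects the flaw or merely picks up on spurious features.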