Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

arXiv:2605.22568v1 Announce Type: cross The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then outline practical directions toward building robust and trustworthy evaluation frameworks.