SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

arXiv:2605.26548v1 Announce Type: cross Large language models (LLMs) now automate software security tasks, including vulnerability discovery and proof-of-concept (PoC) generation. Existing benchmarks do not faithfully evaluate LLMs in real-world bug-hunting scenarios because they rely on fuzzing harnesses, target-specific descriptions, or vulnerability-reproduction tasks. We present SEC-bench Pro, a benchmark for measuring agent bug hunting on critical, high-complexity software systems.