LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]

I built CVE-Bench: 20 real-world CVEs across 18 Python projects (Pillow, GitPython, yt-dlp, urllib3, and others), 5 frontier models, 3 prompt conditions, 300 runs total. Each agent runs in a sandboxed container and is scored against a hidden test_security.py derived from the maintainer's own fix. Scoring is binary pass/fail (a 90%-patched vulnerability is still a vulnerability). To better understand failure modes, I tested three prompt conditions: advisory (full GHSA report), diagnose (exploit description only, no file or function), and locate (exact file and function, no description of the flaw).
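
For anyone who wants the run count and the scoring rule spelled out, here is a minimal sketch, not the actual harness. The CVE and model identifiers are placeholders I made up for illustration, and `run_matrix` and `score` are hypothetical names. It only shows how 20 CVEs x 5 models x 3 conditions gives 300 runs, and what binary scoring could look like if a run passes only when every hidden security test passes.

```python
from itertools import product

# Placeholder identifiers (assumptions): the real CVE IDs and model names
# are not reproduced here.
CVES = [f"cve_{i:02d}" for i in range(20)]
MODELS = [f"model_{i}" for i in range(5)]
CONDITIONS = ["advisory", "diagnose", "locate"]


def run_matrix():
    """Enumerate every (cve, model, condition) combination, one per run."""
    return list(product(CVES, MODELS, CONDITIONS))


def score(security_test_results):
    """Binary score: pass only if all hidden security tests pass.

    `security_test_results` is a list of booleans, one per test case.
    An empty list counts as a failure, since nothing was verified.
    """
    return bool(security_test_results) and all(security_test_results)


if __name__ == "__main__":
    print(len(run_matrix()))            # total number of runs
    print(score([True, True, False]))   # partially patched counts as a fail
```

Treating an empty result list as a failure is a design choice in this sketch: a run that produced no security test results has not shown the vulnerability is closed.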