CTFExplorer: Evaluating LLM Offensive Agents Through Multi-Target Web CTF Benchmarking

arXiv:2602.08023v3 Announce Type: replace-cross Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we