ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

arXiv cs.AI

arXiv:2605.20251v2 Announce Type: cross. Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, outcome-based metrics provide limited visibility into the execution process and often miss defects that arise during it. We present ProcBench, a benchmark for execution-process evaluation in LLM coding agents. ProcBench organizes recurrent execution defects into a reusable ontology covering 11 defect types in 4 categories, and evaluates agent trajectories through standardized process evidence rather than final outcomes alone.