AI RESEARCH

Stop Comparing LLM Agents Without Disclosing the Harness

arXiv CS.AI

arXiv:2605.23950v1 Announce Type: new

This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness is often a stronger determinant of agent performance than the model it wraps. The harness is the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols.