Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

arXiv:2605.30916v1

AI benchmarks have well-documented limitations, with prior work examining contamination, saturation, and construct underspecification. Aggregation has received far less attention: benchmarks are typically summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable.