Aligning Language Model Benchmarks with Pairwise Preferences

arXiv:2602.02898v2 Announce Type: replace

Language model benchmarks are pervasive and computationally efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we