AI RESEARCH
Auditing LLM Benchmarks with Item Response Theory
arXiv CS.CL
arXiv:2605.30504v1 Announce Type: new
LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We