AI RESEARCH

Auditing LLM Benchmarks with Item Response Theory

arXiv CS.CL

arXiv:2605.30504v1 Announce Type: new

LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We