AI RESEARCH

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

arXiv CS.AI

arXiv:2606.03305v1 Announce Type: new Benchmark contamination, where evaluation examples appear in a model's