AI RESEARCH
The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
arXiv CS.AI
arXiv:2606.03305v1 Announce Type: new Benchmark contamination, where evaluation examples appear in a model's