AI RESEARCH

How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

arXiv CS.CL

ArXi:2507.19219v2 Announce Type: replace Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model