Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?

arXiv:2510.10541v2 Announce Type: replace-cross Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs). Despite recent benchmark gains reported for RL, we find that