AI RESEARCH
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
arXiv CS.AI
arXiv:2510.00915v4 Announce Type: replace-cross Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably