The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

arXiv:2605.30888v1 Announce Type: new Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotators or judge models. This bottleneck becomes dramatically worse as the policy evolves beyond the static