Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

arXiv:2606.04923v1 Announce Type: cross Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs against rubrics and uses those scores as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe