AI RESEARCH
Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure
arXiv CS.AI
arXiv:2605.27996v1 Announce Type: new Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap between audit and policy-induced distributions during mitigation evaluation and policy