AI RESEARCH

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

arXiv CS.LG

ArXi:2605.25189v1 Announce Type: new Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we.