AI RESEARCH
When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming
arXiv cs.LG
arXiv:2606.03238v1 Announce Type: new Reinforcement learning from human feedback (RLHF) makes large-scale post-