AI RESEARCH
Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards
arXiv CS.CL
arXiv:2605.31328v1 Announce Type: new Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcement learning (RL) is limited to large, closed-source models, leaving the phenomenon expensive to study and difficult to reproduce. We characterize EM from RL in small, off-the-shelf open-weight models along three axes.