AI RESEARCH
When Autoregressive Consistency Hurts Safety Alignment
arXiv CS.LG
•
ArXi:2606.04168v1 Announce Type: new Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model's behavior near the first few output tokens. We argue that this phenomenon can be understood through autoregressive consistency, the tendency of next-token prediction to preserve and extend the current response trajectory consistently.