AI RESEARCH
MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models
arXiv CS.AI
•
ArXi:2606.01060v1 Announce Type: cross Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting behavior-level evaluation alone is incomplete. Post-