AI RESEARCH
Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control
arXiv cs.LG
arXiv:2602.07340v2 Announce Type: replace

Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in the alignment data and overlook the fragility that optimization itself induces in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization-geometry perspective and argue that robustness failures cannot be addressed by data-centric methods alone.