BarrierSteer: LLM Safety via Learning Barrier Steering

arXiv:2602.20102v2 Announce Type: replace-cross Despite the strong performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a significant obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and theoretically grounded. In this paper, we