AI RESEARCH

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

arXiv CS.LG

ArXi:2605.24535v1 Announce Type: cross Jailbreak prompts can trigger harmful completions on aligned LLMs, In accordance, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited