AI RESEARCH
Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation
arXiv CS.LG
arXiv:2605.24535v1 Announce Type: cross Jailbreak prompts can trigger harmful completions from aligned LLMs. In response, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited