Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

arXiv:2605.26526v1 Announce Type: new Recent defenses for safeguarding open-weight large language models (LLMs) are intended to prevent adversarial usage. Underlying these defenses is an assumption that new harmful behavior is learned through fine-tuning rather than elicited by jailbreaking the model.