AI RESEARCH

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

arXiv CS.LG

arXiv:2606.01695v1 Announce Type: new

Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We