Localization then Neutralization: Gradient-guided Token Suppression against Visual Prompt Injection Attack

arXiv:2605.25194v1 Announce Type: new
Adversarial images pose a severe security threat to multimodal large language models through prompt injection. Existing defenses largely lack a principled understanding of the underlying mechanisms and struggle to balance efficiency and defense utility. In this work, we show that successful adversarial attacks do not rely on the entire image uniformly but instead depend on a small subset of critical image tokens.