When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

arXiv:2605.27932v1 Announce Type: cross Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why.