How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

arXiv:2605.27310v1 Announce Type: new Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation.