GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

arXiv:2605.22558v1 Announce Type: new Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens.