Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

arXiv:2606.00963v1 Announce Type: cross Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit.