Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

arXiv:2606.03988v1 Announce Type: new

Vision-language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation.