Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

arXiv:2505.17015v2 Announce Type: replace-cross

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception.