SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

ArXi:2605.31148v1 Announce Type: cross Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we.