SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

arXiv:2605.23898v1 Announce Type: new Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception.