Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

arXiv:2606.00095v1 Announce Type: cross

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: VLMs excel at language and 2D visual understanding, but they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions. The result is unreliable navigation, particularly in zero-shot settings.