AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

arXiv:2605.22816v1 Announce Type: cross Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene.