MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

arXiv:2606.02463v1 Announce Type: cross

In 3D environments, embodied agents answer spatial questions by reasoning over a mixture of modalities, including natural language, RGB images, point clouds, depth maps, and camera poses. Existing vision-language models (VLMs) are fine-tuned on a single modality. This ignores the semantics of the question, which may favor a modality other than the one used for fine-tuning.