GEM: Generative Supervision Helps Embodied Intelligence

arXiv:2605.28548v1 Announce Type: new Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-