SeeTraceAct: Visibility-Aware Latent Planning from Cross-Embodiment Demonstration Videos

arXiv:2606.02745v1 Announce Type: cross Vision-language-action models (VLAs) are promising general-purpose robot policies, but adapting them to new tasks typically requires costly task-specific teleoperation data. As an alternative, we study one-shot demo-conditioned VLAs, where a robot policy is conditioned on a single demonstration video of an unseen task. We find that existing end-to-end approaches often struggle when successful execution requires precisely localizing small target regions.