FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

arXiv:2605.27284v1 Announce Type: cross Vision-Language-Action (VLA) models are increasingly expected not only to complete robot tasks but also to follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as the active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We