GesVLA: Gesture-Aware Vision-Language-Action Model with Embedded Representations

arXiv:2605.22812v1 Announce Type: cross Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems rely primarily on textual instructions and struggle to resolve spatial ambiguity in complex scenes that contain multiple similar objects. To address this limitation, we