PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

arXiv:2605.21414v1 Announce Type: cross Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding, capabilities that are essential for precise and robust manipulation in 3D environments.