Contrastive Representation Regularization for Vision-Language-Action Models

arXiv:2510.01711v3 Announce Type: replace-cross Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address the issue, we