CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

arXiv:2605.21854v1 Announce Type: new Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g., OpenVLA) and continuous-action flow matching (e.g., pi-0.5). Yet preference alignment via Direct Preference Optimization (DPO) -- the de facto post-