From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

arXiv:2605.22671v1 Announce Type: new

Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution-alignment, leading to inconsistent behaviors in complex scenarios.