LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

arXiv:2605.22089v1 Announce Type: new

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning.