DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

arXiv:2605.28544v1 Announce Type: new

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited to driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy.