AI RESEARCH
Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics
arXiv CS.AI
arXiv:2506.06006v3
Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predict the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning