Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

arXiv:2605.25620v1 Announce Type: new

World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control.