Factored Latent Action World Models

arXiv:2602.16229v2

Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper