Causal-JEPA: Learning World Models through Object-Level Latent Masking

arXiv:2602.11389v2 Announce Type: replace

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations.
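
To make the idea of masking at the object level concrete, the following is a minimal sketch and not the authors' implementation. It assumes each object is already encoded as one latent vector (a "slot"), that masked objects are replaced by a mask token, and that the loss is a mean squared error in latent space over the masked objects only. The names `mask_object_slots` and `masked_latent_loss`, the mask ratio, and the stand-in predictor are all hypothetical; the abstract does not specify any of them.

```python
import numpy as np

def mask_object_slots(slots, mask_ratio, mask_token, rng):
    """Replace a random subset of object slots with a mask token (assumed scheme).

    slots: array of shape (num_objects, dim), one latent vector per object.
    Returns the masked copy and a boolean array marking the masked objects.
    """
    num_objects = slots.shape[0]
    num_masked = max(1, int(round(mask_ratio * num_objects)))
    masked_idx = rng.choice(num_objects, size=num_masked, replace=False)
    is_masked = np.zeros(num_objects, dtype=bool)
    is_masked[masked_idx] = True
    masked_slots = slots.copy()
    masked_slots[is_masked] = mask_token
    return masked_slots, is_masked

def masked_latent_loss(predicted, target, is_masked):
    """Mean squared error in latent space, computed on masked objects only."""
    diff = predicted[is_masked] - target[is_masked]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
slots = rng.normal(size=(6, 8))   # 6 hypothetical objects, 8-dim latents
mask_token = np.zeros(8)          # a learned vector in practice; zeros here
masked_slots, is_masked = mask_object_slots(slots, 0.5, mask_token, rng)

# Stand-in predictor (assumption): fill each masked slot with the mean of the
# visible slots. A real model would use a learned network here.
context = masked_slots[~is_masked].mean(axis=0)
predicted = np.where(is_masked[:, None], context, masked_slots)

loss = masked_latent_loss(predicted, slots, is_masked)
```

The point of the sketch is the unit of masking: whole object latents are hidden and predicted from the remaining objects, rather than individual image patches.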