WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

ArXi:2605.29341v1 Announce Type: new Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use.