VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

arXiv:2605.22570v1

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world, so evaluating it precisely has become an essential challenge. However, existing benchmarks for spatio-temporal reasoning rely primarily on static image sets or passively curated video data, which limits how well they can evaluate fine-grained reasoning capabilities. In this paper, we