Benchmarking Visual State Tracking in Multimodal Video Understanding

ArXi:2606.03920v1 Announce Type: new Understanding a video requires than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We