MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

arXiv:2605.21954v1 Announce Type: new Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-