AI RESEARCH
Masked Diffusion Vision-Language Models for Temporal Action Localization
arXiv cs.CV
arXiv:2605.29858v1 Announce Type: new
Temporal action localization (TAL) requires recognizing the target event and precisely localizing its start and end times in untrimmed videos. Recent vision-language formulations improve semantic reasoning and language-conditioned outputs, but their autoregressive decoders still generate tokens from left to right, which prevents later semantic evidence from revising earlier timestamp predictions.
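The ordering limitation described above can be made concrete with a toy sketch. This is not the paper's model: `decode_left_to_right`, `decode_masked`, and `toy_predict` are hypothetical names, the three-slot sequence (start time, end time, action label) is an assumed layout, and the tokens and confidence values are invented purely for illustration. The sketch only contrasts a decoder that commits each position in order with one that re-predicts all positions at every step and keeps the most confident ones.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decoded yet

Sequence = List[Optional[str]]
Predictor = Callable[[Sequence, int], Tuple[str, float]]


def decode_left_to_right(length: int, predict: Predictor) -> Sequence:
    """Autoregressive order: position i is committed before later positions exist."""
    seq: Sequence = [MASK] * length
    for i in range(length):
        token, _confidence = predict(seq, i)
        seq[i] = token  # never revisited afterwards
    return seq


def decode_masked(length: int, predict: Predictor, steps: int) -> Sequence:
    """Iterative unmasking: every step re-predicts all positions from the
    current partial sequence, keeps the most confident ones, remasks the rest."""
    seq: Sequence = [MASK] * length
    for step in range(1, steps + 1):
        proposals = [predict(seq, i) for i in range(length)]
        keep = max(1, (length * step) // steps)  # positions kept after this step
        ranked = sorted(range(length), key=lambda i: proposals[i][1], reverse=True)
        kept = set(ranked[:keep])
        seq = [proposals[i][0] if i in kept else MASK for i in range(length)]
    return seq


def toy_predict(seq: Sequence, i: int) -> Tuple[str, float]:
    """Hypothetical predictor: slots are [start, end, label]. Timestamp guesses
    become sharper once the label slot is visible. All values are made up."""
    label_known = seq[2] is not MASK
    if i == 2:
        return ("dive", 0.9)
    if i == 0:
        return ("t=12", 0.8) if label_known else ("t=10", 0.3)
    return ("t=18", 0.7) if label_known else ("t=20", 0.2)


print(decode_left_to_right(3, toy_predict))     # ['t=10', 't=20', 'dive']
print(decode_masked(3, toy_predict, steps=3))   # ['t=12', 't=18', 'dive']
```

In the left-to-right run, both timestamps are fixed before the label is decoded, so they keep their low-confidence guesses. In the masked run, the label is kept first and the timestamps are re-predicted with it in view, which is the kind of revision a strictly ordered decoder cannot perform.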