AI RESEARCH

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

arXiv cs.CV

arXiv:2511.20785v3 Announce Type: replace

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we