AI RESEARCH
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
arXiv CS.CV
arXiv:2605.21988v1 Announce Type: new Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-