Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

arXiv:2605.21988v1 Announce Type: new Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-