Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge

arXiv:2606.01104v1 Announce Type: new VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that a single frame cannot always resolve. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to identify unstable questions.
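
One way to picture the "find unstable questions" step is as an agreement check between the direct answer and the answers obtained from the lightweight views. The sketch below is illustrative only: the abstract does not say how instability is measured, so the agreement-rate criterion, the `min_agreement` threshold, and the names `agreement_rate` and `select_unstable` are all assumptions rather than the authors' method.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def agreement_rate(direct_answer: str, view_answers: Sequence[str]) -> float:
    """Fraction of view answers that match the direct-pass answer.

    Hypothetical stability measure; the paper may use a different one.
    """
    if not view_answers:
        # With no extra views there is no evidence of disagreement.
        return 1.0
    votes = Counter(view_answers)
    return votes[direct_answer] / len(view_answers)


def select_unstable(
    answers: Dict[str, Tuple[str, List[str]]],
    min_agreement: float = 0.75,  # assumed threshold, not from the source
) -> List[str]:
    """Return IDs of questions whose views disagree with the direct pass.

    `answers` maps a question ID to (direct_answer, view_answers).
    """
    return [
        qid
        for qid, (direct, views) in answers.items()
        if agreement_rate(direct, views) < min_agreement
    ]


if __name__ == "__main__":
    demo = {
        "q1": ("A", ["A", "A", "A", "B"]),  # 3 of 4 views agree
        "q2": ("C", ["A", "C", "B", "A"]),  # 1 of 4 views agree
    }
    print(select_unstable(demo))
```

Under this assumed criterion, only the flagged questions would receive additional test-time computation, while questions whose views agree with the direct pass keep their original answer.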