VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv:2606.04588v1 Announce Type: new Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We