Video2LoRA: Parametric Video Internalization for Vision-Language Models

arXiv:2606.04351v1 Announce Type: cross Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We