AI RESEARCH

What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

arXiv CS.CV

arXiv:2605.20795v1 Announce Type: new

Flow-matching-based video generative models increasingly rely on prepended Vision-Language Models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption behind this paradigm is that a connector module can seamlessly align the VLM's rich multi-modal reasoning with the original text embedding space of Diffusion Transformers (DiTs). However, we hypothesize that this alignment acts as a severe semantic bottleneck, degrading fine-grained structural variables.