IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation

ArXi:2605.30230v1 Announce Type: new With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a finetuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter.