Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

arXiv:2606.01620v1 Announce Type: new Video diffusion models have significantly advanced portrait video generation, yet their high computational cost limits their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Built for streaming scenarios, the framework combines a causal video VAE for deep latent compression with an autoregressive latent denoising model.
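
The abstract does not describe the architecture in detail, so the following is only a toy sketch of why causality matters for streaming, not the paper's model. It assumes a one-dimensional temporal filter standing in for a causal layer; the names `causal_conv` and `StreamingCausalConv` are hypothetical. Because each output depends only on the current and past frames, processing the input chunk by chunk with a small cache gives the same result as processing the whole sequence at once.

```python
def causal_conv(frames, kernel):
    """Causal temporal filter: output[t] reads only frames[t-k+1 .. t]."""
    k = len(kernel)
    # Left-pad with zeros so that no future frame is ever read.
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


class StreamingCausalConv:
    """The same filter applied chunk by chunk, caching the last k-1 frames."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Initial cache plays the role of the zero left-padding.
        self.cache = [0.0] * (len(self.kernel) - 1)

    def push(self, chunk):
        k = len(self.kernel)
        padded = self.cache + list(chunk)
        out = [
            sum(self.kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(chunk))
        ]
        # Keep only the most recent k-1 frames for the next chunk.
        self.cache = padded[len(padded) - (k - 1):]
        return out


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    kernel = [0.25, 0.25, 0.5]  # illustrative weights, last one hits the current frame

    full = causal_conv(frames, kernel)

    stream = StreamingCausalConv(kernel)
    streamed = stream.push(frames[:3]) + stream.push(frames[3:4]) + stream.push(frames[4:])

    print(streamed == full)
```

A non-causal filter would need frames that have not arrived yet, so it could not emit output until the full clip was available; the cache here is the only state carried between chunks.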