Inference-Time Scaling for Joint Audio-Video Generation
arXiv cs.CV
arXiv:2606.03183v1 Announce Type: cross
Joint audio-video generation aims to synthesize realistic audio-video pairs that are semantically aligned with text prompts and precisely synchronized with each other. While existing joint audio-video generation models often require substantial