Inference-Time Scaling for Joint Audio-Video Generation

arXiv:2606.03183v1 Announce Type: cross Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial