Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

arXiv:2605.28063v1 Announce Type: cross

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we