T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

arXiv:2512.21094v2

Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts.