JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

arXiv:2602.18527v2 Announce Type: replace-cross Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice