MOSS-Audio Technical Report

arXiv:2606.01802v1 Announce Type: cross MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, including audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces temporal representations at 12.5 Hz (one frame per 80 ms of audio), the adapter projects them into the decoder's embedding space, and the decoder autoregressively generates text output.
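
To make the 12.5 Hz figure concrete, the following is a minimal sketch of the frame arithmetic and of an adapter-style projection. Only the 12.5 Hz rate comes from the abstract; the function names, the rounding-up of partial frames, the toy dimensions, and the use of a single linear map as the adapter are illustrative assumptions, not the report's actual implementation.

```python
import math

FRAME_RATE_HZ = 12.5  # encoder output rate stated in the abstract


def num_frames(duration_s: float) -> int:
    """Number of encoder frames for a clip (assumes a partial frame is padded up)."""
    return math.ceil(duration_s * FRAME_RATE_HZ)


def frame_start_time(index: int) -> float:
    """Start time in seconds of frame `index` (each frame spans 1 / 12.5 = 0.08 s)."""
    return index / FRAME_RATE_HZ


def project(frames, weights):
    """Hypothetical linear adapter: map each encoder frame (length d_enc)
    to the decoder embedding width (d_dec) with a d_enc x d_dec matrix."""
    d_dec = len(weights[0])
    return [
        [sum(x * weights[i][j] for i, x in enumerate(frame)) for j in range(d_dec)]
        for frame in frames
    ]


if __name__ == "__main__":
    print(num_frames(30.0))        # frames the decoder would see for a 30 s clip
    print(frame_start_time(100))   # audio position of the 101st frame, in seconds
    toy_frames = [[1.0, 2.0]]                           # one frame, d_enc = 2
    toy_weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # d_enc = 2 -> d_dec = 3
    print(project(toy_frames, toy_weights))
```

Under these assumptions, a 30-second clip maps to 375 frames, which gives a rough sense of how many audio positions the decoder attends to, and a frame index converts back to a time offset by dividing by the frame rate, which is the kind of mapping timestamped transcription relies on.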