KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

ArXi:2510.02327v2 Announce Type: replace-cross Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper