TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens

arXiv:2605.31294v1 Announce Type: new Recent advances in Audio-LLMs such as GPT-4o have ushered in an era of conversational interaction with language models. Conversational avatars, however, still seem robotic in facial expression and conversational flow, in part due to sequential stages of speech recognition, text generation, turn-based text response, speech synthesis, and audio-driven facial animation.