On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

arXiv:2601.06329v2

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes such as speaker identity and emotion, and thus serve as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens.
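
To make "the text perplexity formulation" concrete, here is a minimal sketch of standard perplexity, the exponential of the average negative log-likelihood per token, applied to a sequence of speech-token log-probabilities. The function name `token_perplexity` and the input layout (one natural-log probability per discrete speech token) are assumptions made for illustration. The paper's exact definition of the "global" variant is not given in this excerpt.

```python
import math
from typing import Sequence


def token_perplexity(token_log_probs: Sequence[float]) -> float:
    """Standard perplexity over one token sequence.

    token_log_probs[i] is assumed to be the natural-log probability the
    model assigned to the i-th (speech) token given all previous tokens.
    """
    if len(token_log_probs) == 0:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical example: a model that assigns probability 0.25 to every token.
uniform_log_probs = [math.log(0.25)] * 8
ppl = token_perplexity(uniform_log_probs)
```

In this sketch every token contributes equally to the average, exactly as in text language modeling. That equal weighting is what carries over when the formulation is applied directly to speech tokens.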