AI RESEARCH

Fast and Expressive Multi-Byte Prediction with Probabilistic Circuits

arXiv cs.LG

arXiv:2511.11346v2. Multi-token prediction (MTP) is a prominent strategy for significantly speeding up generation in large language models (LLMs), especially byte-level LLMs, which are tokeniser-free but prohibitively slow. However, many existing MTP methods either assume independence between future tokens, sacrificing expressiveness, or generate tokens one at a time within the window, increasing latency. In this work, we investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs