The Evolution of LLM Inference: Decoding Algorithms

LLM inference optimization can be understood along three major axes: memory optimization, compute optimization, and decoding algorithms. Decoding algorithms receive less attention than the other two, even though they are becoming increasingly important for fast LLM serving. This article focuses mainly on decoding algorithms: how the field moved from naive autoregressive decoding to speculative decoding, multi-head prediction, tree-based verification, draft-free speculative decoding, and long-context speculative decoding.
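To make the starting point and the first step of that progression concrete, here is a minimal sketch that contrasts naive autoregressive decoding with a greedy form of speculative decoding. Everything in it is an illustrative assumption rather than any library's API: `target_next` and `draft_next` are hypothetical toy functions standing in for a large and a small model, and verification uses exact greedy matching instead of the rejection-sampling rule used when sampling. The verification step is written as a loop here; in a real system it would be one batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token prefix to the next token
# it would pick greedily. Real models return logits over a vocabulary.
NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Toy 'large' model: the next token is the prefix sum modulo 7."""
    return sum(prefix) % 7


def draft_next(prefix: List[int]) -> int:
    """Toy 'small' model: matches the target except when the prefix
    length is a multiple of 4, where it is deliberately wrong."""
    guess = sum(prefix) % 7
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def autoregressive_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Naive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 3
) -> Tuple[List[int], int]:
    """Greedy speculative decoding: draft k tokens, verify them with the target.

    Returns the decoded tokens and the number of verification passes.
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens, one at a time.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target scores all k + 1 positions. This loop simulates
        #    what would be a single batched forward pass.
        target_passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # 3. Accept the longest prefix on which draft and target agree,
        #    then append the target's own token at the next position.
        n_accept = 0
        while n_accept < k and proposal[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens.extend(proposal[:n_accept])
        tokens.append(verified[n_accept])
    return tokens[:goal], target_passes


prompt = [1, 2, 3]
baseline = autoregressive_decode(target_next, prompt, n_new=12)
speculative, passes = speculative_decode(target_next, draft_next, prompt, n_new=12)
print(baseline == speculative, passes)
```

Under these assumptions, the speculative loop reproduces the target model's greedy output exactly while invoking the target fewer times than once per token. How much it saves depends on how often the draft agrees with the target.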