The Evolution of LLM Inference: Decoding Algorithms (Part 2)
Towards AI • Generative AI
This is the second part of the LLM inference article. For the first part, please refer to Part 1. This part focuses mainly on decoding algorithms: how inference moved from naive autoregressive decoding to speculative decoding, multi-head prediction, tree-based verification, draft-free speculative decoding, and long-context speculative decoding. Memory and compute optimizations are also covered later, because real-world inference systems use these techniques together.

Content (Part 2):
🚀 1. Draft-Free Speculative Decoding
📖 2. Long-Context Speculative Decoding
💾 3.