Why MTP doesn't speed up your llama.cpp inference (and how to actually fix it)

Dev.to Machine Learning

About This Tutorial

Last week, I spent two days banging my head against a wall. I had just spun up a fresh llama.cpp build with multi-token prediction (MTP), loaded a quantized Qwen3 model, and ran my benchmark suite expecting that sweet 2-3x speedup everyone keeps talking about. The result? Roughly the same tokens per second. Sometimes slower. After a lot of profiling, I figured out what was happening, and it turns out the issue is more common than the celebratory benchmark posts suggest. This post is for anyone who's enabled MTP, expected a speedup, and got nothing.