EDUCATION & TRAINING
Why your quantized LLM loses its MTP heads and how to keep them
Dev.to Machine Learning
About This Tutorial
The frustrating problem Last month a teammate pinged me with a classic head-scratcher. He'd taken a base model with multi-token prediction (MTP) heads, ran it through a standard quantization pipeline to ship a smaller GGUF for edge inference, and the latency numbers came back worse than expected. The model still generated coherent text, but the speculative decoding speedup he'd built his benchmarks around was gone. We poked around for an hour before the penny dropped. The MTP heads had silently been dropped on the floor during conversion. The base weights survived.