I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.
r/LocalLLaMA
•
Generative AI
Open Source AI
AI Tools
Hey guys, I spent the last few weeks benchmarking Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B locally GGUF, FP8 using both vLLM and llama.cpp. MTP is the inference trick every major lab is quietly adding to their stack right now and the results genuinely surprised me.