110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp
r/LocalLLaMA
I had been getting great MTP (multi-token prediction) performance with llama.cpp on my RTX 4070 Super 12GB, until they actually merged the MTP PR. After that, performance tanked and was barely above non-MTP. So I decided to try ik_llama.cpp, since it also supports MTP and is apparently better optimized for CPU offloading. I did not expect such a huge speed boost!

Before moving on to the benchmark results, here are my PC specs:

- OS: CachyOS (HIGHLY recommended)
- GPU: RTX 4070 Super 12GB
- CPU: AMD Ryzen 7 9700X
- RAM: 48GB DDR5-6000