Try ik_llama.cpp with MTP if you have limited VRAM. You will be pleasantly surprised!
I had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB (75-80 tok/s) until they actually merged the MTP PR. Then performance tanked (65-70 tok/s) and was barely above non-MTP. I then decided to try out ik_llama.cpp since it also s