Try ik_llama.cpp with MTP if you have limited VRAM. You will be pleasantly surprised!

r/LocalLLaMA

I had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB (75-80 tok/s) until they actually merged the MTP PR. Then performance tanked to 65-70 tok/s, barely above non-MTP. I then decided to try out ik_llama.cpp since it also s