Question: llama.cpp, what's good right now for MTP, KV cache quant, and long context?
r/LocalLLaMA
Used the vLLM version; it worked fine for maybe 20-40k context. Haven't tried the new one. Has anyone used the new patched llama.cpp one on a single 3090? The project is starting to seem very bloated, at least README-wise.

On mainline llama.cpp with a Q4 cache I get 60 t/s, but with the context filling up fast it drops to 20 t/s. Are there any better options, and what is your experience?

EDIT: Using Qwen 3.6 27b Q4.

EDIT: I use MTP on mainline as described above; context is max 4k at good speed with a Q4 cache.

submitted by /u/GodComplecs
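For anyone weighing cache quantization against context length on a 24 GB card, here is a rough back-of-envelope sketch of how KV cache size scales. The bytes-per-element figures follow the ggml block layouts (32 values per block: 18 bytes for q4_0, 34 bytes for q8_0). The layer count, KV head count, and head dimension are hypothetical placeholders, not the real dimensions of the model mentioned above, and the formula assumes plain attention in every layer, so treat the numbers as illustrative only.

```python
# Rough KV cache size estimate for different cache types.
# f16 = 2 bytes per value; q8_0 = 34 bytes per 32 values; q4_0 = 18 bytes per 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size in bytes, assuming standard attention in every layer."""
    # One K vector and one V vector per layer per KV head for each token.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # HYPOTHETICAL dimensions, chosen only for illustration.
    dims = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for cache_type in BYTES_PER_ELEMENT:
        for n_ctx in (4096, 40960):
            size = kv_cache_bytes(n_ctx, cache_type=cache_type, **dims)
            print(f"{cache_type:5s} ctx={n_ctx:6d}: {size / 2**30:.2f} GiB")
```

Under these assumptions a q4_0 cache takes about 28% of the f16 footprint, which is the memory that buys extra context. It says nothing about the speed drop as the context fills, which depends on the attention kernels rather than on cache size alone.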