llama.cpp has a clever trick for speeding up KV cache decode
r/LocalLLaMA
So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. Since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed one in particular under the developer options. As far as I can tell from its description (I haven't looked at the code yet), it basically re-sends all of the tokens generated by the current response to the KV cache right away, rather than waiting for you to prompt the model again before it starts decoding them.
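To make the idea concrete, here is a toy sketch of why warming the cache early would help. This is not llama.cpp code: `ToyKVCache`, `ingest`, and `next_turn_cost` are made-up names, the token counts are arbitrary, and the model assumes (per the setting's description) that the cache only holds the prompt after a response finishes. It just counts how many tokens still need processing when the next message arrives.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyKVCache:
    """Hypothetical stand-in for a KV cache; it only tracks which tokens are cached."""

    def __init__(self):
        self.tokens = []

    def ingest(self, tokens):
        """Make the cache hold `tokens`; return how many tokens had to be processed."""
        keep = common_prefix_len(self.tokens, tokens)
        processed = len(tokens) - keep
        self.tokens = list(tokens)
        return processed


def next_turn_cost(warm_after_response):
    """Tokens processed when the follow-up message arrives (toy numbers)."""
    prompt = list(range(100))           # first user prompt, 100 tokens
    response = list(range(1000, 1050))  # 50 generated tokens
    followup = list(range(2000, 2010))  # 10 tokens in the next user message

    cache = ToyKVCache()
    cache.ingest(prompt)  # assumption: only the prompt is cached after the reply
    if warm_after_response:
        # The trick as described: push the response in while the user is still reading.
        cache.ingest(prompt + response)
    return cache.ingest(prompt + response + followup)


print(next_turn_cost(False))  # response + follow-up processed on the next turn
print(next_turn_cost(True))   # only the follow-up is left to process
```

In this toy model the total work is the same either way; the setting would just move it to idle time, so the next reply starts sooner.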