[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo

r/LocalLLaMA

Most of you are probably aware that starting llama.cpp with any KV cache quantization other than -ctk q8_0 -ctv q8_0 or -ctk q4_0 -ctv q4_0 leads to prompt processing running on the CPU instead of the GPU, at least for CUDA. For example, with the frequently suggested mix of -ctk q8_0 -ctv q4_0, prompt processing speed tanks. I discussed this with a proprietary LLM, and it suggested either making some slight modifications to the CUDA source code of llama.cpp or building with cmake -DGGML_CUDA_FA_ALL_QUANTS=ON, which takes a very long time to compile.
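For anyone who wants to try the rebuild route, here is a rough sketch of what it could look like. It is not a verified recipe. The build directory, the model path, and the -ngl value are placeholders I made up. The script only prints the commands unless you run it with RUN set to empty. Quantized V cache also needs flash attention enabled, and the spelling of that flag differs between llama.cpp versions, so it is left as an empty variable here; check --help on your build.

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Run with `RUN= sh rebuild.sh` to actually execute (script name is hypothetical).
RUN="${RUN-echo}"

# Placeholder model path, not from the original post.
MODEL="${MODEL-model.gguf}"

# Flash attention flag, intentionally empty: its spelling depends on the build.
FA_FLAGS="${FA_FLAGS-}"

# Compile the CUDA flash attention kernels for all K/V quant combinations.
# Expect a much longer build than the default.
CMAKE_FLAGS="-DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON"

# The asymmetric cache mix discussed above: 8-bit keys, 4-bit values.
KV_FLAGS="-ctk q8_0 -ctv q4_0"

$RUN cmake -B build $CMAKE_FLAGS
$RUN cmake --build build --config Release -j
$RUN ./build/bin/llama-server -m "$MODEL" -ngl 99 $FA_FLAGS $KV_FLAGS
```

If prompt processing is still slow after the rebuild, check the startup log to see whether the cache types and flash attention were actually picked up.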