KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche
r/LocalLLaMA
Here's my article with 38 quant pairs thoroughly benchmarked in KLD across 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, and IQ4_XS + 128k context. This lets us track not only how cache quantization affects precision in a vacuum, but also how it interacts with the noise from the model's own quantization. All benchmarks were run with my BeeLlama.cpp fork, which let me include a number of quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6_0. TL;DR: q5_0 KV is underrated, and so is q5_1 as V cache.
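For anyone unfamiliar with the metric: KLD here means the KL divergence between the token distribution of a reference run and the distribution of the run under test, averaged over token positions. Lower is better, and 0 means identical predictions. Below is a minimal sketch of that calculation on toy logits. It is not the benchmark harness used for the article; the function names (`log_softmax`, `kl_divergence`, `mean_kld`) and the numbers are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(base_logits, test_logits):
    """KL(P_base || P_test) in nats for a single token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kld(base_rows, test_rows):
    """Average the per-position KLD over all token positions."""
    klds = [kl_divergence(b, t) for b, t in zip(base_rows, test_rows)]
    return sum(klds) / len(klds)

# Toy data (hypothetical): two token positions, vocabulary of three tokens.
base = [[2.0, 1.0, 0.1], [0.5, 2.5, -1.0]]   # reference run, e.g. unquantized KV cache
noisy = [[1.9, 1.1, 0.2], [0.4, 2.6, -0.8]]  # same positions with a quantized KV cache

print(mean_kld(base, base))   # identical distributions give 0.0
print(mean_kld(base, noisy))  # small positive value
```

In a real run the rows would be full-vocabulary logits collected over a long evaluation text, once for the reference config and once for each K/V quant pair.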