KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

r/LocalLLaMA
Here's my article with 38 quant pairs thoroughly benchmarked by KLD across 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, and IQ4_XS + 128k context. This lets us track not only how cache quantization affects precision in a vacuum, but also how it interacts with noise from the model itself. All benchmarks were run with my BeeLlama.cpp fork, which makes it possible to include several quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6_0. TL;DR: q5_0 KV is underrated, and so is q5_1 as V cache.
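For anyone unfamiliar with the metric: KLD here means the KL divergence between the token distribution of a reference run and that of a run with a quantized KV cache, averaged over token positions. Below is a minimal sketch of that calculation in plain Python. The function names and the toy logits are made up for illustration; this is not the benchmark harness from the article, just the general idea.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one row of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(base_logits, test_logits):
    # KL(P_base || P_test) in nats for a single token position.
    lp = log_softmax(base_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_kld(base_rows, test_rows):
    # Average KLD over all token positions.
    klds = [kl_divergence(b, t) for b, t in zip(base_rows, test_rows)]
    return sum(klds) / len(klds)

# Toy data (hypothetical): two positions, three-token vocabulary.
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
noisy = [[1.9, 1.1, 0.1], [0.4, 0.6, 2.9]]

print(mean_kld(base, base))   # identical distributions, so zero
print(mean_kld(base, noisy))  # small positive value
```

Lower is better: a mean KLD near zero means the quantized cache barely changes the predicted distribution.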