Flash Attention for llama.cpp on RDNA3: 47% less KV VRAM than Vulkan f16 K, KLD almost losselss on F16 K / q4_0 V. Part 1.

r/LocalLLaMA
Generative AI AI Hardware Open Source AI

The normal tradeoff in llama.cpp attention is: quantize your KV cache and lose quality, or keep fp16 and burn VRAM. On RDNA3 there's a third option(from now on)! Pack four 8-bit K values into a single 32-bit and feed them directly to the GPU's native `sudot4` dot-product instruction. No lossy quantization of K. No fp16 K buffer sitting in memory. The kernel gets exactly the data layout it needs, and VRAM drops because you're storing 8-bit K payloads plus fp16 scales instead of full fp16 K tensors.