ICYMI: llama.cpp b9455 -sm tensor KV Cache Fix is MERGED

r/LocalLLaMA

Them boys can cook, one big fix after another! If you're running -sm tensor on multi-GPU, this is the KV cache quantization fix.

JohannesGaessler commented 5 days ago:

"This PR implements [support] for the combination of -sm tensor and quantized KV cache. The reason why this doesn't work on master is that the flattening of tensors for the KV cache rotation leads to the loss of shape information which the meta backend cannot handle."
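
For anyone wondering why losing shape information matters for a tensor split, here is a toy sketch in plain Python. It is not llama.cpp or ggml code. The shape layout (head_dim, n_kv_heads, n_tokens), the choice of split axis, and the helper names are all assumptions made up for illustration. The idea is only that a backend that splits along a named axis has nothing to split on once the tensor has been flattened to 1D.

```python
from math import prod


def split_along_axis(shape, axis, n_devices):
    """Return the per-device shape when splitting `shape` evenly along `axis`.

    Hypothetical helper: raises ValueError if the axis is missing or the
    dimension does not divide evenly across the devices.
    """
    if axis >= len(shape):
        raise ValueError(f"axis {axis} does not exist in shape {shape}")
    if shape[axis] % n_devices != 0:
        raise ValueError(f"dim {shape[axis]} not divisible by {n_devices}")
    out = list(shape)
    out[axis] //= n_devices
    return tuple(out)


def flatten(shape):
    """Collapse every dimension into one, as a flattening view would."""
    return (prod(shape),)


# Assumed toy layout: (head_dim, n_kv_heads, n_tokens)
kv_shape = (128, 8, 4096)
HEAD_AXIS = 1  # assumption: the split happens across KV heads

# With the full shape, a 2-GPU split along the head axis is well defined.
per_device = split_along_axis(kv_shape, HEAD_AXIS, 2)

# After flattening, the head axis is gone, so the same split cannot be expressed.
flat = flatten(kv_shape)
try:
    split_along_axis(flat, HEAD_AXIS, 2)
    flat_split_ok = True
except ValueError:
    flat_split_ok = False

print(per_device, flat, flat_split_ok)
```

The actual fix lives in the PR itself. This only shows the general failure mode the comment describes.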