ICYMI: llama.cpp b9455 -sm tensor KV cache fix is merged
r/LocalLLaMA
Them boys can cook, one big fix after another! If you're running -sm tensor on multi-GPU, this is the KV cache quantization fix.

JohannesGaessler commented 5 days ago:

This PR implements support for the combination of -sm tensor and quantized KV cache. The reason why this doesn't work on master is that the flattening of tensors for the KV cache rotation leads to the loss of shape information, which the meta backend cannot handle.
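
For anyone wondering why losing shape information matters when tensors are split across GPUs, here's a rough toy sketch in plain Python. It is not llama.cpp code: the cache layout, the axis choice, and the function names are all made up for illustration. It only assumes that a backend splitting a tensor across devices needs to know which axis to split along.

```python
from math import prod

def split_along_axis(shape, axis, n_devices):
    """Return the per-device shape when `shape` is split evenly along `axis`.

    Hypothetical helper for illustration, not a llama.cpp function.
    """
    if axis >= len(shape):
        raise ValueError(f"axis {axis} does not exist in shape {shape}")
    if shape[axis] % n_devices != 0:
        raise ValueError(f"dim {shape[axis]} is not divisible by {n_devices}")
    per_device = list(shape)
    per_device[axis] //= n_devices
    return tuple(per_device)

def flatten(shape):
    """Collapse every dimension into one; the element count is unchanged."""
    return (prod(shape),)

# Assumed layout for the example only: (head_dim, n_heads, n_tokens)
kv_shape = (128, 8, 4096)
HEAD_AXIS = 1

# With the full shape, splitting the heads across 2 devices is well defined.
sharded = split_along_axis(kv_shape, HEAD_AXIS, 2)

# After flattening, the head axis no longer exists, so the same request fails.
flat = flatten(kv_shape)
try:
    split_along_axis(flat, HEAD_AXIS, 2)
    flat_split_ok = True
except ValueError:
    flat_split_ok = False

print(sharded, flat, flat_split_ok)
```

Same number of elements either way, but once the tensor is 1-D nothing records where one head ends and the next begins. The PR text doesn't say how it gets around this, so treat the above purely as intuition for the failure.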