Strix Halo users: a rejected PR can give you up to 30% faster prompt processing (PP) for MoE models.
r/LocalLLaMA
Here's the PR by pedapudi. Its merge request has been denied, so it will not land in mainline llama.cpp. The changes are small enough that I just patch them into whatever the current llama.cpp release is. Read the PR for details. It only works with MoE models, and the boost is largest at low context; as the context grows, the gain diminishes. Pedapudi explains why that happens in the PR. Here are some numbers. It works really well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent.
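If you want to check the gain on your own box, compare the PP tokens/s you get before and after patching. Here's a minimal sketch of that arithmetic. The function name `pp_speedup_percent` and the two throughput values are placeholders I made up for illustration, not measurements from the PR or from my runs.

```python
def pp_speedup_percent(baseline_tps: float, patched_tps: float) -> float:
    """Return the relative prompt-processing gain in percent.

    baseline_tps: PP tokens/s from the unpatched build
    patched_tps:  PP tokens/s from the patched build
    """
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (patched_tps / baseline_tps - 1.0) * 100.0


# Hypothetical placeholder values, NOT real benchmark results.
# Substitute the PP tokens/s you measure at the same context size.
baseline = 100.0
patched = 130.0

print(f"PP gain: {pp_speedup_percent(baseline, patched):+.1f}%")
```

Run it once per context size you test, since the gain shrinks as context grows.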