mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

r/LocalLLaMA

Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput. The result: on Gemma 4 (dense and MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200.

See some results below on GB10 and B200. The full report includes all steps to reproduce these results.

The results hold up across quantization type (Q8_0, Q4K), model (dense and MoE), and.