Can't get over 250 tps on an RTX 5090 with Qwen3.5-4B

r/LocalLLaMA

My main model is qwen3.6-27b-mtp, and I'm getting around 100 tps generation and 2500 tps prefill, which is great. I tried adding a second, small model (Qwen3.5-4B) for auxiliary tasks, but even when it's the only model running, it doesn't go over 200-250 tps. I'm building llama.cpp myself and running it in Docker on Windows. I've also tried the havenoammo/llama:cuda13-server image and get exactly the same performance, so I think my build flags are OK. I've also tested with LM Studio, and performance is similar.
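
For anyone comparing numbers across setups, here is a minimal sketch of how prefill and generation throughput can be computed separately from token counts and elapsed times. The key names (`prompt_n`, `prompt_ms`, `predicted_n`, `predicted_ms`) are an assumption, modeled on the `timings` object that llama-server includes in its responses; adjust them to whatever your backend reports. The sample values are made up for illustration and are not measurements.

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens/s from a token count and elapsed time in ms."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return n_tokens * 1000.0 / elapsed_ms


def summarize(timings: dict) -> dict:
    """Split one timings record into prefill and generation throughput.

    Assumed keys (not guaranteed for every backend):
    prompt_n, prompt_ms, predicted_n, predicted_ms.
    """
    return {
        "prefill_tps": tokens_per_second(timings["prompt_n"], timings["prompt_ms"]),
        "decode_tps": tokens_per_second(timings["predicted_n"], timings["predicted_ms"]),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    sample = {
        "prompt_n": 5000, "prompt_ms": 2000.0,
        "predicted_n": 500, "predicted_ms": 2000.0,
    }
    print(summarize(sample))
```

Keeping the two figures separate matters here because prefill and generation are reported as very different numbers, and a single blended tokens/s value would hide which phase is the one that is capped.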