Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image

r/LocalLLaMA

Sharing this because I didn't believe the first run. Setup: laptop-class RTX 5090 (24GB, sm_120 Blackwell, ~896 GB/s), Linux. Pulled unsloth/Qwen3.6-35B-A3B-MTP-GGUF UD-Q3_K_XL (17.2 GB on disk) and ran it on ggml-org/llama.cpp master from a few days ago: the cut that includes am17an's MTP merge, ggerganov's n_max=3 default cleanup, and the NVIDIA backend sampling work (, merged 2026-05-20