Qwen3.6-35B-A3B Q4 with 262k context on an 8GB 3070 Ti = 30+ tps

r/LocalLLaMA

And on 8GB of VRAM I can even push the context to 320K, 400K, 512K, and yes, 1M. It does start to slow down noticeably beyond 150K, though, so I'd only do this when I really need the larger context. This is using APEX-I-Quality or Q4_K_XL quants, both of which are better than Q4_K_M (IQ4_NL_XL for contexts beyond 512K). I have 32GB of DDR4-2666 in total, which is only slightly above the minimum DDR4 speed. I see a lot of users with better GPUs and more VRAM who seem to get less efficiency and have to drop the context all the way to 64K or below to run at a good tps, and I don't understand why.
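For anyone who wants to sanity-check how context length turns into memory, here is a rough back-of-the-envelope sketch. It uses the standard estimate for an attention KV cache (one K and one V vector per token, per caching layer, per KV head). The layer count, head count, and head size below are hypothetical placeholders, not the real Qwen3.6-35B-A3B config, so plug in the values from your own model's metadata before trusting any number.

```python
def kv_cache_bytes(n_ctx, n_kv_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Estimate KV-cache size in bytes.

    The leading 2 accounts for storing both K and V.
    bytes_per_elem=2.0 assumes an f16 cache; a quantized cache would be smaller.
    """
    return 2 * n_ctx * n_kv_layers * n_kv_heads * head_dim * bytes_per_elem


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# HYPOTHETICAL shape for illustration only (not taken from the real model).
HYPOTHETICAL_SHAPE = dict(n_kv_layers=10, n_kv_heads=2, head_dim=256)

if __name__ == "__main__":
    for n_ctx in (65_536, 262_144, 524_288, 1_048_576):
        size = gib(kv_cache_bytes(n_ctx, **HYPOTHETICAL_SHAPE))
        print(f"{n_ctx:>9} tokens -> {size:.2f} GiB KV cache (f16)")
```

The estimate grows linearly with context, so doubling the context doubles the cache. How much of that has to sit in VRAM versus system RAM depends on the runtime and its offload settings, which this sketch does not model.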