Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model

r/LocalLLaMA

I'm posting this because it may help others squeeze more out of the 3060's 12 GB of VRAM. All credit goes to spiritbuun's fork ( github.com/spiritbuun/buun-llama-cpp ) and mudler's APEX quantizations ( huggingface.co/mudler ). Spiritbuun's CUDA optimizations for NVIDIA GPUs (fused MMA fix, TurboQuant, fattn improvements) are what make it possible to run a 17.3 GB model with offloading on a 12 GB card at these speeds. Of the variants I tested, mudler's APEX I-Compact quantization gave me the best perplexity/speed trade-off.
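For a rough sense of scale, here is a back-of-the-envelope sketch of how much of the model cannot live in VRAM. It only uses the 17.3 GB and 12 GB figures above; the helper name `min_offload_gb` and the 2 GB reserve for KV cache and buffers are assumptions made up for illustration, not measured values, and GB vs. GiB is treated loosely.

```python
def min_offload_gb(model_gb: float, vram_gb: float, reserved_gb: float = 0.0) -> float:
    """Lower bound on the weights (in GB) that must sit outside VRAM.

    reserved_gb is VRAM set aside for things other than weights
    (KV cache, compute buffers); its value here is a guess.
    """
    usable = max(vram_gb - reserved_gb, 0.0)
    return max(model_gb - usable, 0.0)


MODEL_GB = 17.3  # model size from the post
VRAM_GB = 12.0   # RTX 3060 VRAM from the post

# Best case: every byte of VRAM holds weights.
spill = min_offload_gb(MODEL_GB, VRAM_GB)

# Hypothetical case: 2 GB kept free for KV cache and buffers (assumed figure).
spill_reserved = min_offload_gb(MODEL_GB, VRAM_GB, reserved_gb=2.0)

print(f"no reserve:   {spill:.1f} GB off-GPU ({spill / MODEL_GB:.0%} of the model)")
print(f"2 GB reserve: {spill_reserved:.1f} GB off-GPU ({spill_reserved / MODEL_GB:.0%} of the model)")
```

So even in the best case roughly a third of the weights have to sit in system RAM, which is why the speed of the offloaded path matters so much here.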