Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model
r/LocalLLaMA
I'm posting this because it may help others squeeze more out of the 12 GB of VRAM on the 3060. All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler). Spiritbuun's CUDA optimizations for NVIDIA GPUs (fused MMA fix, TurboQuant, fattn improvements) are what make it possible to offload a 17.3 GB model on a 12 GB card at these speeds. Mudler's APEX I-Compact quantization gave me the best perplexity/speed trade-off of any variant I tested.
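For anyone wondering why offloading is unavoidable here, a rough back-of-the-envelope sketch is below. It only does the size arithmetic from the numbers above (17.3 GB model, 12 GB card). The function name `min_offload_gb` and the `reserved_gb` parameter are made up for illustration, and the 2 GB reserve in the second call is a placeholder, not a measured value.

```python
def min_offload_gb(model_gb: float, vram_gb: float, reserved_gb: float = 0.0) -> float:
    """Smallest amount of model weights (in GB) that cannot fit in VRAM.

    reserved_gb is a hypothetical allowance for everything else that lives
    in VRAM (KV cache, compute buffers, desktop), not a measured figure.
    """
    usable_gb = vram_gb - reserved_gb
    if usable_gb < 0:
        raise ValueError("reserved_gb cannot exceed vram_gb")
    # Whatever does not fit in the usable VRAM has to stay in system RAM.
    return max(0.0, model_gb - usable_gb)


# Numbers from the post: 17.3 GB model, 12 GB card, nothing reserved.
print(round(min_offload_gb(17.3, 12.0), 1))

# Same model with an assumed 2 GB set aside for KV cache and buffers.
print(round(min_offload_gb(17.3, 12.0, reserved_gb=2.0), 1))
```

So even in the best case at least about 5.3 GB of weights has to sit in system RAM, and more once the context cache takes its share of VRAM.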