Single 3090 with Q4 Qwen 27B: context dropped from 137k to 14k with MTP enabled. Is this normal?
r/LocalLLaMA
Note: I'm on the latest version of llama.cpp (b4c0549a49be9e6dc59ac9d0a5bc21dbda910774).

My run command:

```
llama-server \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --presence_penalty 0.0 \
  --min-p 0.00 \
  --gpu-layers all \
  -m /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  -a llama.cpp \
  --host 0.0.0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --flash-attn on
```

With this command, the built-in web UI shows a context size of 137k. After adding `--spec-type draft-mtp --spec-draft-n-max 2`, the reported context size drops to 14k.
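For anyone who wants to compare the two runs outside the web UI, here is a rough sketch that reads the context size straight from the server. It assumes the server's `/props` endpoint returns JSON with `n_ctx` under `default_generation_settings` (the layout may differ between builds), that `python3` is available, and that the server is on the default port 8080. `extract_n_ctx` is a hypothetical helper name, not part of llama.cpp.

```shell
# Hypothetical helper: read the /props JSON on stdin and print the reported n_ctx.
# Assumes the field lives at default_generation_settings.n_ctx; adjust if your build differs.
extract_n_ctx() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["default_generation_settings"]["n_ctx"])'
}

# Example invocation against a running server (assumed to be on the default port 8080):
#   curl -s http://127.0.0.1:8080/props | extract_n_ctx
```

Running it once with and once without the MTP flags should show whether the 137k and 14k figures match what the server itself reports.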