Optimizing speed & quality on Qwen3.6 27b
r/LocalLLaMA
Does the inference speed below seem optimal for this hardware, or is there room for further improvement? I’ve been trying to use Qwen3.6 27b for agentic harnesses like Pi/Hermes. Because agentic tasks run over long horizons, I’ve been trying to maximize speed while staying as close to full precision as possible. At a context window of 100k, speed varies widely: ~300-500 tok/s for prompt processing and ~22-30 tok/s for token generation. This is with 40GB of VRAM (1x 2060 Super 8GB, 2x 5060 Ti 16GB).
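
For a rough sense of what those numbers mean per agent turn, here is a back-of-the-envelope sketch in Python. It only uses the figures from the post, plus two assumptions of mine: that the full 100k-token context is processed from scratch (no prompt caching), and a hypothetical 1,000-token reply. Treat the results as rough bounds, not measurements.

```python
# VRAM per card in GB, as listed in the post.
gpus_gb = {"2060 Super": 8, "5060 Ti (a)": 16, "5060 Ti (b)": 16}
total_vram_gb = sum(gpus_gb.values())


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time to process `tokens` at a constant rate of `tok_per_s`."""
    return tokens / tok_per_s


# Assumption: the whole 100k context is prefilled from scratch (no cache reuse).
context_tokens = 100_000
prefill_slow = seconds_for(context_tokens, 300)  # low end of the reported range
prefill_fast = seconds_for(context_tokens, 500)  # high end of the reported range

# Assumption: a hypothetical 1,000-token reply, chosen only for illustration.
reply_tokens = 1_000
gen_slow = seconds_for(reply_tokens, 22)
gen_fast = seconds_for(reply_tokens, 30)

print(f"Total VRAM: {total_vram_gb} GB")
print(f"Prefill of {context_tokens} tokens: {prefill_fast:.0f}-{prefill_slow:.0f} s")
print(f"Generating {reply_tokens} tokens: {gen_fast:.0f}-{gen_slow:.0f} s")
```

Under those assumptions, a cold 100k-token prefill costs several minutes while the reply itself takes well under a minute, so for long-horizon agent loops the prompt-processing rate (and how much of the prompt can be reused between turns) likely matters more than the generation rate.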