vLLM gives 5x the speed of llama.cpp, but the quants I want (Unsloth/GGUF) are not available. What to do?
r/LocalLLaMA
Hi, I want to run an Unsloth dynamic quant on vLLM.

Why vLLM? It gives me much faster prefill:

- llama.cpp: 800-1000 tokens/sec
- vLLM: 5k-10k tokens/sec

On vLLM I tried the official Qwen3.6-35B-A3B FP8. The machine is an RTX A6000 (Ampere, 48 GB).

Why the Unsloth quant? For my task (writing pandas code), the Unsloth Q8 quant, tested on llama.cpp, gives correct code, while the official FP8 quant gives much worse results. I don't know why.
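For reference, here is a quick sketch of the speedup implied by the two prefill ranges quoted above. The variable and function names are made up for illustration, and the numbers are only the rough ranges from this post, not a proper benchmark.

```python
# Prefill throughput ranges quoted in the post, in tokens/sec.
llama_range = (800, 1_000)     # GGUF quant on the llama side
vllm_range = (5_000, 10_000)   # official FP8 on vLLM


def speedup_bounds(slow, fast):
    """Return (min, max) speedup of `fast` over `slow`, each given as (low, high)."""
    slow_lo, slow_hi = slow
    fast_lo, fast_hi = fast
    # Worst case: slowest fast run vs. fastest slow run; best case is the reverse.
    return fast_lo / slow_hi, fast_hi / slow_lo


low, high = speedup_bounds(llama_range, vllm_range)
print(f"vLLM prefill speedup: {low:.1f}x to {high:.1f}x")
```

So the "5x" in the title is the conservative end of the range; the optimistic end is considerably higher.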