vLLM PR adding native HIP W4A16 kernel was merged

r/LocalLLaMA

The performance increase introduced by the PR is awesome. It makes my ROCm rig a lot more useful.

Numbers from the PR:

| Kernel | dtype | max-num-seqs=8 | max-num-seqs=32 |
|---|---|---|---|
| Triton W4A16 | bf16 | 82.4 tk/s | - |
| Triton W4A16 | fp16 | 83.2 tk/s | - |
| ExLlama (no bf16) | fp16 | 255.0 tk/s | 382.5 tk/s |
| RDNA3 W4A16 (this PR) | bf16 | 205.3 tk/s | 382.5 tk/s |
| RDNA3 W4A16 (this PR) | fp16 | 270.2 tk/s | 445.7 tk/s |

EDIT: The numbers are for Qwen3.6-27B-GPTQ-W4A16-G32.

See here: PR link

submitted by /u/StupidityCanFly
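To put the numbers quoted from the PR in relative terms, here is a small sketch that computes speedup ratios between kernels. The dictionary layout and the helper name `speedup` are my own for illustration and are not part of vLLM or the PR; the throughput values are copied from the figures above, with `None` for the entries reported as "-".

```python
# Throughput in tokens/s as quoted from the PR.
# Each value is (max-num-seqs=8, max-num-seqs=32); None means not reported.
results = {
    ("Triton W4A16", "bf16"): (82.4, None),
    ("Triton W4A16", "fp16"): (83.2, None),
    ("ExLlama", "fp16"): (255.0, 382.5),
    ("RDNA3 W4A16", "bf16"): (205.3, 382.5),
    ("RDNA3 W4A16", "fp16"): (270.2, 445.7),
}


def speedup(new_key, old_key, column):
    """Return throughput(new) / throughput(old) for one column (0 or 1).

    Returns None when either measurement is missing.
    """
    new = results[new_key][column]
    old = results[old_key][column]
    if new is None or old is None:
        return None
    return new / old


comparisons = [
    (("RDNA3 W4A16", "fp16"), ("Triton W4A16", "fp16")),
    (("RDNA3 W4A16", "bf16"), ("Triton W4A16", "bf16")),
    (("RDNA3 W4A16", "fp16"), ("ExLlama", "fp16")),
]

for new_key, old_key in comparisons:
    for column, label in enumerate(("max-num-seqs=8", "max-num-seqs=32")):
        ratio = speedup(new_key, old_key, column)
        text = "n/a" if ratio is None else f"{ratio:.2f}x"
        print(f"{new_key} vs {old_key} @ {label}: {text}")
```

By this arithmetic the new kernel lands at roughly 3.2x the Triton fp16 throughput and about 2.5x the Triton bf16 throughput at max-num-seqs=8, and around 1.06x (8 seqs) to 1.17x (32 seqs) over ExLlama in fp16. These ratios only restate the figures in the post; they say nothing about other models or GPUs.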