We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro

r/LocalLLaMA
AI Hardware

Hey, I work on inference tooling at Mininglamp AI. We needed faster prefill for a 4B VLM running on Apple Silicon. The problem was that MLX only does weight-only quantization: activations stay FP16 the whole way through. So we wrote Cider, a small SDK that adds W8A8 activation quantization on top of MLX.

Numbers on M5 Pro (64GB, 307 GB/s), 4516-token context:

| Quantization | Prefill | Decode |
|---|---|---|
| W8A16 (MLX) | 2.839s | 80.1 tok/s |
| W8A8 (Cider) | 2.519s | 79.5 tok/s |

That's about 11% less prefill time, with decode speed essentially unchanged.

Under the hood it's custom Metal kernels that we registered as MLX primitives. At M=4096 the per-channel path runs 1.84x faster than W8A16 on the same shape.
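For anyone unsure what W8A8 means in practice, here's a rough NumPy sketch of the general idea: weights get one int8 scale per output channel, activations get quantized to int8 on the fly, the matmul accumulates in int32, and the result is rescaled to float. This is not Cider's API or its Metal kernels. The function names `quantize_rows_int8` and `w8a8_linear` are made up for illustration, and the per-token activation scale is an assumption (the post only mentions a per-channel path).

```python
import numpy as np

def quantize_rows_int8(x):
    """Symmetric int8 quantization with one float scale per row (hypothetical helper)."""
    max_abs = np.max(np.abs(x), axis=1, keepdims=True)
    # Guard against all-zero rows so we never divide by zero.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_linear(x, w_q, w_scale):
    """Compute x @ W.T with int8 weights and int8 activations (illustrative only)."""
    # Assumption: activations are quantized per token (one scale per row of x).
    x_q, x_scale = quantize_rows_int8(x)
    # Integer matmul with int32 accumulation, shape (M, N).
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    # Undo both scales: per-token (M, 1) times per-output-channel (1, N).
    return acc.astype(np.float32) * x_scale * w_scale.T

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64)).astype(np.float32)   # M=8 tokens, K=64 features
w = rng.standard_normal((32, 64)).astype(np.float32)  # N=32 output channels

w_q, w_scale = quantize_rows_int8(w)  # done once, offline
y = w8a8_linear(x, w_q, w_scale)      # activations quantized at runtime

ref = x @ w.T
rel_err = np.linalg.norm(y - ref) / np.linalg.norm(ref)
```

In W8A16 the activations stay in floating point, so only the weights come from int8 storage. In W8A8 both matmul operands are int8, which mostly pays off when M is large, as in prefill.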