"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

arXiv:2411.02355v4 Announce Type: replace-cross Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family.