Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache
NVIDIA TensorRT Blog
About This Tutorial
Quantization is one of the strongest levers for large-scale inference. By reducing the precision of weights, activations, and the KV cache, we can reduce the memory footprint of a deployed model.
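To make the memory claim concrete for the KV cache, here is a rough back-of-the-envelope sketch. The model shape, the helper name `kv_cache_bytes`, and the bits-per-element figures are illustrative assumptions, not values from this tutorial. In particular, the NVFP4 entry assumes 4-bit values plus one 8-bit scale per 16-element block (about 4.5 bits per element) and ignores any per-tensor scale.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bits_per_element):
    """Approximate KV cache size in bytes for one decoding batch."""
    # Each layer stores two tensors (keys and values), each holding
    # num_kv_heads * head_dim elements per token per sequence.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits_per_element / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128,
             seq_len=131072, batch_size=8)

# Assumed effective storage cost per cached element.
BITS = {
    "fp16": 16.0,
    "fp8": 8.0,
    "nvfp4": 4.0 + 8.0 / 16,  # 4-bit value + 8-bit scale shared by 16 elements
}

sizes_gib = {
    name: kv_cache_bytes(bits_per_element=bits, **shape) / 2**30
    for name, bits in BITS.items()
}

for name, gib in sizes_gib.items():
    print(f"{name:>6}: {gib:.1f} GiB")
```

Under these assumptions the cache takes 128 GiB at FP16, 64 GiB at FP8, and 36 GiB at the assumed NVFP4 cost. Because the size scales linearly with both sequence length and batch size, the savings matter most in exactly the long-context, large-batch regime this tutorial targets.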