Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

NVIDIA TensorRT Blog

About This Tutorial

Quantization is one of the strongest levers for large-scale inference. By reducing the precision of weights, activations, and the KV cache, we can reduce the memory footprint of inference.
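To make the memory effect concrete for the KV cache, here is a rough back-of-the-envelope sketch. The model dimensions below are hypothetical and chosen only for illustration. The NVFP4 figure of 4.5 bits per element assumes 4-bit values plus one 8-bit scale shared by each block of 16 elements, and ignores any per-tensor scale.

```python
# Rough KV cache size estimate at different precisions.
# All model dimensions here are hypothetical, not taken from a specific model.

GIB = 1024 ** 3

# Effective bits per stored element (assumption: NVFP4 = 4-bit value
# plus one 8-bit scale per 16-element block, i.e. 4 + 8/16 = 4.5 bits).
BITS_PER_ELEMENT = {
    "FP16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16.0,
}


def kv_cache_elements(batch, seq_len, num_layers, num_kv_heads, head_dim):
    """Number of cached values: one key and one value vector per token,
    per layer, per KV head."""
    return 2 * batch * seq_len * num_layers * num_kv_heads * head_dim


def kv_cache_gib(batch, seq_len, num_layers, num_kv_heads, head_dim, bits):
    """KV cache size in GiB for a given effective bit width per element."""
    elements = kv_cache_elements(batch, seq_len, num_layers, num_kv_heads, head_dim)
    return elements * bits / 8 / GIB


if __name__ == "__main__":
    # Hypothetical configuration: long context, moderate batch size.
    config = dict(batch=8, seq_len=131072, num_layers=32, num_kv_heads=8, head_dim=128)
    for name, bits in BITS_PER_ELEMENT.items():
        print(f"{name}: {kv_cache_gib(**config, bits=bits):.1f} GiB")
```

For these hypothetical dimensions, the estimate comes to 128 GiB at FP16, 64 GiB at FP8, and 36 GiB at NVFP4, which is why the KV cache precision matters most when context length and batch size are both large.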