Why An AI Model Only Uses 0.34% of The GPU Compute: How GPUs Actually Work, Part 2

Towards AI

Arithmetic intensity, the roofline model, and the LLM-specific consequences of how modern GPUs are built.

An H100 SXM5 delivers 989 TFLOPS of dense FP16 tensor compute and 3.35 TB/s of HBM3 bandwidth. When that chip generates a token from a 70B-parameter model in FP16, its Tensor Cores run at roughly one third of one percent (about 0.34%) of their peak rate. The rest of the time, they wait for weights to arrive from memory. Part 1 walked through how the SIMT execution model and the six-tier memory hierarchy are built.
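To see where a figure like 0.34% can come from, here is a minimal back-of-the-envelope sketch in Python. It uses the two H100 numbers quoted above and a simple roofline bound. The rest is assumption, not something stated in the text: batch size 1, every weight read from HBM exactly once per token, 2 FLOPs (one multiply and one add) per weight, and KV-cache and activation traffic ignored. The function names `attainable_flops` and `decode_intensity` are hypothetical helpers invented for this illustration.

```python
# Back-of-the-envelope roofline estimate for single-token decode.
# Hardware figures come from the text; everything else is an assumption.

PEAK_FLOPS = 989e12       # dense FP16 tensor throughput, FLOP/s (from the text)
HBM_BANDWIDTH = 3.35e12   # HBM3 bandwidth, bytes/s (from the text)


def attainable_flops(intensity, peak=PEAK_FLOPS, bandwidth=HBM_BANDWIDTH):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak, intensity * bandwidth)


def decode_intensity(batch_size=1, bytes_per_param=2.0, flops_per_param=2.0):
    """FLOPs per byte of weights read from HBM.

    Assumes each weight is read once per decode step and contributes one
    multiply-add (2 FLOPs) per sequence in the batch. KV-cache and
    activation traffic are ignored.
    """
    return flops_per_param * batch_size / bytes_per_param


# Intensity at which the chip switches from memory-bound to compute-bound.
ridge = PEAK_FLOPS / HBM_BANDWIDTH
intensity = decode_intensity()  # FP16 weights, batch size 1
utilization = attainable_flops(intensity) / PEAK_FLOPS

print(f"ridge point:      {ridge:.0f} FLOP/byte")      # 295 FLOP/byte
print(f"decode intensity: {intensity:.1f} FLOP/byte")  # 1.0 FLOP/byte
print(f"utilization:      {utilization:.2%}")          # 0.34%
```

Under these assumptions the parameter count cancels out: a 70B model and a 7B model land on the same utilization, because both the FLOPs and the bytes scale with the number of weights. What moves the result is the ratio of FLOPs to bytes, which is why `batch_size` and `bytes_per_param` are the only model-side inputs to the sketch.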