Seeking resources to read about llama.cpp server and how offloading works

r/LocalLLaMA

SETUP INFO: AMD Radeon AI PRO R9700. Using llama.cpp server (llama-server), ROCm Docker version. Using the -ngl (--n-gpu-layers) option to offload. First of all, I'm greatly impressed by how llama.cpp server handles offloading. There's some fucking magic happening here, at least to me. I have 32 GB of VRAM, so loading the small models is no problem, but now I'm starting to experiment with models that spill into system RAM, testing tok/sec differences across various quants. I'm currently testing Qwen3 Coder Next. At Q4_K_M, this thing weighs in at 45 GB.
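For anyone wanting a starting point for -ngl, here is a back-of-the-envelope sketch in Python. It assumes every layer is roughly the same size and that a few GB of VRAM should be held back for the KV cache and compute buffers; neither assumption is exact. The function name, the 48-layer count, and the 4 GB reserve are illustrative placeholders, not values taken from llama.cpp or the model card.

```python
def estimate_gpu_layers(model_gb, n_layers, vram_gb, reserve_gb=4.0):
    """Rough guess at how many layers fit in VRAM.

    Assumes all layers are equally sized (not strictly true) and that
    reserve_gb of VRAM is kept free for KV cache and scratch buffers.
    """
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers           # average size of one layer
    usable_gb = max(vram_gb - reserve_gb, 0.0)   # VRAM left for weights
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # 45 GB model, 32 GB VRAM; the layer count of 48 is an assumption.
    layers = estimate_gpu_layers(model_gb=45, n_layers=48, vram_gb=32)
    print(f"try: -ngl {layers}")
```

The result is only a first guess. In practice you would nudge the number up or down while watching VRAM usage and tok/sec, since context length and quant type change how much headroom is actually needed.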