Diffusion in prod: how are you handling spiky GPU load and cold starts?
r/LocalLLaMA
AI Hardware
We keep running into the same wall scaling diffusion workloads: pipelines that are fine at 100 requests fall apart at 10k. Cold starts quietly kill conversion, GPU costs compound with every model update, and multi-tenancy gets tricky fast.

Curious how others are handling this in production: are you over-provisioning, doing custom scheduling, eating the cold-start cost, something smarter? What's actually held up under real load vs. what looked good on paper?

submitted by /u/hackyroot