The bf16 grad accumulator that killed our SDXL LoRA training

About This Tutorial

TL;DR: Our SDXL LoRA fine-tune for a Photoroom product photography model trained for six days while silently corrupting its adapter weights. The cause was bf16 gradient accumulation interacting badly with a custom adapter init we'd ported from a paper. Eval scores stayed in the same range the whole time, which is why nobody noticed.

The setup

We train SDXL LoRAs for product photography categories at Photoroom. Bottles, packaged food, soft goods. Each LoRA is 192 MB.
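For readers who haven't hit this before, here is a rough sketch of why a bf16 accumulator can quietly drop gradient contributions. It is not our training code and it does not reproduce the adapter-init interaction. It only simulates bf16 rounding in pure Python, and the helper names (`to_bf16`, `accumulate`) and the toy gradient values are made up for illustration.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (illustrative helper).

    bfloat16 keeps the top 16 bits of a float32, so we round the float32 bit
    pattern to nearest-even at bit 16 and zero the low half.
    NaN and infinity are not handled; this is a sketch, not a library routine.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(grads, bf16: bool) -> float:
    """Sum micro-batch gradients, optionally rounding the running total to bf16."""
    total = 0.0
    for g in grads:
        total = total + g
        if bf16:
            total = to_bf16(total)  # the accumulator itself is stored in bf16
    return total


# Toy values (assumed): one large contribution, then many small ones.
micro_grads = [1.0] + [1e-3] * 1000

acc_bf16 = accumulate(micro_grads, bf16=True)
acc_ref = accumulate(micro_grads, bf16=False)

print("bf16 accumulator:     ", acc_bf16)
print("reference accumulator:", acc_ref)
```

Near 1.0, adjacent bf16 values are 2^-7 (about 0.0078) apart, so each 0.001 increment is less than half a step and rounds away. The bf16 total stays at exactly 1.0 while the reference sum lands at roughly 2.0. Nothing errors and nothing goes to NaN, which is part of why this kind of problem is easy to miss.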