The bf16 grad accumulator that killed our SDXL LoRA training
Dev.to Machine Learning
About This Tutorial
TL;DR: Our SDXL LoRA fine-tune for a Photoroom product photography model trained for six days while silently corrupting its adapter weights. The cause was bf16 gradient accumulation interacting badly with a custom adapter init we'd ported from a paper. Eval scores stayed in the same range the whole time, which is why nobody noticed.
The setup
We train SDXL LoRAs for product photography categories at Photoroom. Bottles, packaged food, soft goods. Each LoRA is 192 MB.
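For readers who have not seen bf16 accumulation go wrong, here is a minimal sketch of the general effect named in the TL;DR. It is not the training code from this post and does not reproduce the adapter-init interaction. It only shows, under assumed toy numbers, how a bf16 accumulator can swallow small addends. The helpers `to_bf16` and `accumulate` are hypothetical names written for this illustration, and bf16 is emulated with the standard library by rounding float32 bit patterns, so no GPU or framework is needed.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even).

    Illustration only: NaN and infinity are not handled.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits. Adding 0x7FFF plus the lowest kept bit
    # rounds to nearest, with ties going to the even mantissa.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(start, grads, round_fn):
    """Sum grads into an accumulator, applying round_fn after every add."""
    acc = round_fn(start)
    for g in grads:
        acc = round_fn(acc + round_fn(g))
    return acc


# Assumed toy values: an accumulator sitting at 1.0 and 16 micro-batch
# contributions of 1e-3 each. Near 1.0, adjacent bf16 values are 2**-7
# apart (about 0.0078), so each 1e-3 add rounds straight back to 1.0.
micro_grads = [1e-3] * 16

bf16_total = accumulate(1.0, micro_grads, to_bf16)
ref_total = accumulate(1.0, micro_grads, lambda v: v)  # full-precision reference

print("bf16 accumulator:     ", bf16_total)
print("reference accumulator:", ref_total)
```

In this toy case the bf16 accumulator never moves off 1.0, while the reference sum picks up all sixteen contributions. Nothing raises an error and the value stays plausible, which is the kind of silent drift that an eval score in a stable range will not flag.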