Our retry loop made an outage worse. The circuit breaker stopped the cascade.

Dev.to AI
Generative AI

A few weeks back there was a 22-minute window where Anthropic returned a high rate of 5xx responses. Not a full outage. Degraded. Our agent service had a retry policy that backed off and tried again on 5xx. With six workers and a shared retry budget that I had set too high, we were re-issuing failed calls about as fast as the API was returning errors. When the API recovered, we had a backlog of in-flight retries that pushed us straight back into rate limiting.