What happens if we stop retrying?

The retry is the most confident line of code we write. Think about what it claims: the same request, sent again, will produce a different outcome. Sometimes that’s true — a dropped packet, a node mid-restart. But we don’t retry because we’ve established that. We retry because it’s easy and it usually looks like it works.

So, the thought experiment: what happens if we stop?

Take a service that’s slow because it’s overloaded. Callers time out and retry. Each retry is a brand-new request the service must also fail, which makes it slower, which causes more timeouts, which causes more retries. We have a name for this — a retry storm — and yet we keep writing the loop, because each individual retry looks reasonable. It’s the traffic jam problem: nobody thinks they’re the traffic.

Growing up in South Africa you learn this at the four-way stop when the robots are out — that’s traffic lights, for the rest of you. When power dies and every intersection loses its signals, the system doesn’t collapse. Everyone downgrades to the four-way-stop protocol: slower, more careful, strictly take-your-turn. Throughput drops; the system survives. Nobody responds to congestion by driving at the intersection harder.

Our retry loops do exactly that. Failure makes them more aggressive, not less. So if the retry claims “this will work next time,” maybe the honest version is: under what conditions is that claim true? Transient fault, yes. Overload, no — and worse than no.

Which suggests the answer to the experiment isn’t “never retry.” It’s that a retry is a bet, and bets need conditions: budgets, backoff with jitter, circuit breakers — all ways of encoding when the bet is off. Remove the retries and you’re forced to see the failure for what it is. Keep them unconditionally and you’ve built a machine for hiding the question.

I’d rather see the failure. It’s usually trying to tell me something.