We have improved how Fly Proxy load balances requests when the region closest to a request is unhealthy. Typically, requests have a strong affinity to where they land: if you arrive at our edge in ord and you have machines in ord, those are the first port of call for handling your traffic. Previously, we tried the local region more or less unconditionally, on the premise that trying the edge-local region is very cheap. It’s not free, though, and in some cases failing to load balance locally can slow the request down enough for our retry storm protection to kick in. One way to hit this was to cordon all machines in a region: every request landing there would experience backpressure while the proxy tried locally before looking elsewhere.
We’ve handled this well for remote regions for a while. If a request fails to be served from a region, we mark that region as unhealthy for your app and route traffic elsewhere. Now the local region gets the same treatment. When local load balancing fails, whether due to health, cordoning, or anything else, we track it as unhealthy and most subsequent requests skip straight to multihop, heading out to a healthy region immediately. Successful responses mark the region healthy again, so recovery is automatic.
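To make the behavior described above concrete, here is a minimal sketch of per-app region health tracking. It is an illustration, not Fly Proxy's actual code: the names `RegionHealth`, `record`, `should_try`, and `pick_region` are hypothetical, and the `probe_every` rule (let one request in N try an unhealthy region) is an assumption standing in for however the proxy decides which requests still try locally.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RegionHealth:
    """Per-app record of which regions recently failed to serve a request."""

    # Assumption: 1 in `probe_every` requests still tries an unhealthy region.
    probe_every: int = 10
    unhealthy: set[str] = field(default_factory=set)
    skipped: dict[str, int] = field(default_factory=dict)

    def record(self, region: str, ok: bool) -> None:
        # A success marks the region healthy again; a failure marks it unhealthy.
        if ok:
            self.unhealthy.discard(region)
            self.skipped.pop(region, None)
        else:
            self.unhealthy.add(region)

    def should_try(self, region: str) -> bool:
        if region not in self.unhealthy:
            return True
        # Let an occasional request through so a recovered region gets noticed.
        count = self.skipped.get(region, 0) + 1
        if count >= self.probe_every:
            self.skipped[region] = 0
            return True
        self.skipped[region] = count
        return False


def pick_region(health: RegionHealth, local: str, remotes: list[str]) -> str | None:
    """Prefer the edge-local region, else the first remote region worth trying."""
    if health.should_try(local):
        return local
    for region in remotes:
        if health.should_try(region):
            return region
    return None
```

In this sketch, once a failure is recorded for the local region, `pick_region` returns a remote region for most calls, which corresponds to skipping straight to multihop. The occasional probe gives the local region a chance to serve a request successfully, and the next `record(region, ok=True)` restores normal local-first routing.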