After using fly.io for several applications, all of a sudden one of our services won’t become available via HTTP after a deployment, even though the logs suggest it’s running.
When requesting any URL we get a 503 error after a few seconds, and the following gets logged in the instance logs:
[PR04] could not find a good candidate within 20 attempts at load balancing
The instance seems to get reset afterwards.
When we SSH into the machine and use wget to load https://localhost:8080, we get the response we’re expecting.
We haven’t changed the configuration within the app or in our fly.toml file (except for trying to get it to work again). We’ve already tried redeploying the settings, and both the app settings (it’s an ASP.NET Core application) and the .toml settings are exactly the same as for other applications that still work.
Could this be something on Fly’s side, and if so, how can we confirm this? We’re apprehensive about deleting the whole app and starting again, because quite a few third parties refer to our current URL.
This error means there are no healthy app instances to route requests to. In general, this could be because your Machines are failing their health checks or because they’ve each reached the hard limit of concurrent requests (or connections) set in fly.toml.
I poked at your app and noticed you had a health check in fly.toml and then removed it. Despite this, I still see old health check data persisted on our end, and your checks were last in a critical state before they were removed.
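To tell those two cases apart, it can help to look at what the platform currently reports for the app. The following is only a rough sketch, assuming flyctl is installed and authenticated; the function name `inspect_health`, the app name `my-app`, and the `FLY` override variable are placeholders made up for illustration.

```shell
# FLY is a hypothetical override so the commands can be dry-run (FLY=echo).
FLY="${FLY:-fly}"

# inspect_health APP: print Machine state, health checks, and the deployed config.
inspect_health() {
    app="$1"
    # Machine states for the app
    "$FLY" status --app "$app" &&
    # Health checks as currently reported for the app
    "$FLY" checks list --app "$app" &&
    # Deployed configuration, including any concurrency limits and checks
    "$FLY" config show --app "$app"
}

# Usage (placeholder app name):
#   inspect_health my-app
```

If checks show up here that are no longer in your fly.toml, that would be consistent with stale check state rather than a problem in the app itself.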
What I think is happening here is a bug: our service discovery system is still holding on to your app’s once-failing health checks. This stale state would then cause Fly Proxy to view your app as unhealthy and stop routing requests to it.
The stale state should only apply to existing Machines. New ones should be created with a clean health check slate.
Try cloning a new Machine with fly machine clone <id>, wait for it to start, and then ping your app again. If your new Machine’s responding to requests, you can destroy the old (possibly stale) Machines using fly machine destroy <id>.
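The steps above could be sketched roughly as follows. This is an illustration, not an official procedure: the function names, the `FLY` override variable, and the placeholder app name and Machine ID are assumptions, and only the `fly machine` subcommands themselves come from flyctl.

```shell
# FLY is a hypothetical override so the commands can be dry-run (FLY=echo).
FLY="${FLY:-fly}"

# clone_machine APP OLD_ID: list the app's Machines, then clone the given one.
clone_machine() {
    app="$1"
    old_id="$2"
    "$FLY" machine list --app "$app" &&
    "$FLY" machine clone "$old_id" --app "$app"
}

# destroy_machine APP OLD_ID: remove the old Machine.
# Only run this after the clone has started and is answering requests.
destroy_machine() {
    app="$1"
    old_id="$2"
    "$FLY" machine destroy "$old_id" --app "$app"
}

# Usage (placeholder values):
#   clone_machine my-app <id>
#   ...wait for the clone to start, then request your app's URL...
#   destroy_machine my-app <id>
```

Keeping the destroy step in its own function is deliberate: the old Machine should stay around until you’ve confirmed the new one is serving traffic. If the old Machine is still running, flyctl may require you to stop it before it can be destroyed.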