Target machine starts successfully and serves health checks (healthcheck with 200)
source machine gets 502 after ~35-36s
Target machine logs at same time show:
[PR03] could not find a good candidate within 40 attempts at load balancing
last error: [PU03] unreachable worker host. the host may be unhealthy. this is a Fly issue.
I’m fairly lost at this point
Has anyone seen similar behavior, and what diagnostics/checklist would you recommend to isolate whether this is service config, process-group/routing setup, or platform-side networking/proxy behavior?
Our statuspage only shows platform-wide issues (or issues affecting at least multiple nodes), this was an issue specific to one single pair of hosts. I do agree it should have still generated a warning in your app’s dashboard (we have something specifically for this kind of single-host failures), but that wasn’t triggered this time, which is a problem I am going to fix. Apologies for the inconvenience.