Our app just went down suddenly with the following error:
last error: unreachable worker host. the host may be unhealthy. this is a Fly issue.
Is there anything we should do to recover from this, or do we just wait for Fly to fix the issue on their side?
I’m having the same issue on my Ruby Sinatra app. This is what I have in the logs:
2024-01-23T11:40:16.795 proxy[91857932a1e238] cdg [error] could not find a good candidate within 90 attempts at load balancing. last error: could not complete HTTP request to instance: operation was canceled: request has been canceled
2024-01-23T11:40:19.782 proxy[91857932a1e238] cdg [error] could not find a good candidate within 90 attempts at load balancing. last error: unreachable worker host. the host may be unhealthy. this is a Fly issue.
Tried restarting my machine but nothing seems to work. Is there an outage going on?
I started getting errors from my Rails app connecting to PSQL (all on Fly) around 20 minutes ago (11:30). Monitoring suggests the PSQL instance isn’t accepting connections, and I can’t proxy from my shell to the PSQL instance over Fly either.
I’m also getting lots of errors on the Fly dashboard: monitoring disconnecting, unable to retrieve machine details, and “Failed to establish connection to NATS server”.
Everything is hosted in the LHR region.
Update: this is what I’m getting in my Rails logs when it tries to start:
2024-01-23T12:02:32Z app[9080010c614158] lhr [info]PG::ConnectionBad: connection to server at "fdaa:2:5f84:0:1::2", port 5432 failed: server closed the connection unexpectedly
2024-01-23T12:02:32Z app[9080010c614158] lhr [info] This probably means the server terminated abnormally
2024-01-23T12:02:32Z app[9080010c614158] lhr [info] before or while processing the request.
Not sure if it’s related, but is anyone able to open their dashboard? I’ve been getting Error 500 for a while, but every other part of fly.io (like the docs and status) seems to be working fine.
There is definitely something up in LHR. 4 of our 6 machines there are totally unreachable and the CLI is returning 500s when trying to scale machines there.
Other regions we are running in seem fine currently though.
Yes, a few apps have been down in LHR for the past hour. Restarting the machines involved hasn’t worked. I’ve had a good several months with no problems, though I do sometimes wonder about moving out of the LHR region for what I’m doing, because it seems the touchiest.
We’re also seeing issues with LHR with 8 of our apps, across two orgs.
Seems like things are coming back now.
Are things working for other people? All our apps are still currently down.
Is there any way for us to mitigate this going forward? Why aren’t any of the fly.io status pages updated to show this downtime?
I had to run fly deploy again for the service to come back up.
Is there any way we can mitigate this other than a multi-region deployment?
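One partial measure on the application side is to retry failed connections with exponential backoff, so that short interruptions don't crash the app at boot. This is only a minimal sketch: `with_retries` and its parameters are hypothetical names, not part of Fly or any library, and it would not help while a host is fully unreachable, as in the reports above.

```ruby
# Hypothetical helper: run a block, retrying with exponential backoff
# when one of the listed error classes is raised.
#   attempts   - total number of tries before giving up
#   base_delay - seconds to wait after the first failure
#   max_delay  - upper bound on any single wait
#   on         - error classes that should trigger a retry
#   sleeper    - callable used to wait (injectable so it can be stubbed)
def with_retries(attempts: 5, base_delay: 0.5, max_delay: 8.0,
                 on: [StandardError], sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    # Out of attempts: re-raise the last error to the caller.
    raise if tries >= attempts

    # Double the wait after each failure, capped at max_delay.
    delay = [base_delay * (2**(tries - 1)), max_delay].min
    sleeper.call(delay)
    retry
  end
end
```

In a Rails or Sinatra app, the block would wrap whatever opens the database connection, with `on:` set to the error class seen in the logs (for example `PG::ConnectionBad` from the pg gem). The cap on the delay keeps the app from waiting too long between attempts once the database is reachable again.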