Our app has also been broken in Amsterdam for almost 24 hours now:
2023-09-05T16:13:27.348 runner[286560eae70de8] ams [info] machine exited with exit code 0, not restarting
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)
And trying to redeploy the app fails, too:
$ fly deploy
==> Verifying app config
Validating /Users/foo/dev/app/remix/fly.toml
Platform: machines
✓ Configuration is valid
--> Verified app config
WARN DATABASE_URL may be a potentially sensitive environment variable. Consider setting it as a secret, and removing it from the [env] section: https://fly.io/docs/reference/secrets/
==> Building image
Waiting for remote builder fly-builder-falling-smoke-8492...
WARN The running flyctl agent (v0.1.81) is older than the current flyctl (v0.1.83).
WARN The out-of-date agent will be shut down along with existing wireguard connections. The new agent will start automatically as needed.
WARN Failed to start remote builder heartbeat: failed building options: agent: failed to start
Error: failed to fetch an image or build from source: error connecting to docker: failed building options: agent: failed to start
The agent failed to start with the following error log:
A copy of this log has been saved at /Users/foo/.fly/agent-logs/339000168.log
I looked up the Machine ID from your logs and see that it’s on a host in ams that is down due to emergency maintenance. Unfortunately, apps with just one Machine running on the affected host will not be reachable until maintenance is complete. Running 2+ Machines is our recommendation to prevent app downtime in the event of host-side failures.
Let’s see if scaling up helps to bring your app back: fly scale count 2. This should start a new Machine on a different, healthy host in the region and hopefully get your app up and running again. Feel free to follow up with error logs if this doesn’t work for you as expected.
A note to others reading this: if you’re seeing similar errors and want to know whether a downed host is affecting your app, check out your personal status page (also accessible from your org’s dashboard).
I’m having the exact same issue. My app in the ams region has been completely broken for the past couple of days, as my Uptime Kuma monitoring shows. I didn’t change anything myself, and it had been running fine for months.
fly ssh console: fails with Error: error connecting to SSH server: connect tcp ... operation timed out
fly scale count 0 and then fly scale count 1: didn’t fix anything
fly scale count 2: didn’t fix anything; both Machines are now in a broken state
Any ideas? I need this to be up and running soon.
Update: I found the problem. It turns out my app is waiting for my Postgres database, but the connection between the app and the database is broken for some reason…
I still can’t SSH into my main app, though. I’ve tried fly wireguard reset, but it fails with Error: upstream service is unavailable
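Since SSH is unavailable, one way to tell "the database is unreachable" apart from "the app crashed" is to have the app log a reachability check at boot. The following is a minimal sketch under stated assumptions, not part of Fly.io or flyctl: it assumes DATABASE_URL holds a postgres:// URL, the hostname in the example is hypothetical, and it only checks that a TCP connection opens (it does not authenticate or run a query).

```python
"""Hypothetical startup probe: is the Postgres host in DATABASE_URL reachable over TCP?"""
import os
import socket
from urllib.parse import urlsplit


def db_endpoint(url: str) -> tuple[str, int]:
    """Return (host, port) from a postgres:// URL, defaulting to Postgres' port 5432."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError("DATABASE_URL has no host")
    return parts.hostname, parts.port or 5432


def db_reachable(url: str, timeout: float = 3.0) -> bool:
    """Attempt a plain TCP connect; True only if it succeeds within `timeout` seconds."""
    host, port = db_endpoint(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


if __name__ == "__main__":
    url = os.environ.get("DATABASE_URL", "")
    if url:
        host, port = db_endpoint(url)
        print(f"database {host}:{port} reachable: {db_reachable(url)}")
    else:
        print("DATABASE_URL is not set")
```

If the probe's output shows up in fly logs as unreachable while the app itself starts, the problem is on the network or database side rather than in the app.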