There is an issue with the host that one of my app’s machines is on, and it seems to have rendered the whole app useless. I’m trying to move the machine to a different host, but whatever I try I get errors from flyctl.
Error: could not start machine XYZ: failed to start VM XYZ: request returned non-2xx status, 408 (Request ID: XYZ-ams)
Error: failed to get volume: failed to get volume vol_XYZ: request returned non-2xx status, 408 (Request ID: XYZ-syd)
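When comparing errors like these across posts, it can help to pull out the status code and the Request ID. Below is a minimal Python sketch, not anything flyctl provides: parse_fly_error and FlyError are hypothetical names, and treating the part after the last hyphen of the Request ID as a region code (ams, syd) is only an assumption based on the lines quoted in this thread.

```python
import re
from typing import NamedTuple, Optional


class FlyError(NamedTuple):
    status: int            # HTTP status reported by flyctl, e.g. 408 or 504
    request_id: str        # full Request ID, empty if the line has none
    region: Optional[str]  # assumed region suffix of the Request ID


# Matches "non-2xx status, 408" with an optional "(Request ID: ...)" part.
_PATTERN = re.compile(
    r"non-2xx status, (?P<status>\d{3})(?: \(Request ID: (?P<rid>[^)]+)\))?"
)


def parse_fly_error(line: str) -> Optional[FlyError]:
    """Return the status and Request ID from an error line, or None."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    request_id = match.group("rid") or ""
    # Assumption: the text after the last hyphen is a region code.
    region = request_id.rsplit("-", 1)[1] if "-" in request_id else None
    return FlyError(int(match.group("status")), request_id, region)


if __name__ == "__main__":
    sample = (
        "Error: failed to get volume: failed to get volume vol_XYZ: "
        "request returned non-2xx status, 408 (Request ID: XYZ-syd)"
    )
    print(parse_fly_error(sample))
```

Lines without a status code, such as the generic "Oops, something went wrong!" message, come back as None.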
On fly scale count I get:
Oops, something went wrong! Could you try that again?
Yeah, there’s --verbose, but it doesn’t show anything extra. I don’t have hope for this app to be recovered soon, so I am going through the process of restoring a backup we store off-site. And then I somehow have to recover the data written since 5:30 AM this morning.
A part of one of our apps started hitting outbound timeout errors repeatedly about 6 hours ago. When we attempt to deploy, flyctl also times out, with a 504 error (rather than a 408), when trying to spin up a release command machine.
So we’re getting outbound timeouts within our application and from the fly infrastructure when deploying.
I don’t think it’s the same issue that you’re having, but it has been consistent for us for the last 6 hours. We can’t redeploy the app as a result.
I’m also blocked since my postgres app’s host is down. I can’t detach the db from my main app because of "Error: no 6pn ips founds for [db] app"; I can’t restart or stop it because of "Oops, something went wrong! Could you try that again?" or a 408 timeout; and I can’t attach a new postgres instance because of "Error: no active leader found".
Not sure what to do here, I’ll probably end up recreating my main app as well since I can’t find any way to attach the new postgres instance (tried removing the DATABASE_URL but no luck).
Yep, ams.
Now my cloned postgres snapshot from 24 hours ago doesn’t accept connections: 500 Internal Server Error failed to connect to repmgr node: failed to connect to host=[ip] user=repmgr database=repmgr: server error (FATAL: database "repmgr" does not exist (SQLSTATE 3D000))
edit: This likely happened because when I cloned it, it picked the latest postgres image version, but I have no way of finding out the previous one (it shows N/A) and I can’t access a psql shell to attempt fixing it manually; I’ll just delete both apps and reset everything with a local backup …
For your situation, it looks like your volume is unreachable along with the app, so you could try the steps in the "Backup, Restores, & Snapshots" guide in the Fly Docs. These snapshots seem to be stored offsite (or at least not on the same host), but in my case they seem to be broken for some reason (even though the data is there).
final edit: I’m finally back up by creating a new empty postgres app and attaching it, then loading the pg_dump I did locally, which ignored the image version mismatch or whatever that was and got the data back up.
Yes, by now I have recovered my DB from a backup, and it works again.
But there’s still another app, without any volumes. And fly scale count 0 just gives an error. It’s impossible to get a healthy machine in that app again…
I’ve been having this issue since yesterday, glad it’s not just me.
I only see it on two of my fly-hosted applications and not on another.
I submitted a support ticket 15 hours ago but haven’t heard back.
At least the applications seem to be running OK. I just can’t deploy; mostly I get this error:
Failed to update machines: failed to update machine ...: failed to update VM ...: request returned non-2xx status, 504 Retrying...
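For scripted deploys that hit 408 or 504 only occasionally, one option is to retry with exponential backoff. The following is a rough Python sketch under stated assumptions: retry_transient is a hypothetical helper, and action stands for whatever callable you supply that runs the operation and returns an HTTP-style status code. It is not part of flyctl, and it would not have helped with the hours-long host outages described in this thread.

```python
import time
from typing import Callable

# Status codes treated as transient here; both appear in this thread.
TRANSIENT_STATUSES = {408, 504}


def retry_transient(
    action: Callable[[], int],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call action until it returns a non-transient status or attempts run out.

    action is a hypothetical callable returning an HTTP-style status code.
    The wait doubles after every failed try: base_delay, 2x, 4x, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status = action()
    for attempt in range(1, attempts):
        if status not in TRANSIENT_STATUSES:
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status = action()
    return status


if __name__ == "__main__":
    # Simulated run: two timeouts, then success. No real requests are made.
    outcomes = iter([504, 408, 200])
    print(retry_transient(lambda: next(outcomes), base_delay=0.1))
```

The last status is returned either way, so the caller can tell a success from a timeout that persisted through every attempt.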
I know that adding a post to say I also have this issue doesn’t help much, but it would be really nice if somebody from Fly could look into this. I have 2 hosts that are down, and none of the normal recovery steps seem to be helping.