Anyone else having their apps go down?

There is an issue with the host that one of my app’s machines is on, and it seems to have rendered the whole app useless. I’m trying to move the machine to a different host, but whatever I try I get errors in flyctl.

Error: could not start machine XYZ: failed to start VM XYZ: request returned non-2xx status, 408 (Request ID: XYZ-ams)
Error: failed to get volume: failed to get volume vol_XYZ: request returned non-2xx status, 408 (Request ID: XYZ-syd)

On fly scale count I get:

Oops, something went wrong! Could you try that again?
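
For anyone comparing notes, here is a rough sketch of pulling the status code and request ID out of error lines like the ones above, so they are easier to group when reporting to support. The names `parse_fly_error` and `FlyError` are made up for illustration, and treating the text after the last "-" in the request ID (for example "ams" or "syd") as a region hint is only an assumption, not documented flyctl behavior.

```python
import re
from typing import NamedTuple, Optional

# Matches the tail of the flyctl errors quoted in this thread, e.g.
# "request returned non-2xx status, 408 (Request ID: XYZ-ams)".
# The "(Request ID: ...)" part is optional because some quoted errors omit it.
_ERR = re.compile(r"non-2xx status, (\d{3})(?: \(Request ID: ([^)]+)\))?")


class FlyError(NamedTuple):  # hypothetical helper type
    status: int
    request_id: Optional[str]
    region_hint: Optional[str]  # ASSUMPTION: suffix after the last "-" of the ID


def parse_fly_error(line: str) -> Optional[FlyError]:
    """Return the status and request ID found in an error line, or None."""
    match = _ERR.search(line)
    if match is None:
        return None
    request_id = match.group(2)
    region_hint = None
    if request_id and "-" in request_id:
        region_hint = request_id.rsplit("-", 1)[1]
    return FlyError(int(match.group(1)), request_id, region_hint)
```

Lines without a status code, such as the "Oops, something went wrong!" message, simply return `None`.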

The steps mentioned here don’t work.

Not currently. … The test app https://debug.fly.dev/ seems to be working.

Is there a verbose flag for that command that would give more detail, perhaps?

Yea there’s --verbose but it doesn’t show anything extra. I don’t have hope for this app to be recovered soon so I am going through the process of recovering a backup we store off-site. And then somehow gotta recover the data since 5:30AM this morning.

Part of one of our apps started hitting outbound timeout errors repeatedly about 6 hours ago. When we attempt to deploy, flyctl also times out, with a 504 error (rather than a 408), when trying to spin up a release command machine.

So we’re getting outbound timeouts within our application and from the fly infrastructure when deploying.

I don’t think it’s the same issue that you’re having, but it has been consistent for us for the last 6 hours. We can’t redeploy the app as a result.

I’m also blocked since my postgres app’s host is down, and I can’t even detach the db from my main app because of Error: no 6pn ips founds for [db] app; can’t restart/stop it because Oops, something went wrong! Could you try that again? or 408 timeout; can’t attach a new postgres instance because Error: no active leader found.

Not sure what to do here, I’ll probably end up recreating my main app as well since I can’t find any way to attach the new postgres instance (tried removing the DATABASE_URL but no luck).

In ams by any chance?

I still have a down app. Support asked for logs 5 hours ago and I haven’t heard anything since.

It seems I am having issues deploying as well.

update failed: failed to update VM xxx: request returned non-2xx status, 504

Looking at the logs, it seems the app is running fine on the port.

Yep, ams.
Now my cloned postgres snapshot from 24 hours ago doesn’t accept connections:

500 Internal Server Error failed to connect to repmgr node: failed to connect to host=[ip] user=repmgr database=repmgr: server error (FATAL: database "repmgr" does not exist (SQLSTATE 3D000))

edit: This likely happened because when I cloned it it picked the latest postgres image version, but I have no way of finding out the previous one (shows N/A) and I can’t access a psql shell to attempt manually fixing it; I’ll just delete both apps and reset everything with a local backup …

For your situation, it looks like your volume isn’t reachable, along with the app itself, so you could try the steps in this guide: Backup, Restores, & Snapshots · Fly Docs. These snapshots seem to be stored off-site (or at least not on the same host), but in my case they seem to be broken for some reason, even though the data is there.

final edit: I’m finally back up by creating a new empty postgres app and attaching it, then loading the pg_dump I did locally, which ignored the image version mismatch or whatever that was and got the data back up.

Yes, by now I have recovered my DB from a backup, and it works again.

But there’s still another app, without any volumes. And fly scale count 0 just gives an error. It’s impossible to get a healthy machine in that app again…

Ok I just ran fly deploy again and it seems like I got a healthy machine again. Jesus finally

Yea seriously frustrated. I can’t do anything!!

$ fly scale count 2
Oops, something went wrong! Could you try that again?

Can the fly team do something here?

I’ve been having this issue since yesterday, glad it’s not just me.
I only see it on two of my fly-hosted applications and not on another.
I submitted a support ticket 15 hours ago but haven’t heard back.

At least the applications seem to be running ok, I just can’t deploy, mostly this error:

Failed to update machines: failed to update machine ...: failed to update VM ...: request returned non-2xx status, 504 Retrying...
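
Since re-running the deploy eventually worked for at least one person in this thread, a minimal sketch of retrying a command with exponential backoff might look like the following. It retries only when the output contains the 408/504 text quoted in this thread. The function name `run_with_retry`, the marker strings, and the default delays are all assumptions for illustration, not anything flyctl itself provides.

```python
import subprocess
import time

# ASSUMPTION: these substrings identify the timeouts quoted in this thread.
TRANSIENT_MARKERS = ("non-2xx status, 408", "non-2xx status, 504")


def _run(cmd):
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def run_with_retry(cmd, attempts=5, base_delay=5.0, runner=_run, sleep=time.sleep):
    """Retry cmd on timeout-looking failures; return True once it succeeds."""
    for attempt in range(attempts):
        code, output = runner(cmd)
        if code == 0:
            return True
        if not any(marker in output for marker in TRANSIENT_MARKERS):
            return False  # not a timeout, so retrying is unlikely to help
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 5s, 10s, 20s, ... by default
    return False
```

A call such as `run_with_retry(["fly", "deploy"])` would then retry the deploy a few times before giving up. This does nothing for a host that is actually down; it only saves re-typing the command while the API is timing out.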

We’re seeing the same issue but only for the regions IAD/ORD. A deployment on FRA went through fine.

We’re awaiting a support response as well.

I know that adding a post and stating that I also have this issue is not helping much, but it would be pretty nice if somebody from Fly could look into this, I have 2 hosts that are down, and none of the normal recovery steps seem to be helping.

I am curious whether anyone from Fly looks at this. Would they care to respond if they did?

request returned non-2xx status, 504 here too

I wonder if this is why? The API timing out could well be causing those response codes:

This has been marked as resolved but I’m still unable to edit machines.