Anyone else having their apps go down?

Yaeger · September 8, 2024, 9:31pm

There is an issue with the host that one of my app's machines is on, and it seems to have rendered the whole app useless. I'm trying to move the machine to a different host, but whatever I try, flyctl returns errors:

Error: could not start machine XYZ: failed to start VM XYZ: request returned non-2xx status, 408 (Request ID: XYZ-ams)

Error: failed to get volume: failed to get volume vol_XYZ: request returned non-2xx status, 408 (Request ID: XYZ-syd)

Running fly scale count, I get:

Oops, something went wrong! Could you try that again?

Yaeger · September 8, 2024, 9:39pm

The steps mentioned here don’t work.

greg · September 8, 2024, 10:03pm

Not currently. … The test app https://debug.fly.dev/ seems to be working.

Is there a verbose flag for that command that would give more detail, perhaps?

Yaeger · September 8, 2024, 10:07pm

Yeah, there's --verbose, but it doesn't show anything extra. I don't expect this app to recover soon, so I'm restoring from a backup we store off-site. Then I somehow have to recover the data written since 5:30 AM this morning.

ctcb · September 9, 2024, 5:55am

Part of one of our apps started hitting outbound timeout errors repeatedly about 6 hours ago. When we attempt to deploy, flyctl also times out, with a 504 error (rather than a 408), when trying to spin up a release command machine.

So we’re getting outbound timeouts within our application and from the fly infrastructure when deploying.

I don’t think it’s the same issue that you’re having, but it has been consistent for us for the last 6 hours. We can’t redeploy the app as a result.

arijan · September 9, 2024, 7:23am

I'm also blocked since my Postgres app's host is down. I can't detach the DB from my main app because of Error: no 6pn ips founds for [db] app; I can't restart or stop it because of Oops, something went wrong! Could you try that again? or a 408 timeout; and I can't attach a new Postgres instance because of Error: no active leader found.

Not sure what to do here. I'll probably end up recreating my main app as well, since I can't find any way to attach the new Postgres instance (I tried removing DATABASE_URL, but no luck).

Yaeger · September 9, 2024, 7:26am

In ams by any chance?

Yaeger · September 9, 2024, 7:31am

I still have an app down. Support asked for logs 5 hours ago, and I haven't heard anything since.

rabin · September 9, 2024, 7:34am

It seems I'm having issues deploying as well:

update failed: failed to update VM xxx: request returned non-2xx status, 504

Looking at the logs, it seems the app is running fine on the port.

arijan · September 9, 2024, 7:35am

Yep, ams.
Now my cloned postgres snapshot from 24 hours ago doesn’t accept connections: 500 Internal Server Error failed to connect to repmgr node: failed to connect to host=[ip] user=repmgr database=repmgr: server error (FATAL: database "repmgr" does not exist (SQLSTATE 3D000))

edit: This likely happened because when I cloned it, it picked the latest Postgres image version, but I have no way of finding out the previous one (it shows N/A), and I can't access a psql shell to attempt fixing it manually. I'll just delete both apps and reset everything from a local backup …

For your situation, it looks like your volume is unreachable along with the rest of the app, so you could try the steps in this guide: Backup, Restores, & Snapshots · Fly Docs. These snapshots seem to be stored off-site (at least not on the same host), but in my case they seem to be broken for some reason (though the data is there).

final edit: I'm finally back up. I created a new, empty Postgres app, attached it, and then loaded the pg_dump I had made locally, which sidestepped the image version mismatch (or whatever that was) and got the data back.
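The recovery path in the final edit above can be sketched as a short command plan. This is a minimal illustration under stated assumptions, not the exact commands that were run: the app names (myapp, myapp-db-new), the dump filename, and the assumption that the local dump is a plain-SQL file loaded with psql are all hypothetical.

```python
from __future__ import annotations

import shlex


def restore_plan(main_app: str, new_db_app: str, dump_file: str,
                 local_port: int = 15432) -> list[list[str]]:
    """Return the commands for: create empty Postgres app, attach, load dump.

    main_app, new_db_app and dump_file are hypothetical placeholders.
    """
    return [
        # 1. Create a fresh, empty Postgres app (flyctl prompts for the rest
        #    of the configuration and prints the postgres password).
        ["fly", "postgres", "create", "--name", new_db_app],
        # 2. Attach it to the main app so the app gets a connection string.
        ["fly", "postgres", "attach", new_db_app, "--app", main_app],
        # 3. Forward a local port to the new cluster. This command blocks,
        #    so it has to keep running in a separate terminal during step 4.
        ["fly", "proxy", f"{local_port}:5432", "--app", new_db_app],
        # 4. Load the locally made dump through the proxy. Assumes a
        #    plain-SQL dump; psql prompts for the password from step 1.
        ["psql", "-h", "localhost", "-p", str(local_port),
         "-U", "postgres", "-f", dump_file],
    ]


if __name__ == "__main__":
    # Print the plan instead of executing it, so it can be reviewed first.
    for argv in restore_plan("myapp", "myapp-db-new", "backup.sql"):
        print(shlex.join(argv))
```

Printing rather than executing is deliberate: step 3 runs until interrupted, so running the four commands back to back in one process would hang at the proxy.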

Yaeger · September 9, 2024, 7:37am

Yes, by now I've recovered my DB from a backup, and it works again.

But there’s still another app, without any volumes. And fly scale count 0 just gives an error. It’s impossible to get a healthy machine in that app again…

Yaeger · September 9, 2024, 7:38am

OK, I just ran fly deploy again and it seems I have a healthy machine again. Jesus, finally.

madhu · September 9, 2024, 9:11am

Yeah, seriously frustrated. I can't do anything!!

$ fly scale count 2
Oops, something went wrong! Could you try that again?

Can the fly team do something here?

ejschmitt · September 9, 2024, 9:33am

I’ve been having this issue since yesterday, glad it’s not just me.
I only see it on two of my fly-hosted applications and not on another.
I submitted a support ticket 15 hours ago but haven’t heard back.

At least the applications seem to be running OK; I just can't deploy. I mostly get this error:

Failed to update machines: failed to update machine ...: failed to update VM ...: request returned non-2xx status, 504 Retrying...

S3bb1 · September 9, 2024, 10:51am

We’re seeing the same issue but only for the regions IAD/ORD. A deployment on FRA went through fine.

We’re awaiting a support response as well.

le_simon · September 9, 2024, 11:30am

I know that adding a post just to say I also have this issue doesn't help much, but it would be nice if somebody from Fly could look into this. I have 2 hosts that are down, and none of the normal recovery steps seem to help.

madhu · September 9, 2024, 12:04pm

I'm curious: does anyone from Fly look at this? Would they care to respond if they did?

Mrka · September 9, 2024, 12:20pm

request returned non-2xx status, 504 here too

greg · September 9, 2024, 3:35pm

I wonder if this is why? The API timing out could well be causing those response codes:

duld · September 9, 2024, 3:47pm

This has been marked as resolved but I’m still unable to edit machines.
