Deployment failing: failed to list VMs even after retries: request returned non-2xx status, 500

failed to list VMs even after retries: request returned non-2xx status, 500

In the dashboard, the builder is failing to list as well. Deployments have been super flaky, and it’s getting annoying now and hard to justify running production workloads with guaranteed uptimes for clients.

I also have the same error on some of my apps.

I even sent an email to the support address 2 hours ago and still have no response from there either. What’s the SLA for email support?

We’ve also been blocked from deploying for several hours now. I like Fly, but I can’t justify continuing to use it with issues like this recurring every few weeks.

We’ve mitigated the issue. Are you still seeing the 500 errors?

Thanks for resolving it, but IMHO things have been fairly unreliable, and it’s impacting enough business metrics that it’s hard to justify fly.io as a preferred platform from a business standpoint.

I am seeing this in DFW right now

I also had this issue in DFW up until a minute ago

The status page said the issue had been resolved, but it had not been for me.

Fly might need better health checks. It is disconcerting when a feature doesn’t work while the status page says it does.

Resolution should not mean “a fix is propagating.” Resolution should mean the feature works again.
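
To make that concrete, here is a minimal sketch of a check that only reports "resolved" after several consecutive successful probes. Everything in it is hypothetical: `is_resolved`, `probe`, and the thresholds are names made up for illustration, and nothing here reflects how Fly's status page actually works.

```python
from typing import Callable


def is_resolved(
    probe: Callable[[], bool],
    required_successes: int = 5,
    max_attempts: int = 20,
) -> bool:
    """Return True only after `required_successes` consecutive passing probes.

    `probe` is a hypothetical callable that exercises the real feature
    (for example, listing VMs) and returns True on success.
    """
    streak = 0
    for _ in range(max_attempts):
        if probe():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            # Any failure resets the streak, so an intermittent 500
            # keeps the incident open.
            streak = 0
    return False
```

A real check would also space the probes out over time and run them from more than one region, since an intermittent failure can pass a single probe.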

We have been seeing this issue for the past few hours as well.

Error: failed to list VMs even after retries: request returned non-2xx status, 500 (Request ID: 01HWTDERD5EHX18Y0NCJZTC2E7-den) (Trace ID: f6148a63c193a46b7864bc11fdb324d4)

Currently, it is preventing us from making new deploys.
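
For anyone collecting these errors to report them, a small sketch for pulling the useful fields out of a message like the one above. The regex and the `parse_fly_error` helper are my own illustration, not part of flyctl, and treating the `-den` suffix on the Request ID as a region code is an assumption based on how the IDs look in this thread.

```python
import re
from typing import Optional

# Pattern written against the error text quoted in this thread.
# Assumption: the Request ID ends in "-<three-letter region>".
ERROR_RE = re.compile(
    r"status, (?P<status>\d{3})"
    r"(?: \(Request ID: (?P<request_id>[0-9A-Z]+)-(?P<region>[a-z]{3})\))?"
    r"(?: \(Trace ID: (?P<trace_id>[0-9a-f]+)\))?"
)


def parse_fly_error(line: str) -> Optional[dict]:
    """Extract status, request ID, region suffix, and trace ID if present."""
    match = ERROR_RE.search(line)
    return match.groupdict() if match else None


line = (
    "Error: failed to list VMs even after retries: request returned "
    "non-2xx status, 500 (Request ID: 01HWTDERD5EHX18Y0NCJZTC2E7-den) "
    "(Trace ID: f6148a63c193a46b7864bc11fdb324d4)"
)
info = parse_fly_error(line)
```

Fields that are missing from a shorter message (one with no Request ID, say) come back as `None`.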

In fact, I also can’t provision anything new in DFW, DEN, or ORD. The status page says all is resolved, though.

I ran ‘fly launch’ without a fly.toml, customized in the web UI, and then immediately got:

Waiting for launch data... Done
Error: request returned non-2xx status

That leaves a new app squatting on the namespace in the ‘pending’ state, without having provisioned the database yet.

Going to try fly apps rm name, fly launch again and see if this whole flow is consistently blocked.

This time, fly launch progressed a little further, but then the same issue tanked the provisioning of the database:

Failed creating the Postgres cluster appname-db: 
failed to create volume: failed to create volume: 
request returned non-2xx status, 500

This time, it left ‘pending’ apps for both ‘appname’ and ‘appname-db’ that I get to clean up. Will try again …

Update: tried again. Exactly the same as above; fly launch doesn’t hit a ‘500’ immediately, but twice in a row now it has hit one when attempting to create a volume for the database.

We tracked down and fixed an API issue affecting a single host in the den region, which was causing 500 errors to be intermittently returned from Machines API requests routed through this region, between 2024-04-30T22:49:00Z→2024-05-01T18:26:00Z. This has been fixed so the API requests should no longer be intermittently failing.
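
If you want to check whether a failed deploy of yours fell inside that window, here is a small sketch. The two timestamps are the ones quoted above; the helper name `in_incident_window` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Incident window as stated above (UTC).
WINDOW_START = datetime(2024, 4, 30, 22, 49, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 5, 1, 18, 26, tzinfo=timezone.utc)


def in_incident_window(ts: datetime) -> bool:
    """True if a timezone-aware timestamp falls inside the stated window.

    Passing a naive datetime raises TypeError, so attach a timezone first.
    """
    return WINDOW_START <= ts <= WINDOW_END


# Total length of the window: 19 hours 37 minutes.
duration: timedelta = WINDOW_END - WINDOW_START
```

Failures with timestamps outside the window would point to a different cause than the one described here.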

This is great. It was a bit hectic yesterday, so when I came back to this tab this morning, I found an unposted draft mentioning that I was going to VPN into another region and try the CLI from there! That did work :laughing: Yeah, I noticed DEN kept showing up in responses from time to time and figured that might be why.

Thanks for transparency y’all, it’s much appreciated :heart: :sweat_smile: