Fly.io machine is down again - another incident?

Unfortunately, this week seems to be quite unstable for Fly.io.
First, I experienced the incident from Oct 23 and now one of my three machines seems to be down.

I can’t deploy, scale or do anything with the unavailable machine.
For the last hour I’ve been getting this in the logs:

could not find a good candidate within 21 attempts at load balancing. last error: [PU02] could not complete HTTP request to instance: operation was canceled: request has been canceled

Am I the only lucky one here? :frowning_face: The Fly.io status page isn’t showing anything so far.

The machine is slowly coming back to life without me doing anything, but deployments are still unavailable.

and the deployments are working now

Got the same thing: https://community.fly.io/t/waw-hosts-are-down-clean-status-page-cant-redeploy/22468
Which region are you affected on btw?

Apparently the same as yours, and the status page hasn’t mentioned today’s incident at all.

Same here. Unable to deploy; it fails at “Waiting for depot builder...”. I posted on X tagging them but no response yet.

Same issue here. Sitting there waiting for the depot builder. Getting a bit concerned about the large number of outages here…

I had the same issue yesterday with exactly the same error message. Then it resolved on its own (fixed by the fly.io team, I suppose).

A bit frustrating.

Agreed, it’s been nearly 24 hours without any change. My machines won’t start up and I can’t deploy updates… The status page shows no issues at all. I wonder if it’s only a small subset of users or maybe a particular region (I am in EWR mostly)? If everyone were experiencing the same issues, I can’t imagine there being so little noise around it. Has this resolved for anyone else?

Agreed, though it’s very concerning. I only recently opted to give Fly a shot over DO and right now it feels like a mistake.

Hi folks,

Yesterday a couple of hosts in waw were unavailable for a few hours due to a network switch failure (now fixed), and today a single host in ewr experienced a hardware failure. These individual-host events are not ‘incidents’ but routine occurrences. We don’t post to our global status page every time a server goes down; instead, we log issues for each of them in the personalized status page, so if you had an affected app deployed on an unlucky server, you should have been able to find relevant info there (and please let us know if you have any suggestions on how to improve this experience!).

Depending on the nature of the host failure, a host can be down for an hour, a day or sometimes longer. If you end up with an app on an unavailable host that you want to get back online more quickly, refer to “Troubleshoot apps when a host is unavailable” in the Fly Docs for some troubleshooting and recovery tips. In general, we recommend that apps with high-availability requirements run at least two Machines per app in your primary region to better defend against single-host issues like this.
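To make the “at least two Machines per region” recommendation concrete, here is a minimal sketch of a redundancy check. It is only an illustration: the record shape (`id`, `region`, `state`) and the function name `regions_lacking_redundancy` are assumptions made up for this example, not the output format of any Fly.io tool or API.

```python
from collections import Counter


def regions_lacking_redundancy(machines, min_per_region=2):
    """Return the regions that have fewer than `min_per_region` started machines.

    `machines` is a list of dicts with hypothetical keys "region" and "state";
    adapt the keys to whatever inventory data you actually have.
    """
    # Every region the app is deployed in, regardless of machine state.
    all_regions = {m["region"] for m in machines}
    # Count only machines that are currently started.
    started = Counter(m["region"] for m in machines if m["state"] == "started")
    return sorted(r for r in all_regions if started[r] < min_per_region)


# Illustrative data only: one ewr machine is stopped, so ewr has a single
# started machine and would not survive a single-host failure.
machines = [
    {"id": "m1", "region": "ewr", "state": "started"},
    {"id": "m2", "region": "ewr", "state": "stopped"},
    {"id": "m3", "region": "waw", "state": "started"},
    {"id": "m4", "region": "waw", "state": "started"},
]

print(regions_lacking_redundancy(machines))  # ['ewr']
```

A region showing up in the result means a single host failure there could take the app offline in that region.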

While I know it’s a frustrating situation to end up with a machine on an unavailable host (we’re not too fond of random host issues ourselves), I hope this info provides a bit of clarity.

Ahhh so my builder is in the affected region. Is there a way to change the region for the builder?

My machines are running fine now, but I am still unable to deploy to ewr; it gets stuck at the same “Waiting for depot builder” step. I also never saw anything listed in the personalized status page.

Thanks for the information! Would love to get this deploy working.

Edit: I actually don’t see a builder machine in my org anymore. I could have sworn I used to see one in the UI. Maybe I should open another topic for this…

This fixed it for me: “Depot builder outage (and workaround!)”

It seems that for regional outages, the personalized status page doesn’t keep a record once the incidents clear. There doesn’t seem to be a way to receive email/text alerts either. Can Fly offer better ways for us to track downtime, rather than just watching that page like a hawk?

Thank you for the explanation!

Regarding the user experience of individual-host events, it would be fantastic if there were an automated email notification with a link to the personalized status page whenever something like that occurs. I’m suggesting this because, as a user of two years, I had completely forgotten that the personalized status tab existed.
