The fly.io site is currently inaccessible.

$ until fly ssh console -a myapp --pty --machine machine-id; do sleep 1; done

:sweat_smile:
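
A bounded variant of the loop above, as a rough sketch: it caps the number of attempts and doubles the delay each time, so it does not hammer an API that is already struggling. `retry`, `RETRY_MAX`, and `RETRY_DELAY` are names made up for this example and are not part of flyctl.

```shell
#!/usr/bin/env bash
# retry CMD... : rerun a command until it succeeds, giving up after a capped
# number of attempts and doubling the pause between attempts.
# RETRY_MAX and RETRY_DELAY are hypothetical knobs for this sketch.
retry() {
  local max="${RETRY_MAX:-10}"     # maximum number of attempts
  local delay="${RETRY_DELAY:-1}"  # initial pause in seconds
  local attempt=1
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

Usage would look like `retry fly ssh console -a myapp --pty --machine machine-id`, with the same placeholder app name and machine ID as above.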

Wow this looks like it was a rough night for the team at Fly. It was shaky when I went to bed and now looking again after waking up, you’ve been working on this the whole time I’ve been asleep.

This incident ruined my evening too a little bit yesterday. I mistook the side effects of the platform incident for operational issues with my individual site and spent a good hour or so attempting various fixes before I decided to check the Fly status page. I even restored the database from a backup at one point.

Based on that experience, one little idea I’d love to contribute to the post-mortem discussions is the possibility of having the fly CLI print that same helpful “We’re addressing an incident. Please check the status page for more details” message that the dashboard shows during an incident. For many of us the CLI is the dashboard after all.

Good luck getting everything back in working order. Looks to have been a nasty one.
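
In the meantime, a local wrapper could approximate the idea. This is only a sketch under loud assumptions: `STATUS_URL` is a hypothetical variable you would have to point at an endpoint returning Statuspage-style JSON with a `status.indicator` field, and I have not confirmed that the Fly status page exposes one. `fly_checked`, `status_indicator`, and `incident_warning` are made-up names.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: warn about an ongoing incident before running fly.
# STATUS_URL is NOT a documented endpoint; set it yourself to a URL that
# returns Statuspage-style JSON such as {"status":{"indicator":"major"}}.

# Pull the "indicator" value out of JSON on stdin (crude, no jq dependency).
status_indicator() {
  sed -n 's/.*"indicator"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Map an indicator value to a warning line; print nothing when all is well
# or when the indicator could not be determined.
incident_warning() {
  case "$1" in
    ""|none) ;;
    *) echo "Status indicator is '$1'. Please check the status page for more details." ;;
  esac
}

# Run fly with the given arguments, printing a warning to stderr first if
# the status endpoint reports anything other than "none".
fly_checked() {
  if [ -n "${STATUS_URL:-}" ]; then
    local indicator
    # A failed or slow status lookup must never block the real command.
    indicator=$(curl -fsS --max-time 5 "$STATUS_URL" 2>/dev/null | status_indicator) || true
    incident_warning "$indicator" >&2
  fi
  fly "$@"
}
```

The status lookup is deliberately best-effort: if it times out, the wrapper stays silent and runs the command anyway.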

Ordinarily it should. See this PR:

But if the flyctl command itself is timing out or otherwise erroring I could see how that can fail.

I moved to fly.io 3 weeks ago, and there are so many good things about the platform, but this whole event is one of those things that is hard to come back from. The company I am consulting for had some reservations about having their infra here based on the previous outages, and it is going to be very hard to convince them otherwise.

I also cannot deploy anything with

Error: server returned a non-200 status code: 504

message

For real. Got an account maybe a couple months ago… kicked the tires. Then about 10 days ago started in earnest to build my infra here. Bad timing.

Stumbled on this a few weeks ago when I was deciding on whether or not to take the plunge. Looks like it’s still a struggle with the ‘Status paging’ bullet. And that post by the CEO was over 1.5 years ago. I’m thinking things are more beta than release.

We’re all in the dark… our infra’s been down the same way all day despite whatever words are on the update page. And now, I guess there’s at least a few like me that are just watching out of morbid curiosity, watching to see if it can hit the 24-hour mark.

Things seem to be slowly coming back online on my end. knock on wood

Same. I can refresh deployments/images.

The API does appear to be responding normally, but our FLAME runners are still fubar. Instances appear to start but are unresponsive, timing out after 30 seconds.

Hmmm. Status page says everything’s fine now. My app was still down from last night’s failed deploy so I’ve tried to redeploy it. That didn’t work, so I tried scaling it down to 0 instances and back up again. That didn’t work so I’ve scaled back down to 0 to try to get a clean slate to restore from.

Now my app is in a state where fly machine list says No machines are available on this app. Meanwhile if I run fly logs I can see my application’s usual background noise log chatter running actively on a machine ID that fly machine status claims doesn’t exist any more.

Maybe the fire is out so to speak but it seems there is some cleanup work ahead.

I can deploy now too.

I am still getting failed builds and unable to access my app. I get a failed build after 33s.

For what it’s worth, my deployment is now giving a 502 when visiting instead of 504.

Only 302 more visits to get a 200 :joy:

I hope they are still looking at it and will update the status page. Pretty unacceptable coming from them to placate that way. Although some issues may be resolved, they should check active community forums and issues before marking something as resolved.

I sincerely hope they will also update their uptime: Fly.io Status - Uptime History

Still showing “No downtime recorded on this day.” as of now.

CPU load maxed out once I redeployed again

This is more of a “+1” / “same here” reply.
We’re having inexplicable issues with our apps too, tried different regions.

Even fly machines list can return different results 1 sec apart (sometimes “no machines”, sometimes the correct list).
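
For anyone who wants to capture that flapping for a support ticket, here is a rough sketch that runs a command several times and counts how many different outputs came back. `count_distinct` and `SAMPLE_DELAY` are names invented for this example.

```shell
#!/usr/bin/env bash
# count_distinct N CMD... : run CMD N times, one second apart by default,
# and print how many different outputs were seen.
# SAMPLE_DELAY is a hypothetical knob for this sketch.
count_distinct() {
  local n="$1"; shift
  local i
  for ((i = 0; i < n; i++)); do
    # Checksum stdout and stderr together so error responses count too;
    # a failing command is recorded, not treated as fatal.
    "$@" 2>&1 | cksum || true
    sleep "${SAMPLE_DELAY:-1}"
  done | sort -u | wc -l | tr -d ' '
}
```

For example, `count_distinct 5 fly machines list -a myapp` (with `myapp` as a placeholder) should print 1 when the answers are stable, and anything higher means the results changed between calls.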

Another +1 here. We’re also still experiencing issues with at least one of our apps (an internal service that appears to be unreachable by other apps despite Fly saying the app is running).

We’ve been running production apps on Fly just over a year now and love many things about the platform when it works, but the severity and duration of this incident are making us rethink using Fly going forward. Each of our apps runs at least 2 machines to offer some level of redundancy in case of failure, but that’s irrelevant if the entire cloud service is down.

I have the same issue since last night. An old app that I have not touched for a while is working. But the app I am currently working on and which I have been trying to deploy since last night will not come up and stay up. There’s a long Slack discussion here: Slack
