$ until fly ssh console -a myapp --pty --machine machine-id; do sleep 1; done
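If you'd rather not loop forever, a bounded variant of that loop with a growing delay might look like the sketch below. Note that retry_with_backoff is a hypothetical helper name, not something flyctl provides, and the attempt count and starting delay are arbitrary assumptions.

```shell
# Retry a command until it succeeds, giving up after max_attempts tries.
# The delay doubles after every failed attempt.
# retry_with_backoff is a hypothetical helper, not part of flyctl.
retry_with_backoff() {
  max_attempts=$1   # how many times to try before giving up
  delay=$2          # initial delay in seconds between attempts
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage, with the same placeholders as the command above:
# retry_with_backoff 10 1 fly ssh console -a myapp --pty --machine machine-id
```

Capping the attempts means the loop eventually returns a failure instead of hammering an API that is already struggling.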
Wow, this looks like it was a rough night for the team at Fly. It was shaky when I went to bed and now, looking again after waking up, you've been working on this the whole time I've been asleep.
This incident ruined my evening too a little bit yesterday. I mistook the side effects of the platform incident for operational issues with my individual site and spent a good hour or so attempting various fixes before I decided to check the Fly status page. I even restored the database from a backup at one point.
Based on that experience, one little idea I'd love to contribute to the post-mortem discussions is the possibility of having the fly CLI print that same helpful "We're addressing an incident. Please check the status page for more details" message that the dashboard shows during an incident. For many of us the CLI is the dashboard after all.
Good luck getting everything back in working order. Looks to have been a nasty one.
Ordinarily it should. See this PR:
But if the flyctl command itself is timing out or otherwise erroring I could see how that can fail.
I moved to fly.io 3 weeks ago, and there are so many good things about the platform, but this whole event is one of those things that is hard to come back from. The company I am consulting for had some reservations about having their infra here based on the previous outages, and it is going to be very hard to convince them otherwise.
I also cannot deploy anything; every attempt fails with this message:
Error: server returned a non-200 status code: 504
For real. Got an account maybe a couple months ago... kicked the tires. Then about 10 days ago started in earnest to build my infra here. Bad timing.
Stumbled on this a few weeks ago when I was deciding on whether or not to take the plunge. Looks like it's still a struggle with the "Status paging" bullet. And that post by the CEO was over 1.5 years ago. I'm thinking things are more beta than release.
We're all in the dark... our infra's been down the same way all day despite whatever words are on the update page. And now, I guess there's at least a few like me that are just watching out of morbid curiosity, watching to see if it can hit the 24-hour mark.
Things seem to be slowly coming back online on my end. Knock on wood.
Same. I can refresh deployments/images.
The API does appear to be responding normally, but our FLAME runners are still fubar. Instances appear to start but are unresponsive, timing out after 30 seconds.
Hmmm. Status page says everything's fine now. My app was still down from last night's failed deploy so I've tried to redeploy it. That didn't work, so I tried scaling it down to 0 instances and back up again. That didn't work so I've scaled back down to 0 to try to get a clean slate to restore from.
Now my app is in a state where fly machine list says "No machines are available on this app". Meanwhile, if I run fly logs I can see my application's usual background noise log chatter running actively on a machine ID that fly machine status claims doesn't exist any more.
Maybe the fire is out so to speak but it seems there is some cleanup work ahead.
I can deploy now too.
I am still getting failed builds and am unable to access my app. The build fails after 33s.
For what it's worth, my deployment is now giving a 502 when visiting instead of 504.
Only 302 more visits to get a 200
I hope they are still looking at it and will update the status page. It's pretty unacceptable for them to placate us that way. Although some issues may be resolved, they should check active community forums and issues before marking something as resolved.
I sincerely hope they will also update their uptime: Fly.io Status - Uptime History
Still showing "No downtime recorded on this day." as of now.
CPU load maxed out once I redeployed again
This is more of a "+1" / "same here" reply.
We're having inexplicable issues with our apps too, and have tried different regions.
Even fly machines list can return different results 1 sec apart (sometimes "no machines", sometimes the correct list).
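For anyone who wants to put a number on that flapping, here is a rough sketch that runs the same command several times and counts how many distinct outputs come back. count_distinct_outputs is a hypothetical helper name and the run count is an arbitrary assumption; it works with any command, not just flyctl.

```shell
# Run a command `runs` times and print how many distinct outputs it produced.
# Each output (stdout and stderr together) is reduced to a checksum so that
# multi-line listings can be compared as single values.
# count_distinct_outputs is a hypothetical helper, not part of flyctl.
count_distinct_outputs() {
  runs=$1
  shift
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" 2>&1 | cksum
    i=$((i + 1))
  done | sort -u | wc -l | tr -d ' '
}

# Usage (myapp is a placeholder app name):
# count_distinct_outputs 10 fly machines list -a myapp
```

A result of 1 means every call agreed; anything higher means the responses differed between calls, which is handy evidence to attach to a support thread.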
Another +1 here. We're also still experiencing issues with at least one of our apps (an internal service that appears to be unreachable by other apps despite Fly saying the app is running).
We've been running production apps on Fly for just over a year now and love many things about the platform when it works, but the severity and duration of this incident are making us rethink using Fly going forward. Each of our apps runs at least 2 machines to offer some level of redundancy in case of failure, but that's irrelevant if the entire cloud service is down.
I have the same issue since last night. An old app that I have not touched for a while is working. But the app I am currently working on and which I have been trying to deploy since last night will not come up and stay up. There's a long Slack discussion here: Slack