This topic has also now been made private and can’t be found if you haven’t already commented on it.
There was no notification (email or comment here) of services being restored.
If this was one of our (paid) production instances that was affected rather than a dev one - frankly, I’d be completely reevaluating whether we would continue using Fly for hosting - the zero communication on this has really shocked me and left me worried about how something more important to us would be handled.
All I’m asking for is some official message or reassurance on what’s happened here - I don’t think that’s too much to ask - our customers would expect it of our company if something went down for 3 days.
I deleted my Twitter account a while ago, but perhaps a tweet to someone senior about this could shed some light on what’s going on?
Fly’s CEO (@mrkurt) wrote this blog post about their recent raise and specifically calls out “Support and reliability” as one of the focuses for them as a company.
Our app that went down was a prod client app (luckily alpha so few actual users using it, but yeah.)
After another incident a few hours ago with our newly-rebuilt DB machines (which we thought having two instances of would address any issues), we are indeed seriously re-evaluating our usage of Fly.
Just wanted to provide some more details on what happened here, both with the thread and the host issue.
The radio silence in this thread wasn’t intentional, and I’m sorry if it seemed that way. While we check the forum regularly, sometimes topics get missed. Unfortunately this one slipped by us until today, when someone saw it and flagged it internally. If we’d seen it earlier, we’d have offered more details then.
More on what happened: We had a single host in the syd region go down, hard, with multiple issues. In short, the host required a restart, then refused to come back online cleanly. Once back online, it refused to connect with our service discovery system. Ultimately it required a significant amount of manual work to recover.
Apps running multiple instances would have seen the instance on this host go unreachable, but other instances would have remained up and new instances could be added. Single instance apps on this host were unreachable for the duration of the outage. We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.
The main status page (status.fly.io) is used for global and regional outages. For single host issues like this one we post alerts on the status tab in the dashboard (the emergency maintenance message @south-paw posted). This was an abnormally long single-host failure and we’re reassessing how these longer-lasting single-host outages are communicated.
It sucks to feel ignored when you’re having issues, even when it’s not intentional. Sorry we didn’t catch this thread sooner.
My current problem with Fly is that when something happens, I can’t be confident whether the problem is on our side.
Although their support team is doing a great job and is hands-down one of the best technical support teams I’ve worked with so far, the platform itself is a hedgehog of splinters.
If not for the support team, we would have left at least a month ago. It’s one of their most significant assets.
The idea is excellent. The execution, however, makes it feel like walking through a minefield. And here I am - taking out my frustration on a topic that is not even about the region we are hosted in.