App down for 17h, incident not shown on public status page

Hi everyone,

A few different points have come up here, and I want to get to each of them. Above all, though, I want to apologize for the confusion: expectations for this platform should be clear to all users, and we evidently haven’t accomplished that here. Let me try to sort this out.

First, no incident was declared because the failure of an individual host server falls within the normal operations of the platform. We try to stress throughout the docs that the way to ensure uptime on the Fly Platform is to run two or more Machines, and that running an App on a single Machine does risk downtime. If this is coming as news to any of you, then we need to do more to make sure that all users are aware of this expectation.

We do, however, have code that is supposed to email the relevant accounts when an issue is created for an individual host server. It sounds like those emails were not sent, and we are now going to look into that.

Finally, @mcfly and maybe others: you said that you tried to redeploy your app but that didn’t bring it back up. Redeploying alone typically isn’t enough; you should use fly scale count to create new Machines. I’m happy to walk you through that here. If you’d rather not get into the details of your setup in public, your org is on the Launch plan, so you can also contact support at any time and they’ll help you get back in shape.
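
For anyone who wants a starting point, here is a rough sketch of what scaling up might look like. The app name my-app is a placeholder, and the count of 2 is only an example based on the two-or-more-Machines guidance above. The small run helper is my own addition, not part of flyctl: it prints each command instead of executing it, so you can review things first. Your setup may need more than this (volumes or specific regions, for example), so treat it as an outline rather than a recipe.

```shell
# Placeholder app name (an assumption); replace with your own app.
APP="${APP:-my-app}"

# By default the commands are only printed. Set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Check the current state of the app's Machines.
run fly status --app "$APP"

# Ask for two Machines in total so that new ones get created.
run fly scale count 2 --app "$APP"

# Confirm that the new Machines have started.
run fly status --app "$APP"
```

Once the printed commands look right for your app, run it again with DRY_RUN=0, or just type the fly commands directly.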