App down for 17h, incident not shown on public status page

Hi all,

My app has been down for 17h already. The suggested action is to re-deploy (or take similar actions), but that didn’t help. I don’t see any incident listed on the public status page, but a 17h downtime is not a small thing. What can I do?

Thanks!


I’m experiencing the same. Very frustrating.

100%. I’m sure the team was up all night working on this issue - hugops for all. But the lack of communication is frankly inexcusable. No updates here, no email, no Twitter post.

Mine just came back online!

Hi everyone,

Several different points have come up here, and I want to get to each of them. Above all, though, I want to apologize for the confusion: expectations for this platform should be clear to all users, and clearly we haven’t accomplished that here. Let me try to sort this out.

First, no incident was declared because the failure of an individual host server is within the normal operations of the platform. We try to stress multiple times in the docs that the way to ensure uptime on the Fly Platform is to run two or more Machines, and that running an App with a single Machine does risk downtime. But if this is coming as news to any of you, then we need to do more to make sure that all users are aware of this expectation.

We do, however, have code that is supposed to send out emails to the relevant accounts when an issue is created for an individual host server. It sounds like those emails were not sent, which we are now going to look into.

Finally, @mcfly and maybe others: you said that you tried to re-deploy your app but that it didn’t bring it back up. Redeploying alone typically isn’t enough; you should use fly scale count to create new Machines. I’m happy to walk you through that here, but if you don’t want to get into the details of your setup in public, your org is on the Launch plan, so you can also contact support at any time and they’ll help you get back in shape.
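
For anyone following along, here is a minimal sketch of what that could look like. The app name `my-app` is a placeholder, and the `run` helper is purely illustrative: it only prints each command unless `RUN=1` is set, so nothing touches a real app by accident. The target count of 2 is an assumption based on the two-or-more-Machines guidance above.

```shell
#!/bin/sh
# Placeholder app name (an assumption); replace it with your own.
APP="my-app"

# Illustrative helper: print the command unless RUN=1 is set,
# in which case the command is actually executed.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '+ %s\n' "$*"
  fi
}

run fly status -a "$APP"          # list the app's Machines and their state
run fly scale count 2 -a "$APP"   # request two Machines in total
run fly status -a "$APP"          # check that the new Machines came up
```

Run it as is to preview the commands, or with `RUN=1` to execute them against your app.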

Thanks for getting back to us, john-fly.

I tried fixing it via deploy because that was one of the suggested actions in the link on the personal status page. I have now tried the approach using scale as well, but it didn’t work for me either. The output I get is this:

 Machine 123 currently has a config that will change with the new fly.toml.  This is what will change:
// followed by lots of red removals (diff), indicating the whole config would be wiped out

It then concludes with:

Error: this app has no complete releases. Run `fly deploy` to create one and rerun this command

I’ve tried running this in various forms:

fly scale count 0 -a ...
fly scale count 0 -c ./path/to/my/fly.toml 

So none of that makes much sense to me:

  • Why would the config be wiped out if I provide one right away? Is this due to the temporary phase of having 0 machines? (If so, I find the message misleading, especially when I run the command with a fly.toml.)
  • What does the error about the app having no complete releases, with its suggestion to redeploy, actually mean?
  • And the main question, obviously: how do I fix it?

More generally, though, I wonder what has happened here. You say it’s an individual problem, but there was no deploy and there were no other touchpoints with that app, so no action on our side caused this. It should be noted that the affected app is the application’s db-cluster, so a rather stable piece of software. And I run it with the suggested 3-server setup, so your point about having only one server doesn’t apply here. So it seems to me that there might have been (and potentially still is) a problem on your side. I’d appreciate it if this gets looked into.

I’d also appreciate some kind of email notification service for when a downtime is detected. I know you have a Grafana integration, but something as basic as a downtime notification email should not require extra setup on our side.

Thank you!

Hi Marcel,

First, a simple thing: we absolutely agree that an email notification should be automatic and should not need any configuration or extra setup from users. It is a bug on our end that the email was not sent.

As for your case, I think there is something more complicated going on there. Having checked some things in the backend, I can already see that this is a somewhat more intricate story. I’m happy to help you debug it here on the forum if you’d like, but it’ll involve a bit more back and forth and might also require you to post more sensitive information about your app. You’re on the Launch plan, so you have access to 24/7 support; I suggest that you email support@fly.io so that the support team can help me, and you can get this resolved as quickly as possible.

@mcfly You said the app that is behaving strangely is your “db-cluster”. If this is a Postgres app, then it indeed cannot be managed via fly scale, since Postgres apps are somewhat special and don’t have a configuration or active releases. Just wanted to clarify that!

- Daniel

Sure, will write you guys there, thanks!
