Question on Host Downtime

This outage yesterday for a single host didn’t impact me, but I’m wondering what happens during host downtime at Fly for my own knowledge. Does Fly redeploy apps on working hosts when there are no volumes bound to the app that would require it to be on the same host? If not, are there manual steps we can take to move it? Does redeploying the app move it? I understand that apps with volumes attached probably can’t move easily, but the apps I am worried about do not have volumes.

Also, on the per-org status page in the Fly dashboard that now exists…is that information available via the Fly GraphQL API? If not currently, will it ever be? I did a quick look through the docs and didn’t see it mentioned, but I might have missed it.


Kind of! It very much depends on the issue. The tldr is that you’re better off running scale count >=2 if you want maximum uptime.

On Nomad apps, Nomad will reschedule a VM (if it can) when it detects that the host is gone. However, that’s kind of a fuzzy detection. Nomad routinely notices it hasn’t heard from a host even when it’s happy, so there’s a pretty serious lag before it marks things lost. More than 15 minutes in most cases. So when hosts fail catastrophically, Nomad doesn’t really help.
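To make the lag concrete, here is a toy model of timeout-based host detection. It is not how Nomad is implemented; the names (`HostView`, `is_marked_lost`, `downtime_before_reschedule`) and the 15 minute grace period are assumptions chosen only to match the "more than 15 minutes" figure above.

```python
from dataclasses import dataclass

# Assumed grace period, taken from the "more than 15 minutes" figure in the post.
LOST_AFTER_SECONDS = 15 * 60


@dataclass
class HostView:
    """What a scheduler knows about a host: its name and when it last checked in."""
    name: str
    last_heartbeat: float  # seconds on the scheduler's clock


def is_marked_lost(host: HostView, now: float, grace: float = LOST_AFTER_SECONDS) -> bool:
    """A host counts as lost only after it has been silent for longer than `grace`."""
    return (now - host.last_heartbeat) > grace


def downtime_before_reschedule(failure_time: float, last_heartbeat: float,
                               grace: float = LOST_AFTER_SECONDS) -> float:
    """Seconds a VM on a dead host waits before rescheduling can even begin."""
    return max(0.0, (last_heartbeat + grace) - failure_time)
```

A generous grace period avoids rescheduling VMs off hosts that merely missed a check-in, at the cost of a long wait when a host really is gone.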

When we notice something wrong with a host running Nomad, we can drain the node and let Nomad reschedule its VMs elsewhere. This will probably not be noticeable from your end. You’ll see a new VM start and the old one go away.

Machines apps are slightly different. They’re pinned to hosts just like volumes are. You will want 2+ machines if you need high availability. These can scale to zero, though, so you can create 3 machines and leave 2 of them turned off.

Our proxy will start machines when it needs to. We haven’t quite shipped the feature to stop machines when an app is less busy, but it should land any day now.
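A rough sketch of the behavior described above, assuming a simplified model: each machine is pinned to one host, and a request goes to a started machine on a healthy host or, failing that, wakes a stopped one. `Machine`, `pick_machine`, and the host names are hypothetical and are not Fly's proxy code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    id: str
    host: str   # machines are pinned to the host they were created on
    state: str  # "started" or "stopped"


def pick_machine(machines: list[Machine], down_hosts: set[str]) -> Optional[Machine]:
    """Return a machine that can serve a request, starting a stopped one if needed."""
    reachable = [m for m in machines if m.host not in down_hosts]
    # Prefer a machine that is already running.
    for m in reachable:
        if m.state == "started":
            return m
    # Otherwise wake a stopped machine on a healthy host.
    for m in reachable:
        if m.state == "stopped":
            m.state = "started"  # stand-in for the proxy starting the machine
            return m
    # Every machine lives on a failed host, so nothing can serve the request.
    return None


machines = [
    Machine("m1", "host-a", "started"),
    Machine("m2", "host-b", "stopped"),
    Machine("m3", "host-c", "stopped"),
]
```

With this layout, `pick_machine(machines, {"host-a"})` wakes `m2`, while an app whose only machine sits on `host-a` gets `None` until that host comes back.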

I think the Machines model is probably better, but it does put more load on your brain. One problem with outages is that creating new VMs has a surprising number of moving parts. Machines can start without referencing any external system, but creating a new machine requires a Docker pull, a service registry update, etc. I think we can make flyctl do the right thing with machines, though, and solve most of this problem for you.

There’s no API for application level status yet. We don’t really have one planned right now.

That sounds very important for those who migrate to machines. I don’t think I saw this anywhere in the docs, in the machines announcement post, or in the post about making machines the new default. I mean the point that you need a spare (stopped) machine for availability.

If I understand it correctly, your app needs at least 2 machines, one of which can be stopped. If a host goes down and you have only 1 machine, the app stays down until that host is revived. If you have a second, stopped machine, it will be started automatically on another host (because it’s already assigned to that host), unless that other host is overbooked at that moment. You are not paying for stopped machines (but you will pay for any volumes attached to them).
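One way to check a layout against that reasoning is to ask, for each host, whether the app would still have a usable machine if that host alone failed. This is only a sketch of my understanding: `Machine`, `single_points_of_failure`, and `full_hosts` (hosts assumed too overbooked to start a stopped machine) are made-up names, not anything from Fly's API.

```python
from __future__ import annotations

from typing import NamedTuple


class Machine(NamedTuple):
    host: str      # host the machine is pinned to
    started: bool  # False means a stopped spare


def single_points_of_failure(machines: list[Machine],
                             full_hosts: frozenset[str] = frozenset()) -> list[str]:
    """Return hosts whose failure alone would leave the app with no usable machine."""
    risky = []
    for host in sorted({m.host for m in machines}):
        # A survivor is on another host and is either already running
        # or is a stopped spare whose host still has room to start it.
        survivors = [
            m for m in machines
            if m.host != host and (m.started or m.host not in full_hosts)
        ]
        if not survivors:
            risky.append(host)
    return risky
```

For a single machine on `host-a` this reports `["host-a"]`. Adding a stopped spare on `host-b` clears the list, unless `host-b` is in `full_hosts`, which is the overbooked case mentioned above.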

This is more than high availability. A host can go down for hours, and it will take your app down for hours as well if you moved to apps v2 and kept 1 machine like you did with a Nomad instance. Many could tolerate minutes of downtime, but not hours. It feels like a spare stopped machine is a must.

@kurt please correct me if I’m wrong.
