Minimizing Impact of Dead Hosts: New Features and Recovery Techniques

Machines are an abstraction we take pride in; they are lightweight and fast, secure and private, disposable and repeatable. But most importantly, they can and will crash at some point because they are tied to Hosts that, even with redundancy, are subject to hardware failures.

A dead and unreachable Host is unavoidable. It can go down due to a broken network device, a power outage, a malfunctioning fan causing overheating, a multi-disk failure, or just a stroke of bad luck. Dead Hosts are a reality: they happen and have happened. We are exploring ways to minimize the impact, making it easier for your apps to keep running and recover quickly with minimal manual intervention.

Many users run apps across multiple machines that our system spreads across various hosts. These users rarely experience total app outages. However, a significant number of users run single-machine apps, and even though we are vocal against it, it keeps happening.

Unreachable Machines are something a multi-machine app can handle, but dead hosts also cause unmanageable machines. This means machine update and destroy operations may fail, blocking application upgrades (a.k.a. fly deploy).

What we are introducing today is the ability to flag a machine as to-be-destroyed. The API call remains the same and is the well-known DestroyMachine request. When Fly detects that the host where the machine is running isn’t responsive, it will change the machine’s state to destroying and ensure it is destroyed as soon as the host comes back (if it ever does).

Every distributed system has trade-offs. A host may be temporarily unreachable due to a network partition, such as someone stepping on a network cable. To the rest of the system, the host appears dead, and a request to destroy the machine on that host may be issued. If the host returns shortly after, there is a high chance the machine is still running and may attempt to reconnect to the network. The system will persist in destroying it, but you might experience up to a minute of “undead” activity.

This generally isn’t a problem if your app only exposes services through Fly-Proxy, as the proxy ignores machines in destroyed and destroying states. However, it can be an issue if the machine serves as a worker and processes messages from a queue or similar. This problem isn’t new; it’s akin to losing a worker mid-processing. You need to account for it regardless. What’s new is that the machine may come back running outdated code for a short period of time.
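
One common way to account for a worker that disappears or briefly comes back is to make message handling idempotent. The following is a minimal sketch of that idea, not something Fly provides: the function name handle_once, the processed_ids set, and the message shape (a dict with an "id" key) are all hypothetical. In a real deployment the set of processed IDs would have to live in a store shared by every worker, such as a database, since an in-memory set is not visible to an undead machine.

```python
# Illustrative sketch only. handle_once, processed_ids, and the message shape
# are hypothetical names, not part of any Fly.io API.
from typing import Callable, MutableSet


def handle_once(message: dict, processed_ids: MutableSet[str],
                handler: Callable[[dict], None]) -> bool:
    """Run handler for a message unless its ID was already processed.

    Returns True if the handler ran, False if the message was a duplicate.
    """
    message_id = message["id"]
    if message_id in processed_ids:
        # Another worker (possibly an "undead" one) already handled this.
        return False
    handler(message)
    # Record the ID only after the handler returns, so a crash mid-processing
    # leaves the message eligible for a retry.
    processed_ids.add(message_id)
    return True
```

With this shape, a message delivered twice triggers the handler only once, whichever worker sees it first. Note that the check and the insert are not atomic here; a shared store would need a unique constraint or a similar guard to close that gap.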

Recovering from unreachable machines

Machines without volumes are easier to handle. They are what we call stateless and can be moved around. We are working towards automatically recreating stateless machines on other hosts when they become unreachable. This process is similar to machine migrations but has limitations because, without the source host, it’s more difficult to reconstruct the final machine configuration. Yes, it’s a bit odd, but that’s a topic for another post.

In the meantime, starting with flyctl v0.1.112, running fly deploy will attempt to upgrade your app even if some machines are on a dead host. Under the hood, it replaces the affected machines with healthy ones.

When volumes are involved, recovery is more challenging. The main issue is that volumes are tied to hosts, and when a host goes down, the volume goes with it. Running fly deploy will error in this case and indicate which machine and volume are affected.

Volume recovery depends on your app. If the volume data isn’t crucial, destroy the machine and scale up the app by one to start with an empty volume. (Pro tip: cloning the dead machine won’t work, but cloning a sibling will.)

❯ fly machine destroy --force MACHINEID
...
❯ fly scale count $EXPECTED_MACHINES
... 

If the volume data is important, you can still recover from a snapshot:

❯ fly vol list
ID                      STATE   NAME    SIZE    REGION  ZONE    ENCRYPTED       ATTACHED VM     CREATED AT
vol_jo4v4mv88keymd0n*   created data    10GB    iad     0002    true                            1 day ago
vol_jyx9x59me85w8exr    created data    10GB    iad     0001    true            7591857c12836d  1 day ago

* These volumes' hosts could not be reached.

❯ fly vol snapshots list vol_jo4v4mv88keymd0n
Snapshots
ID                      STATUS  SIZE            CREATED AT      RETENTION DAYS
vs_xy7nyZLo16KGiao91x   created 383382372       17 hours ago    5
vs_qwgpwPme9nKli7jxqx   created 383382372       1 day ago       5
vs_wnw3nZNL61KeUpN0px   created 383382372       2 days ago      5

❯ fly vol create data --size 10 --region iad --snapshot-id vs_xy7nyZLo16KGiao91x
                  ID: vol_42gekq8q8nejq37v
                Name: data
              Region: iad
                Zone: d32e
...

Once the volume is ready, destroy the unreachable machine if you haven’t already, then scale up with fly scale count. For the replacement machine to pick up the new volume, the volume must have the correct name and be in the same region as that machine.

About Machine API changes

You needn’t worry about this section if you only interact with flyctl to control your app and never use the Machine API directly.

Destroying a machine whose host is unreachable requires the same API call as destroying a normal machine. We might still refuse to do so if the host has only recently lost communication, but generally, there isn’t a different API call to make. The only important thing is to set force=true.
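
As a rough illustration, the request could be built like this. This is a sketch under assumptions: it assumes force is passed as a query parameter on a DELETE to the same machine URL used by GetMachine, and that authentication uses a Bearer token. The helper name build_destroy_request is hypothetical, and the request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

# Base URL taken from the GetMachine examples in this post.
API_BASE = "https://api.machines.dev/v1"


def build_destroy_request(app: str, machine_id: str, token: str,
                          force: bool = True) -> urllib.request.Request:
    """Build (but do not send) a DestroyMachine request.

    Assumption: force travels as a query parameter and the token is sent
    as a Bearer Authorization header.
    """
    query = urllib.parse.urlencode({"force": "true" if force else "false"})
    url = f"{API_BASE}/apps/{app}/machines/{machine_id}?{query}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )
```

Passing the result to urllib.request.urlopen would send it. Since the API may still refuse when the host has only recently lost communication, a caller should be prepared for an error response and retry later.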

When querying machines, we expect GetMachine and ListMachines API calls to return machine information. One piece that always comes with it is the Machine Configuration you sent to create it. For security reasons, we opted to scrub and distribute only part of that configuration across our fleet. The full machine configuration is only kept on the final host where the machine is created.

The bottom line is that when a host goes down, the full machine config becomes inaccessible. The API can return only a partial and incomplete view of it. After a heated discussion where no kittens were harmed, it was decided to return the incomplete configuration under a new field named incomplete_config (surprise!).

So, in a normal GetMachine request, the machine data looks like this:

// https://api.machines.dev/v1/apps/sleeper/machines/7591857c12836d
{
  "id": "7591857c12836d",
  "name": "red-firefly-4850",
  "state": "started",
  "region": "iad",
  "host_status": "ok",
  "config": {...},
  "incomplete_config": null,
...
}

But when the machine’s host is unreachable, the machine’s config field won’t be present (or will be null). The best available version of it, still partial and incomplete, can be found under the incomplete_config field.

// https://api.machines.dev/v1/apps/sleeper/machines/0273d8d7a08915
{
  "id": "0273d8d7a08915",
  "name": "aged-lake-9049",
  "state": "stopped",
  "region": "iad",
  "host_status": "unreachable",
  "config": null,
  "incomplete_config": {...},
...
}
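
Client code that reads machine data may want to handle both shapes in one place. Here is a minimal sketch, assuming the response has already been decoded into a Python dict; the helper name usable_config is hypothetical, and only the config and incomplete_config field names come from the examples above.

```python
from typing import Optional, Tuple


def usable_config(machine: dict) -> Tuple[Optional[dict], bool]:
    """Return (config, is_complete) for a decoded machine object.

    Prefers the full config; falls back to incomplete_config when the
    full one is missing or null, as happens when the host is unreachable.
    """
    config = machine.get("config")
    if config is not None:
        return config, True
    # Host unreachable: only the partial view is available, if any.
    return machine.get("incomplete_config"), False
```

Returning the completeness flag alongside the config lets callers decide whether the partial view is good enough for their purpose, for example displaying a machine versus recreating it.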

Hopefully, this is clear enough. Don’t hesitate to ask. Happy Flying, Y’all.
