Self-healing machine state synchronization and service discovery

(This post is pretty heavy on the technical internals of Fly.io; we’re also writing a blog post on the same topic.)

At Fly.io, we run a platform hosting hundreds of thousands of Fly Machines. One of the core challenges is managing the state of all of these machines. If you have followed some of our previous platform-wide incidents, you may have noticed that almost every single one of them had lasting effects caused by some kind of machine state inconsistency. This is why, over the past months, we have been working on improvements in this area. This post is the latest update from that line of work, and is mainly about how we now synchronize machine state and service information into Corrosion, our service discovery mechanism, making it much more resilient to inconsistencies.

Flyd and Corrosion

Before I get into the nitty-gritty of how this works, a quick refresher on how the Fly.io platform is architected. Each bare metal host we operate runs a daemon called flyd, which oversees all machines scheduled onto that host. flyd maintains an internal database of all machines it manages, and is considered the source of truth for questions like “what is this machine doing right now?”. The machines API, for example, talks directly with flyd on each host to fetch a machine’s latest state.
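If you want to poke at that state yourself, the public Machines API is the way in. Here is a minimal sketch in Go of reading one machine’s current state; the app name, machine ID, and token are placeholders:

```go
// Minimal sketch: reading a machine's current state via the public
// Machines API, which is ultimately answered by flyd on the host.
// The app name, machine ID, and token are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	url := "https://api.machines.dev/v1/apps/my-app/machines/1781973f0e6d89"
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("FLY_API_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body)) // JSON including the machine's current state, e.g. "started"
}
```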

Of course, keeping this state local to each flyd instance is impossible, or at least would be a really annoying problem. For example, almost all routing features of fly-proxy rely on information about machines that we can query fast: how many machines are running, how many are in the correct process group to serve this port, and so on. We can’t make an RPC round-trip to every host running one of an app’s machines each time we need to handle an incoming connection!

This is where Corrosion comes in. It is a distributed, eventually-consistent, SQLite-based database that we run on every host. Whenever flyd makes a change to a machine’s state, for example starting or stopping it (or observing it exit), it issues a Corrosion write request to make the same change to the corresponding record in Corrosion. Corrosion replicates that state to every host, typically with millisecond-level delays, and fly-proxy then has a locally cached SQLite database it can query quickly when serving requests. In this way, Corrosion acts almost like a distributed cache that provides blazing-fast local SQLite access to information about every machine hosted on Fly.io.
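To give a feel for why this is fast, here is a purely illustrative sketch of what a local lookup might look like. The database path, table, and column names are made up for the example and are not Corrosion’s actual schema:

```go
// Illustrative only: querying a locally replicated SQLite database the way
// a proxy might when routing a request. The path and schema are hypothetical.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "modernc.org/sqlite" // pure-Go SQLite driver
)

func main() {
	db, err := sql.Open("sqlite", "/var/lib/corrosion/state.db") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// "Which started machines for this app expose port 443?"
	rows, err := db.Query(`
		SELECT m.id, m.region
		FROM machines m
		JOIN services s ON s.machine_id = m.id
		WHERE m.app = ? AND m.state = 'started' AND s.port = ?`,
		"my-app", 443)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id, region string
		if err := rows.Scan(&id, &region); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, region)
	}
}
```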

State Synchronization is Hard

There are a couple of problems with storing a copy of every machine’s state in Corrosion, though. An obvious one is that Corrosion must scale at least with the total number of machines. The global cluster also acts as something of a single point of failure, which is being addressed by a project we call “regionalization”. This post is about the other major problem: how to keep the state stored in Corrosion actually in sync with the source of truth.

The way flyd writes into Corrosion is simple. Every time it changes something locally, it makes exactly the same change in Corrosion. This comes with a hidden assumption, though: it assumes every write before it has been committed correctly. Should any of them fail, get rolled back, or be overwritten by another host (for example, while a machine migrates to another host), future writes will be operating on something that is already inconsistent.

This is all very abstract, so here’s a concrete example: every machine defines a set of ports as “services”, and that’s how fly-proxy knows to route requests to a certain port on your machine. These services are registered (written into Corrosion) when a machine is first created. If that write fails, gets overwritten, or gets lost somehow, flyd will not know about it and will happily keep writing updates to other parts of the machine’s state. In reality, though, all of those updates are now useless, since the machine is completely invisible to fly-proxy and cannot serve any requests.
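To make that failure mode concrete, here is a hedged sketch of the old per-change write pattern, using a made-up schema and a stand-in execCorrosion helper. The later UPDATE silently assumes the INSERTs at creation time actually stuck:

```go
// Illustrative sketch of the old "mirror every change" pattern.
// The schema and execCorrosion helper are hypothetical stand-ins for
// "send this SQL statement to Corrosion".
package main

import "fmt"

type Service struct {
	Port    int
	Handler string
}

type Machine struct {
	ID       string
	App      string
	Services []Service
}

// execCorrosion is a placeholder for issuing a single write to Corrosion.
func execCorrosion(stmt string, args ...any) error {
	fmt.Println(stmt, args)
	return nil
}

// At machine creation: register the machine and its services once.
func onMachineCreated(m Machine) error {
	if err := execCorrosion(
		`INSERT INTO machines (id, app, state) VALUES (?, ?, ?)`,
		m.ID, m.App, "created"); err != nil {
		return err
	}
	for _, svc := range m.Services {
		// If this row is later lost (restore from backup, overwritten by
		// another host), nothing below will ever re-create it.
		if err := execCorrosion(
			`INSERT INTO services (machine_id, port, handler) VALUES (?, ?, ?)`,
			m.ID, svc.Port, svc.Handler); err != nil {
			return err
		}
	}
	return nil
}

// On every subsequent change: update only the column that changed.
// This silently assumes every earlier write is still intact.
func onMachineStateChanged(id, newState string) error {
	return execCorrosion(`UPDATE machines SET state = ? WHERE id = ?`, newState, id)
}

func main() {
	m := Machine{ID: "m1", App: "my-app", Services: []Service{{Port: 443, Handler: "tls"}}}
	_ = onMachineCreated(m)
	_ = onMachineStateChanged(m.ID, "started")
}
```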

Ideally, none of this should happen. Why would a record that was written disappear, or get overwritten? Indeed, when this happens, it usually means other parts of our system are going wrong one way or another. For example, during a few past incidents related to Corrosion, we had to restore Corrosion from backups[1]. That means we lost some writes that happened between when the backup was taken and when the restore completed. We’ve built a “re-sync” feature into flyd for this purpose: it tries to re-do any writes that might have been missed. But this feature has always been an afterthought: it frequently lags behind development of flyd itself and misses new features added to machine and service state. This has hit us hard during incidents before.

Making State Synchronization Robust

After the last incident where we got bitten hard by state inconsistencies, we discussed whether we should just run the re-sync command we had built on a periodic schedule. We quickly realized that might be a problem of its own: besides its lagging feature compatibility with the rest of flyd, it is also inherently racy, because the rest of flyd was not designed with the re-sync command in mind. If we just ran the existing command in a loop, it might unexpectedly overwrite state instead of fixing it.

There is also state that cannot be re-synchronized properly. For example, the fly machine cordon command (or its machines API counterpart) used to just delete all services registered by a machine from Corrosion. There’s no easy way for us to look at a machine and decide whether it has been cordoned or not. This API is also used by bluegreen deployments, and is critical in preventing requests from being routed to green machines that aren’t yet ready. Bluegreen deployments also used to generate additional health checks to work around this cordoning limitation (cordoning also removes any health checks that come with your machine!), and a naive synchronization loop would likely delete those extra checks, since they don’t “look” like they belong to a machine.
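We won’t go into the details of the rework here, but to illustrate what a clearly defined target state buys us, here is one plausible, purely illustrative shape (none of these names come from flyd’s actual code): record cordoning as a flag rather than deleting rows, so a reconciler can always recompute what belongs in Corrosion.

```go
// Purely illustrative; the names and approach are hypothetical.
package sketch

// Old behavior: cordoning simply deleted the machine's service rows:
//
//   DELETE FROM services WHERE machine_id = ?
//
// Nothing recorded that the machine was cordoned, so a re-sync loop could
// not tell "intentionally cordoned" apart from "rows went missing", and
// would happily resurrect the services.

// One plausible fix: make cordoning part of the machine's target state,
// so a reconciler can always recompute exactly which rows should exist.
type MachineTarget struct {
	MachineID string
	Cordoned  bool     // explicit flag instead of implicit row deletion
	Ports     []int    // services stay recorded even while cordoned
	Checks    []string // health checks survive cordoning as well
}
```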

The core of the problem here is that there is no clearly defined “target state” for each machine to synchronize toward. Trying to get a synchronization loop working without addressing this first would quickly turn into a mess. So we decided to attack this from another angle: what if every write from flyd is a full reconciliation? That is, every time anything about a machine changes, we send the full, up-to-date state to Corrosion, not just the single column that changed. Corrosion does not generate spurious gossip messages for no-op changes, so the worst this costs us is some extra SQL parsing and execution plus data transfer over the loopback interface, which is negligible compared to everything else flyd has to handle. Of course, this still depends on fixing cordoning and bluegreen deployments, so we have reworked them as well.
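In code, here is a hedged sketch of what “every write is a full reconciliation” means, again with a hypothetical schema and helpers: rebuild the complete desired state from flyd’s own database, then write it as one unit instead of patching individual columns.

```go
// Hedged sketch of "every write is a full reconciliation". Schema, column
// names, and helpers are hypothetical; the point is that the full desired
// state is recomputed from flyd's source of truth and written as a single
// unit, so the write is safe to repeat at any time.
package sketch

type Service struct {
	Port    int
	Handler string
}

// DesiredState is everything Corrosion should know about one machine,
// recomputed from flyd's local database on every change.
type DesiredState struct {
	MachineID string
	App       string
	State     string // e.g. "started", "stopped"
	Cordoned  bool
	Services  []Service
}

// Stmt is one SQL statement plus its arguments.
type Stmt struct {
	SQL  string
	Args []any
}

// Exec is a placeholder for sending a batch of statements to Corrosion as
// one write, so readers never observe a half-applied update.
type Exec func(stmts []Stmt) error

// Reconcile overwrites the machine's row and replaces its service rows,
// regardless of what was (or wasn't) there before.
func Reconcile(exec Exec, d DesiredState) error {
	stmts := []Stmt{
		{
			SQL: `INSERT INTO machines (id, app, state, cordoned)
			      VALUES (?, ?, ?, ?)
			      ON CONFLICT (id) DO UPDATE SET
			        app = excluded.app,
			        state = excluded.state,
			        cordoned = excluded.cordoned`,
			Args: []any{d.MachineID, d.App, d.State, d.Cordoned},
		},
		// Replace, don't patch: stale or missing service rows get repaired too.
		{SQL: `DELETE FROM services WHERE machine_id = ?`, Args: []any{d.MachineID}},
	}
	for _, s := range d.Services {
		stmts = append(stmts, Stmt{
			SQL:  `INSERT INTO services (machine_id, port, handler) VALUES (?, ?, ?)`,
			Args: []any{d.MachineID, s.Port, s.Handler},
		})
	}
	return exec(stmts)
}
```

Because the write replaces rather than patches, re-running it at any time is harmless, which is exactly the property a periodic loop needs.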

As of today, we have mostly completed this work. Ad-hoc writes have been replaced by triggering an iteration of the synchronization loop for that machine. Some final touch-ups are still being rolled out, such as reconciling services periodically.
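Conceptually, the loop itself is simple. Here is a hedged sketch (the trigger mechanism and interval are made up): state changes poke a channel, a ticker covers the periodic case, and both paths run exactly the same reconciliation code.

```go
// Hedged sketch of a per-machine reconciliation loop. Every local state
// change pokes the trigger channel, and a ticker covers the periodic case.
// Both paths run the same reconcile code; names and interval are made up.
package main

import (
	"fmt"
	"time"
)

func reconcile(machineID string) error {
	// In the real system this would rebuild the machine's full desired
	// state and write it to Corrosion; here we just log.
	fmt.Println("reconciled", machineID, "at", time.Now().Format(time.RFC3339))
	return nil
}

func runLoop(machineID string, trigger <-chan struct{}, stop <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Minute) // illustrative interval
	defer ticker.Stop()

	for {
		select {
		case <-trigger: // a state change happened, reconcile now
		case <-ticker.C: // periodic safety net, heals anything missed
		case <-stop:
			return
		}
		if err := reconcile(machineID); err != nil {
			fmt.Println("reconcile failed, will retry on next iteration:", err)
		}
	}
}

func main() {
	trigger := make(chan struct{}, 1)
	stop := make(chan struct{})
	go runLoop("m1", trigger, stop)

	// Something changed about the machine: poke the loop. The buffered,
	// non-blocking send coalesces rapid changes into one pending iteration.
	select {
	case trigger <- struct{}{}:
	default:
	}

	time.Sleep(100 * time.Millisecond)
	close(stop)
}
```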

What Does This Mean For You

The most important goal of this is that, should we ever again run into a situation where Corrosion needs to be restored (or “reseeded”), there should not be any lasting effects visible to you, the customer. You should not be seeing funky routing behavior hours after we have completed the process. When every single state update is a reconciliation, we exercise that code path every time you do anything with a machine, so there is nothing special about re-synchronizing everything when we need to.

This also helps address silent inconsistencies that arise from subtle timing issues. We have seen reports where a machine gets stuck in a weird state according to the proxy, the machines API, or Corrosion, while in reality it is working just fine. Because every state update now simply triggers an iteration of the reconciliation loop, and that loop sends the full, up-to-date state to Corrosion atomically, there is much less of a chance of this going wrong only partially. Even if something unexpected does happen, another iteration will fix it up very quickly (hence “self-healing”).

The rework of cordoning has also enabled us to improve how autosuspend works. This is not yet fully complete, but you should see another Fresh Produce post on it pretty soon! Spoiler: it is about how the proxy “drains” connections off a machine that’s about to be suspended. Right now, our answer is “we do nothing”, and that can cause failed connections.

Speaking of new features, there is at least one more upcoming feature that depends on (or rather, is made much simpler by) a periodically running state reconciliation loop. Feel free to guess what that will be, and stay tuned for more updates from us!


  1. This can also happen if there is a breaking change in Corrosion or its schema; it’s rare but we do occasionally need to do it ↩︎

