High availability on Fly.io

Yeah! It works about the way you’d hope, for the most part. I can answer your questions directly, but it probably helps to first understand how load balancing is architected.

We have two kinds of hosts: edge and worker. Edge hosts accept anycast network traffic; worker hosts run your VMs. When a user makes a request, they connect to an edge proxy, which forwards the request to a proxy process on the worker host (we call this backhaul). The local worker proxies are responsible for talking to the VMs, enforcing concurrency limits, and so on.
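
To make the worker proxy's role concrete, here's a minimal sketch of per-VM concurrency limiting. The type and method names are illustrative, not Fly.io internals:

```go
package main

import "fmt"

// vmSlot models the per-VM concurrency limit a worker proxy enforces.
// These names are made up for illustration.
type vmSlot struct {
	limit int // max concurrent requests the VM should handle
	inUse int // requests currently in flight
}

// tryAcquire reports whether the proxy may forward one more request to the VM.
func (s *vmSlot) tryAcquire() bool {
	if s.inUse >= s.limit {
		return false
	}
	s.inUse++
	return true
}

// release frees a slot when a request completes.
func (s *vmSlot) release() { s.inUse-- }

func main() {
	vm := &vmSlot{limit: 2}
	fmt.Println(vm.tryAcquire()) // true
	fmt.Println(vm.tryAcquire()) // true
	fmt.Println(vm.tryAcquire()) // false: VM is at its concurrency limit
	vm.release()
	fmt.Println(vm.tryAcquire()) // true: a slot opened up
}
```

A real proxy would do this with atomics or a semaphore under concurrent load; the point is just that the worker proxy, not the edge, is the gatekeeper.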

Health checks and service discovery run out of band. This system keeps track of which VMs are alive and which are passing health checks. Because it’s a big distributed system that spans the world, it can take up to 2 minutes for edge proxies to see VM state changes.

The edge proxy makes a best guess about where to send a request first, based on health, load, and latency. All of these signals are eventually consistent, so it’s very common for a request to hit an edge proxy and then get forwarded to a bad VM. That stays true even if we fix the service discovery delays: it takes quite a while for a proxy in Sydney to detect changes to a VM in Santiago.

Worker host proxies are the source of truth for a given VM. When an edge proxy forwards a request over backhaul, the worker proxy checks the actual state of the VM. Assuming the VM is still running and under its concurrency limit, the worker proxy forwards the request to it.

If a worker proxy receives a request for a bad VM (overloaded, gone, etc.), it tells the edge proxy to retry the request. The edge proxy then picks a new VM and repeats the cycle.
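
That retry cycle can be sketched as a loop over the edge's candidate list, with the worker proxy as the final arbiter. The verdict names and function shape are hypothetical, not Fly.io's wire protocol:

```go
package main

import "fmt"

// workerVerdict is what a worker proxy answers after checking the VM's real
// state. Hypothetical names for illustration.
type workerVerdict int

const (
	accepted       workerVerdict = iota
	retryElsewhere               // VM gone, stopped, or over its concurrency limit
)

// handleAtEdge walks the edge's ranked candidate list; whenever a worker
// proxy answers "retry", the edge picks the next VM and tries again.
func handleAtEdge(vms []string, check func(vm string) workerVerdict) (string, bool) {
	for _, vm := range vms {
		if check(vm) == accepted {
			return vm, true
		}
	}
	return "", false // every candidate rejected the request
}

func main() {
	// Simulated worker-side state: the edge's first pick is actually dead.
	state := map[string]workerVerdict{"vm-a": retryElsewhere, "vm-b": accepted}
	served, ok := handleAtEdge([]string{"vm-a", "vm-b"}, func(vm string) workerVerdict {
		return state[vm]
	})
	fmt.Println(served, ok) // vm-b true
}
```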

When a VM fails to service a request, we retry only when it’s safe to do so. If the VM hasn’t read the request body yet, we can replay the request against another VM. We also retry when the VM sends back a “please retry this request” header.

  1. See above for “single region VM crash” and let me know if I didn’t answer that?
  2. The load balancer doesn’t do active health checks; it relies on service discovery for check state.
  3. Health check frequency is defined in the fly.toml under the services section. I tend to run health checks every 1-5s for my own apps.
  4. It takes up to 2 minutes for our edges to see that a VM has changed state. But once a VM stops accepting connections, we retry immediately.
  5. We usually call this latency shedding. When we can’t service a request in one region due to health checks or concurrency limits, we send it to the next-closest region.
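
On point 3, the check interval lives alongside the service definition in fly.toml. This fragment is illustrative; check the Fly.io docs for the exact schema your platform version expects:

```toml
[[services]]
  internal_port = 8080
  protocol = "tcp"

  # TCP check: just "is the port accepting connections?"
  [[services.tcp_checks]]
    interval = "5s"   # how often the check runs; I tend to use 1-5s
    timeout  = "2s"

  # HTTP check: the VM must answer 2xx on this path
  [[services.http_checks]]
    interval = "5s"
    timeout  = "2s"
    method   = "get"
    path     = "/healthz"  # assumed endpoint; use whatever your app exposes
```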

Does that help?