High availability on Fly.io

Hi, can you explain how Fly.io handles high availability? Are there any guarantees?

  1. What happens when I have multiple nodes in a single region and one of the nodes crashes?
  2. Is there some load balancer that routes traffic to nodes in the same region, and which performs health checks (and removes unhealthy nodes)?
  3. What is the load balancer’s health check frequency?
  4. How long does it take to remove unhealthy nodes from the routing table?
  5. What happens when I have a single node in each region and one of the regions crashes, is the traffic routed to other regions?

The answer to questions 2, 3, and 5 is yes:

  1. Scaling and Autoscaling
  2. Launch: health checks and alerting (help us with pricing model)

The other questions are more pointed. Someone from Fly’s eng team would know better which edge cases might affect the high availability (four 9s?) you’re after.

Yeah! It’s about what you’d hope, for the most part. I can answer your questions directly, but it probably helps to understand how load balancing is architected.

We have two kinds of hosts: edge and worker. Edges accept anycast / network traffic, and workers run your VMs. When a user makes a request, they connect to the edge proxy, which then forwards the request to a proxy process on the worker host (we call this backhaul). The local worker proxies are responsible for talking to the VMs, enforcing concurrency limits, etc.

Health checks and service discovery run out of band. This system keeps track of what VMs are alive and which are passing health checks. Because it’s a big distributed system that spans the world, it takes up to 2 minutes for edge proxies to see VM state changes.

The edge proxy makes a best guess on where to first send a request based on health, load, and latency. These are all eventually consistent, so it’s very common for a request to hit an edge proxy and then get forwarded to a bad VM. This will still be true once we fix the service discovery delays: it takes quite a while for a proxy in Sydney to detect changes to a VM in Santiago.

Worker host proxies are the source of truth for a given VM. When an edge proxy forwards a request over backhaul, the worker proxy checks the state of the current VM. Assuming it’s under the concurrency limit and the VM is still running, the worker proxy will forward the request to the VM.

If a worker proxy receives a request for a bad VM (overloaded, gone, etc), it will tell the edge proxy to retry the request. The edge proxy will then pick a new VM and repeat the cycle.
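
To make that cycle concrete, here is a rough Python sketch. It is not Fly's actual proxy code: the `VM` class, the `concurrency_limit` field, and both function names are made up for illustration. It only shows the shape of it, with the edge working from a stale view, the worker checking real state, and the edge moving on when it is told to retry.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class VM:
    name: str
    running: bool = True
    in_flight: int = 0          # requests currently being handled
    concurrency_limit: int = 2  # hypothetical limit, for illustration only


def worker_accepts(vm: VM) -> bool:
    """Worker proxy check: it sees the VM's real, current state."""
    return vm.running and vm.in_flight < vm.concurrency_limit


def edge_route(candidates: list[VM]) -> tuple[str | None, list[str]]:
    """Try candidates in the edge's preferred order.

    Returns the name of the VM that took the request (or None) and the
    names of the VMs that were tried and rejected along the way.
    """
    rejected = []
    for vm in candidates:
        if worker_accepts(vm):
            vm.in_flight += 1
            return vm.name, rejected
        rejected.append(vm.name)  # worker said "retry", so pick the next VM
    return None, rejected


# The edge's view is stale: it still believes "a" and "b" are fine.
vms = [VM("a", running=False), VM("b", in_flight=2), VM("c")]
chosen, rejected = edge_route(vms)
print(chosen, rejected)  # c ['a', 'b']
```

The user only sees the response from "c"; the two rejected attempts cost a little latency and nothing else.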

When a VM fails to service a request, we will retry when it’s safe. If the VM hasn’t read the request body, we might retry. We can also retry when the VM sends back a “please retry this request” header.
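
As a sketch of that rule (again hypothetical, with field and function names invented here rather than taken from the proxy), a failed attempt is only a retry candidate if the VM never read the body or explicitly asked for a retry:

```python
from dataclasses import dataclass


@dataclass
class FailedAttempt:
    body_read_by_vm: bool     # did the VM start reading the request body?
    vm_asked_for_retry: bool  # did the VM reply with a "please retry" header?


def retry_is_possible(attempt: FailedAttempt) -> bool:
    # A retry is only a candidate if the VM cannot have acted on the
    # request body, or if the VM itself told the proxy to replay it.
    return (not attempt.body_read_by_vm) or attempt.vm_asked_for_retry
```

Note this only says when a retry is allowed to happen. Whether one actually happens is still up to the proxy.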

  1. See above for “single region VM crash” and let me know if I didn’t answer that?
  2. The load balancer doesn’t do active health checks; it relies on service discovery for check state.
  3. Health check frequency is defined in the fly.toml under the services section. I tend to run health checks every 1-5s for my own apps.
  4. It takes up to 2 minutes for our edges to see that a VM has changed state. But once a VM stops accepting connections, we retry immediately.
  5. We call this latency shedding, usually. When we can’t service a request in one region due to health checks or concurrency limits, we send it to the next closest region.
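
For answer 5, the fallback can be pictured like this. It is a simplified sketch under assumptions: the function name and the two sets are invented for the example, and the region codes are just placeholders for "closest first" from one edge.

```python
from __future__ import annotations


def pick_region(
    regions_by_distance: list[str],
    passing_checks: set[str],
    under_limit: set[str],
) -> str | None:
    """Return the closest region that can take the request, if any."""
    for region in regions_by_distance:
        if region in passing_checks and region in under_limit:
            return region
    return None


# Closest first, as seen from one hypothetical edge.
order = ["syd", "sin", "nrt"]
print(pick_region(order, passing_checks={"sin", "nrt"}, under_limit={"syd", "sin", "nrt"}))  # sin
```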

Does that help?