How does failover work with Fly?

max1486 · June 20, 2022, 11:34am

Hello,

We’ve been looking at building a highly available app with Fly, and so far it has been a great tool for that at a good price. However, I’m struggling to find a resource online that clarifies how failover works (if at all) when a Fly location goes offline.

The above URL clarifies that it’ll go to another region in the event of a failed health check or concurrency limit, so I understand that if a VM goes offline it would be detected. It does not clarify what happens in the event of a total region network outage.

Can someone provide further information on what happens when a network location goes offline? Is failover built in, and if so, how fast is it? If it isn’t built in, does traffic essentially just get black-holed?

Thanks!

kurt · June 20, 2022, 9:02pm

This is a nuanced question. There’s usually not one answer! It mostly depends on how the app is deployed.

In general, here’s what can go wrong in our infrastructure. The first thing to understand is that we run two types of infrastructure: workers that host your VMs and edges that accept external connections.

Edge host failures result in a BGP update. If we lose a whole region and have to remove it from BGP entirely, it could take 30s or so for the internet to route connections around the bad region. This is very rare, but possible.
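For clients that want to ride out a window like that, one option is to retry with backoff until a deadline slightly longer than the expected reroute time. This is only a minimal sketch, not anything Fly provides: `call_with_retry`, the 45-second default deadline, and the backoff values are all hypothetical choices made for illustration.

```python
import time


def call_with_retry(operation, deadline_s=45.0, base_delay_s=0.5,
                    max_delay_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the deadline would be exceeded.

    Assumption: operation raises ConnectionError while the region is
    unreachable. The deadline is chosen to outlast a ~30s reroute window.
    """
    start = clock()
    delay = base_delay_s
    while True:
        try:
            return operation()
        except ConnectionError:
            # Give up if the next wait would push us past the deadline.
            if clock() - start + delay > deadline_s:
                raise
            sleep(delay)
            # Exponential backoff, capped so retries stay reasonably frequent.
            delay = min(delay * 2, max_delay_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.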

Worker host failures are more problematic. If you’re using volumes or have your app running in a single region, a full region outage will make your VMs inaccessible.

If you have VMs running in other regions and there’s a network outage, we’ll happily route around that. If your app allows VMs in other regions, but there aren’t any running, we will eventually try to replace the offline VMs. This could take 10+ minutes. It’s the most brittle part of our recovery process (because it’s a hard problem).
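The worker-side cases above can be summarized as a small decision function. This is only a reading of the post, not Fly’s actual scheduling logic; the `Deployment` fields and the outcome strings are hypothetical names made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    uses_volumes: bool          # app data is pinned to volumes
    allowed_regions: int        # regions the app is allowed to run in
    vms_in_other_regions: int   # VMs already running outside the failed region


def region_outage_outcome(d: Deployment) -> str:
    """Rough expected outcome when the worker hosts in one region go offline."""
    if d.vms_in_other_regions > 0:
        # Healthy VMs exist elsewhere, so traffic is routed around the outage.
        return "routed to other regions"
    if d.uses_volumes or d.allowed_regions <= 1:
        # Nothing else can serve the app until the region comes back.
        return "inaccessible until the region recovers"
    # Other regions are allowed but empty: replacement VMs come up eventually.
    return "offline VMs replaced elsewhere, possibly after 10+ minutes"


print(region_outage_outcome(Deployment(False, 3, 2)))
print(region_outage_outcome(Deployment(True, 1, 0)))
print(region_outage_outcome(Deployment(False, 3, 0)))
```

The practical takeaway from this reading is that keeping at least one VM running in a second region is what moves an app from the slow or unavailable cases into the first one.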
