We’ve been looking at building a highly available app with Fly, and so far it has been a great tool for that at a good price. However, I’m struggling to find a resource online that clarifies how failover works (if at all) when a Fly location goes offline.
The above URL clarifies that traffic will go to another region in the event of a failed health check or a reached concurrency limit, so I understand that a VM going offline would be detected. It does not clarify what happens in the event of a total region network outage.
Can someone provide further information on what happens when a network location goes offline? Is failover built in and, if so, how fast is it? If it is not built in, does traffic essentially just get black-holed?
This is a nuanced question. There’s usually not one answer! For the most part, it depends on how the app is deployed.
In general, here’s what can go wrong in our infrastructure. The first thing to understand is that we run two types of infrastructure: workers that host your VMs and edges that accept external connections.
Edge host failures result in a BGP update. If we lose a whole region and have to remove it from BGP entirely, it could take 30s or so for the internet to route connections around the bad region. This is very rare, but possible.
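If a client needs to ride out that kind of rerouting window, one option is to retry with backoff on the caller's side. The following is only a minimal sketch, not anything Fly provides: `call_with_retry` and all of its parameters are hypothetical names, and the 45 second deadline is an assumption picked simply to be longer than the "30s or so" mentioned above.

```python
import time


def call_with_retry(operation, deadline_s=45.0, base_delay_s=1.0,
                    max_delay_s=8.0, sleep=time.sleep, clock=time.monotonic):
    """Retry `operation` with exponential backoff until a deadline.

    Hypothetical helper: the defaults are assumptions, not Fly settings.
    `sleep` and `clock` are injectable so the logic can be tested quickly.
    """
    start = clock()
    delay = base_delay_s
    while True:
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            # Give up if waiting again would take us past the deadline.
            if clock() - start + delay > deadline_s:
                raise
            sleep(delay)
            # Double the wait each time, capped at max_delay_s.
            delay = min(delay * 2, max_delay_s)
```

Here `operation` is any zero-argument callable that raises `ConnectionError` or `TimeoutError` on a network failure, for example a small wrapper around whatever HTTP client the app already uses.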
Worker host failures are more problematic. If you’re using volumes or have your app running in a single region, a full region outage will make your VMs inaccessible.
If you have VMs running in other regions and there’s a network outage, we’ll happily route around that. If your app allows VMs in other regions, but there aren’t any running, we will eventually try to replace the offline VMs. This could take 10+ minutes. It’s the most brittle part of our recovery process (because it’s a hard problem).
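To summarize the worker-side cases, here is a rough sketch of the outcomes described above written as a small decision function. It is a reader's model of this answer, not Fly's actual scheduler logic; `App`, `Outcome`, `worker_region_outage`, and every field name are hypothetical, and treating volumes as blocking replacement in another region is an assumption based on the volumes remark above.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    REROUTED = "traffic is routed to VMs running in other regions"
    REPLACED_LATER = "offline VMs are eventually replaced (could take 10+ minutes)"
    INACCESSIBLE = "VMs are inaccessible until the region recovers"


@dataclass
class App:
    # All fields are hypothetical, for illustration only.
    failed_region: str
    running_regions: set[str]   # regions that currently have running VMs
    allowed_regions: set[str]   # regions where the app may place VMs
    uses_volumes: bool = False


def worker_region_outage(app: App) -> Outcome:
    """Model what happens when every worker in `failed_region` is lost."""
    # VMs already running elsewhere keep serving traffic.
    if app.running_regions - {app.failed_region}:
        return Outcome.REROUTED
    # No survivors: replacement is only possible if another region is
    # allowed and (assumption) the app is not tied to volumes.
    fallback = app.allowed_regions - {app.failed_region}
    if fallback and not app.uses_volumes:
        return Outcome.REPLACED_LATER
    return Outcome.INACCESSIBLE
```

For example, `worker_region_outage(App("ord", {"ord"}, {"ord", "ams"}))` lands in the slow "replaced later" case, which is the argument for keeping at least one VM running in a second region.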