Region HA Design

Do Fly regions function more like an individual AWS AZ or an AWS region? We have a lot of existing infra running in us-east-1, so I’ve been spinning up resources in Fly’s IAD region. Is this a highly available configuration, or should I find the “next closest” Fly region to place VMs into?

AWS AZs and Regions are a very specific architectural setup in which the data centers, networking, and other hardware are tightly coupled to AWS’s devops, netops, and other software. I don’t believe you can map that onto Fly, since their architecture is likely radically different.

AWS requires you to think about applications in a way that suits the isolation boundaries they have built. For Fly, I don’t think comparable guidance exists (and even if it did, I hear they are in the process of rewriting a bunch of their “platform”, so things are in a state of flux).

To me, because Fly advocates for “app servers closer to users”, spreading multiple VMs across primary and backup regions is the rule of thumb to follow on Fly’s territory.
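As a rough illustration of that rule of thumb, here’s a minimal Python sketch that checks whether a given VM placement would still have something running if any one region went away. Everything in it is made up for the example: the function names, the VM names, and the placement dicts are hypothetical, and it doesn’t talk to Fly’s API at all. The region codes are only labels (iad is the one from the question, ewr is just a stand-in for a “next closest” region).

```python
def surviving_vms(placement, failed_region):
    """Count the VMs left running if every VM in failed_region goes down.

    placement maps a (hypothetical) VM name to the region it runs in.
    """
    return sum(1 for region in placement.values() if region != failed_region)


def tolerates_single_region_loss(placement, min_healthy=1):
    """True if at least min_healthy VMs survive the loss of any one region."""
    if not placement:
        return False  # nothing deployed, so nothing can survive
    regions = set(placement.values())
    return all(surviving_vms(placement, r) >= min_healthy for r in regions)


# Everything in one region: losing iad takes the whole app down.
single_region = {"vm-1": "iad", "vm-2": "iad"}

# Primary region plus one backup region (illustrative labels only).
spread = {"vm-1": "iad", "vm-2": "iad", "vm-3": "ewr"}

print(tolerates_single_region_loss(single_region))  # False
print(tolerates_single_region_loss(spread))         # True
```

This only models whole-region failure. It says nothing about failures inside a region (host, rack, disk), which is where the AZ comparison gets fuzzy.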

Keep in mind, though, that in AWS some services can tolerate AZ failures (ALB / NLB) and others can tolerate Region failures (CloudFront / Route53), but some cannot even tolerate rack failures (depending on which racks the VMs in their service “cluster” spin up on), or disk and RAM failures.

That said, if I really had to, I’d consider one Fly region as analogous to one AWS AZ.

I would consider Fly regions to function like AWS regions, with the difference being that there’s not much control over Fly AZs yet. The closest you can get is ensuring volumes are spread across AZs, which by extension causes your VMs that use those volumes to be spread across AZs. An advantage of Fly is that it’s really easy to add another region to your app.

Due to the nature of its offering, Fly may not have boundaries as strictly defined as AWS’s DC/AZ/Region to limit blast radius, as it were.

Case in point (albeit one they have apparently since fixed):

The annoying thing about global consensus is that the operational problems tend to be global as well; we had an outage last night (correlated disk failure on 3 different machines!) in Chicago, and it slowed down deploys all the way to Sydney, essentially because of invariants maintained by a global Raft consensus and fed in part from malfunctioning machines.

- tptacek.

Even at AWS, from what I saw, designing to limit blast radius took shape only after protracted re-architecture upon re-architecture, over a decade of outage after outage… and even then, it still isn’t perfect (as laid out in the James Hamilton paper I linked to above).