Region HA Design

Do Fly regions function more like an individual AWS AZ or an AWS region? We have a lot of existing infra running in us-east-1, so I’ve been spinning up resources in Fly’s IAD region. Is this a highly available configuration, or should I find the “next closest” Fly region to place VMs into?

AWS AZs and Regions are a very specific architectural setup in which their data-center, networking, and other hardware are tightly married to their devops, netops, and other software. I don’t believe you can map that onto Fly, since Fly’s architecture is likely radically different.

AWS requires you to think about applications in a way that suits the isolation boundaries they have built. For Fly, I don’t think comparable guidance exists (and even if it did, I hear they are in the process of rewriting a bunch of their “platform”, so things are in a state of flux).

To me, because Fly advocates for “app servers closer to users”, spreading multiple VMs across primary and backup regions is the rule of thumb to follow on Fly’s platform.
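
To make that rule of thumb concrete, here is a minimal sketch using flyctl; the region codes (iad as primary, syd as backup) and the counts are placeholders, and flag names may vary by flyctl version, so double-check with fly scale count --help:

    # Run a couple of Machines in the primary region...
    fly scale count 2 --region iad

    # ...and a couple more in a backup region, so a single-region
    # incident doesn't take the whole app down.
    fly scale count 2 --region syd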

Keep in mind, though, that in AWS some services can tolerate AZ failures (ALB / NLB), others can tolerate Region failures (CloudFront / Route53), and some cannot even tolerate rack failures (depending on which racks the VMs in their service “cluster” land on), or disk and RAM failures.


That said, if I really had to, I’d consider one Fly region as analogous to one AWS AZ.

I would consider Fly regions to function like AWS regions, with the difference being that there’s not much control over Fly AZs yet. The closest you can get is ensuring volumes are spread across AZs, which by extension causes the VMs that use those volumes to be spread across AZs. One advantage of Fly is that it’s really easy to add another region to your app.
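
As a rough sketch of that volume point (the volume name, size, and region are placeholders; confirm the flags against fly volumes create --help for your flyctl version): each Fly volume lives on a specific physical host, so creating one volume per Machine tends to spread those Machines across separate hardware in the region.

    # One volume per Machine, all under the same volume name.
    # Each volume is pinned to a particular host, so the Machine that
    # mounts it runs on that host, spreading Machines across hardware.
    fly volumes create appdata --region iad --size 10
    fly volumes create appdata --region iad --size 10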

Due to the nature of its offering, Fly may not have boundaries as strictly defined as AWS’s DC/AZ/Region hierarchy to limit the blast radius, as it were.

Case in point (albeit one they have apparently since fixed):

The annoying thing about global consensus is that the operational problems tend to be global as well; we had an outage last night (correlated disk failure on 3 different machines!) in Chicago, and it slowed down deploys all the way to Sydney, essentially because of invariants maintained by a global Raft consensus and fed in part from malfunctioning machines.

- tptacek.

Even at AWS, from what I saw, designing to limit blast radius only took shape after long-drawn-out re-architecture upon re-architecture, over a decade of outage after outage… and even then, it isn’t perfect (as laid out in the James Hamilton paper I linked to above).