It really is a cost optimization sometimes, e.g., for an app with a medium-size database and infrequent writes, distributed across several regions.
(This avoids the need to create an entire separate tier of Postgres replicas all over the world. And non-obviously, those db nodes can’t be auto-stopped.)
It also is easier to admin in the steady state, after the initial head-scratching and tinkering period.
In that case, I would suggest using a single primary region with `min_machines_running = 2`, and save my other comments for sometime later…
That would keep you on the cobblestone path that `flyctl` and Fly Proxy mainly are designed for.
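For concreteness, here is roughly what that could look like in `fly.toml`. This is only a sketch: the app name, the choice of `ams` as primary, and the internal port are placeholders I made up, so adjust them to your setup. As far as I know, `min_machines_running` only counts Machines in the primary region, which is why it pairs naturally with the single-primary layout.

```toml
# Hypothetical app name and port; substitute your own values.
app = "my-litefs-app"

# Assumed choice of primary region, for illustration only.
primary_region = "ams"

[http_service]
  internal_port = 8080
  # Let Fly Proxy stop idle Machines and start them again on demand...
  auto_stop_machines = "stop"
  auto_start_machines = true
  # ...but always keep two running in the primary region.
  min_machines_running = 2
```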
Yeah, that was bad luck with `arn` and `ams`, now that I think about it.
Try looking for region-specific incidents for `phx` instead, to get a more balanced perspective.
Even with Amsterdam’s size bias, here’s what I see in terms of incidents that would materially affect the decision about whether to place LiteFS primaries exclusively there (versus splitting across 2+ regions):
date | effect on existing primaries | summary |
---|---|---|
2025-03-18 | nil | new deployments glitchy (due to registry woes) |
2025-03-17 | nil | new Machines cannot be created |
2025-02-27 | 1/3 | network outage affecting a third of Machines |
2024-10-11 | nil | new Machines cannot be created |
2024-05-08 | nil | new deployments glitchy (due to registry woes) |
2024-05-01 | nil† | new deployments glitchy (due to SSO woes) |
(This skips global outages, of course, since region fine-tuning wouldn’t have helped.)
I.e., only one incident in the past 12 months touched existing primaries, and even then at 33% odds of actually having been hit.
People with 4-nines style of availability requirements would find it prudent to avoid a 33% chance of 2 hours of downtime over the course of a year, certainly, but for others it might not be worth it. Broader disaster-recovery measures (like you were suggesting) might be a better allocation of effort, overall.
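To put rough numbers on that, here is a back-of-the-envelope sketch. The 2-hour outage and the one-in-three odds come from the table above; treating the year as 365 days and the outage as a single all-or-nothing event are my own simplifying assumptions.

```python
# Back-of-the-envelope comparison of a 4-nines downtime budget against
# the single Amsterdam incident that affected existing primaries.
# Assumptions (mine): a 365-day year, one all-or-nothing outage.

MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)


budget = downtime_budget_minutes(0.9999)  # "4 nines"
outage_minutes = 2 * 60                   # the 2-hour outage
hit_probability = 1 / 3                   # a third of Machines affected

# Average over many hypothetical years versus the unlucky case.
expected_minutes = hit_probability * outage_minutes
exceeds_budget_if_hit = outage_minutes > budget

print(f"4-nines budget:    {budget:.1f} min/year")
print(f"expected downtime: {expected_minutes:.1f} min/year")
print(f"budget blown if hit: {exceeds_budget_if_hit}")
```

The expected value (about 40 minutes) sits under the roughly 52.6-minute budget, but a single actual hit (120 minutes) overshoots it by more than double, which is why the answer depends so much on how strict your target really is.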
They’ve used the term “availability zone” in a looser sense before, for volume hardware independence, but I don’t think entire redundant infrastructure blocks are in the works at all.
(Larger regions like London already span multiple data centers, though, from what I hear.)
†No one wants this many overall incidents, lest there be any doubt. The capacity side in particular is a recurring theme in Fly.io’s announcements, and even in job postings lately.