LiteFS recommended production configuration for availability/durability

It really is a cost optimization sometimes, :black_cat:; e.g., for an app with a medium-size database and infrequent writes that's distributed across several regions.

(This avoids the need to create an entire separate tier of Postgres replicas all over the world. And non-obviously, those db nodes can’t be auto-stopped.)

It's also easier to administer in the steady state, after the initial head-scratching and tinkering period.

In that case, I'd suggest using a single primary region with min_machines_running = 2, and I'll save my other comments for later…

That would keep you on the cobblestone path that flyctl and Fly Proxy are mainly designed for.
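Concretely, that's only a couple of lines in fly.toml. Here's a minimal sketch, assuming an HTTP app on internal port 8080 with ams as the primary region (the app name, port, and region are placeholders):

```toml
# fly.toml (sketch): app name, port, and region below are placeholders
app = "my-litefs-app"
primary_region = "ams"

[env]
  # The stock litefs.yml examples gate lease candidacy on
  # FLY_REGION == PRIMARY_REGION, so only ams Machines can become primary.
  PRIMARY_REGION = "ams"

[http_service]
  internal_port = 8080
  auto_stop_machines = true    # read replicas elsewhere can still scale down
  auto_start_machines = true
  min_machines_running = 2     # counts Machines in the primary region only
```

The nice property here is that min_machines_running only counts Machines in the primary region, which is exactly where the two LiteFS lease candidates need to stay up.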

Yeah, that was bad luck with arn and ams, now that I think about it.

Try looking for region-specific incidents for phx instead, to get a more balanced perspective.

Even with Amsterdam’s size bias, here’s what I see in terms of incidents that would materially affect the decision about whether to place LiteFS primaries exclusively there (versus splitting across 2+ regions):

| date | effect on existing primaries | summary |
| --- | --- | --- |
| 2025-03-18 | nil | new deployments glitchy (due to registry woes) |
| 2025-03-17 | nil | new Machines cannot be created |
| 2025-02-27 | 1/3 | network outage affecting a third of Machines |
| 2024-10-11 | nil | new Machines cannot be created |
| 2024-05-08 | nil | new deployments glitchy (due to registry woes) |
| 2024-05-01 | nil | new deployments glitchy (due to SSO woes) |

(This skips global outages, of course, since region fine-tuning wouldn’t have helped.)†

I.e., one incident in the past 12 months that would have mattered, and even then at only 33% odds of actually having been hit.

People with four-nines-style availability requirements would certainly find it perspicacious to avoid a 33% chance of 2 hours of downtime over the course of a year, but for others it might not be worth it. Broader disaster-recovery measures (like you were suggesting) might be a better allocation of effort overall.
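For scale, the usual back-of-the-envelope math: a four-nines budget allows 8,760 h/yr × 0.0001 ≈ 53 minutes of downtime per year, while a 33% chance of a 2-hour outage works out to an expected 0.33 × 120 ≈ 40 minutes. One regional incident of that shape eats most of the annual budget on its own.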

They’ve used the term “availability zone” in a looser sense before, for volume hardware independence, but I don’t think entire redundant infrastructure blocks are in the works at all.

(Larger regions like London already span multiple data centers, though, from what I hear.)


†No one wants this many overall incidents, lest there be any doubt. The capacity side in particular is a recurring theme in Fly.io’s announcements, and lately even in its job postings.
