LiteFS recommended production configuration for availability/durability

It really is a cost optimization sometimes, :black_cat:; e.g., for an app with a medium-size database and infrequent writes that's distributed across several regions.

(This avoids the need to create an entire separate tier of Postgres replicas all over the world. And non-obviously, those db nodes can’t be auto-stopped.)

It's also easier to administer in the steady state, after the initial head-scratching and tinkering period.

In that case, I'd suggest using a single primary region with min_machines_running = 2, and I'll save my other comments for later…

That would keep you on the cobblestone path that flyctl and Fly Proxy are mainly designed for.
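Concretely, that's only a couple of lines in fly.toml. Here's a minimal sketch, assuming an HTTP app on internal port 8080 with ams as the primary region (the app name, port, and region are placeholders):

```toml
# fly.toml (sketch): app name, port, and region below are placeholders
app = "my-litefs-app"
primary_region = "ams"

[env]
  # The stock litefs.yml examples gate lease candidacy on
  # FLY_REGION == PRIMARY_REGION, so only ams Machines can become primary.
  PRIMARY_REGION = "ams"

[http_service]
  internal_port = 8080
  auto_stop_machines = true    # read replicas elsewhere can still scale down
  auto_start_machines = true
  min_machines_running = 2     # counts Machines in the primary region only
```

The nice property here is that min_machines_running only counts Machines in the primary region, which is exactly where the two LiteFS lease candidates need to stay up.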

Yeah, that was bad luck with arn and ams, now that I think about it.

Try looking for region-specific incidents for phx instead, to get a more balanced perspective.

Even with Amsterdam’s size bias, here’s what I see in terms of incidents that would materially affect the decision about whether to place LiteFS primaries exclusively there (versus splitting across 2+ regions):

| date | effect on existing primaries | summary |
| --- | --- | --- |
| 2025-03-18 | nil | new deployments glitchy (due to registry woes) |
| 2025-03-17 | nil | new Machines cannot be created |
| 2025-02-27 | 1/3 | network outage affecting a third of Machines |
| 2024-10-11 | nil | new Machines cannot be created |
| 2024-05-08 | nil | new deployments glitchy (due to registry woes) |
| 2024-05-01 | nil | new deployments glitchy (due to SSO woes) |

(This skips global outages, of course, since region fine-tuning wouldn’t have helped.)†

I.e., one incident in the past 12 months that would have mattered, and even then at only 33% odds of actually having been hit.

People with four-nines-style availability requirements would certainly find it perspicacious to avoid a 33% chance of 2 hours of downtime over the course of a year, but for others it might not be worth it. Broader disaster-recovery measures (like you were suggesting) might be a better allocation of effort overall.
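For scale, the usual back-of-the-envelope math: a four-nines budget allows 8,760 h/yr × 0.0001 ≈ 53 minutes of downtime per year, while a 33% chance of a 2-hour outage works out to an expected 0.33 × 120 ≈ 40 minutes. One regional incident of that shape eats most of the annual budget on its own.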

They’ve used the term “availability zone” in a looser sense before, for volume hardware independence, but I don’t think entire redundant infrastructure blocks are in the works at all.

(Larger regions like London already span multiple data centers, though, from what I hear.)


†No one wants this many overall incidents, lest there be any doubt. The capacity side in particular is a recurring theme in Fly.io’s announcements, and lately even in its job postings.
