LiteFS recommended production configuration for availability/durability

Is there a minimal recommended setup for running LiteFS in production?

Ideally, I’d like to achieve:

  • Automatic failover on host/disk failure
    • With data loss bounded by the replication lag.
  • Automatic failover on region outage
    • With data loss bounded by the replication lag.
    • It seems that regional network outages are somewhat common?
  • Zero downtime deployment
    • I’m assuming there is some mechanism to ensure all data is replicated before switching the primary during deployment.

Would running two always-on machines in two different regions with promote: true be enough to achieve this?

If I use two nearby regions like ARN and AMS, I’m assuming the replication lag would generally be under 1 second?

2 Likes

Hi… Yep, there are various official and unofficial recommendations, and they’re all pretty close to what you sketched out above…

The “automatic failover on region outage” that you mentioned is going beyond the implicit minimum† in that older thread, though, and for that you would need primary-candidates in multiple regions.

In my (super-unofficial) view, the “write latency shouldn’t change just because your primary does” aspect is a minor concern, so I wouldn’t shy away too much from having two primary-candidates in two regions, like the proposed arn + ams. The main disadvantage is that it’s inconvenient to keep them both always running (which you do need, as you noted) via the Fly.io platform’s default orchestrator: that software is relatively simple-minded compared to what you’re attempting, and it only thinks in terms of a single region being primary, not a more general collection of nodes or locations.
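
For concreteness, here is a minimal litefs.yml sketch of that two-candidate shape, assuming the stock Consul lease; the paths, port, and env vars are the usual ones from the LiteFS docs rather than anything specific to your app:

```yaml
# litefs.yml (sketch, not a drop-in config)
fuse:
  dir: "/litefs"            # where the app opens its SQLite database
data:
  dir: "/var/lib/litefs"    # backing store on the Fly volume

lease:
  type: "consul"
  # Both always-on machines (arn + ams) may become primary, so there is
  # no ${FLY_REGION == PRIMARY_REGION} gate here.
  candidate: true
  # Ask the current primary to hand off the lease once this node has
  # caught up; this is what keeps rolling deploys close to zero-downtime.
  promote: true
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"
  consul:
    url: "${FLY_CONSUL_URL}"
    key: "litefs/${FLY_APP_NAME}"
```

(The example config in the LiteFS docs gates candidate on ${FLY_REGION == PRIMARY_REGION}; with only two machines that are both meant to be candidates, a plain true is the simpler expression.)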

Briefly, another orchestration nuance to be aware of is how read-only nodes behave when the primary is unreachable. At least one user reported losing read availability, not just write availability, possibly due to the following policy change:

https://community.fly.io/t/postmortem-fly-registry-2023-08-08/14744

This rules out fallback scenarios that people intuitively assume would work.

As for regional network outages being common: not really, although some long-time LiteFS users do recommend having at least one node in a different region, all the same.

(You can browse the Infrastructure Log (“100% fidelity to internal incidents”) to evaluate how often a given region really goes down in a given year.)

(That probability could definitely still be more than your own, specific app could tolerate, of course.)

I don’t run services with constraints that strict, myself, but the official docs mention a bound of a “few milliseconds of write availability loss”. That and the following thread are the only potential gotchas that I know of:

(Note that rolling is the only option when you have volumes, apart from the rather abrupt immediate mode.)
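
If you’d rather pin that down explicitly than rely on the default, a fly.toml sketch (the same choice can also be made per-deploy with fly deploy --strategy rolling):

```toml
# fly.toml (sketch): make the deploy strategy explicit
[deploy]
  strategy = "rolling"   # canary/bluegreen aren't available for apps with volumes
```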

You can use the all-pairs round-trip times (RTT) tables to get an idea of how much added latency that would be. That’s currently showing 23ms for arn ↔ ams.

I wouldn’t expect that to cause many problems with replication lag, but it’s always wise to test ahead of time. Also, do keep the differences in bandwidth billing in mind, in general. :dragon:

Anyway, this didn’t address all of your questions, but hopefully it still helps a little!


†That is a Postgres thread, strictly speaking, but it shows Fly.io’s overall thinking on this topic, particularly in the final paragraph.

2 Likes

Thanks for the breakdown and links. I’m just starting to scratch the surface with Fly and LiteFS, so it’s reassuring to hear I’m at least directionally correct. I haven’t fully grasped all the implications of this setup yet, but from what I can tell, it looks like it could offer a lot of value per dollar for the right kind of application. I’ll keep digging deeper.

My availability requirements aren’t super strict. If deployments take ten seconds, I can probably just plan around that, but it would be nice not to have to think about it. Same goes for failover: I could likely get by with manually restoring from Litestream if I had to, though of course it’d be nice to avoid thinking about it at all.

Regarding regional outages: maybe it’s just recency bias, but scrolling through the incident history, I noticed partial outages in ARN and AMS, and at least one full outage, in the last few months. From my understanding, Fly doesn’t have Availability Zones within regions?

Thanks again

It really is a cost optimization sometimes, :black_cat:, e.g. for an app with a medium-size database and infrequent writes, distributed across several regions.

(This avoids the need to create an entire separate tier of Postgres replicas all over the world. And non-obviously, those db nodes can’t be auto-stopped.)

It’s also easier to administer in the steady state, after the initial head-scratching and tinkering period.

In that case, I would suggest using a single primary region with min_machines_running = 2, and saving my other comments for some later time…

That would keep you on the cobblestone path that flyctl and Fly Proxy are mainly designed for.
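
A minimal fly.toml sketch of that shape, assuming an HTTP app; the app name, region, port, and volume name below are placeholders:

```toml
# fly.toml (sketch, placeholders throughout)
app = "my-litefs-app"
primary_region = "arn"

[[mounts]]
  source = "litefs"                 # volume backing /var/lib/litefs
  destination = "/var/lib/litefs"

[http_service]
  internal_port = 8080
  auto_start_machines = true
  auto_stop_machines = true         # extra machines elsewhere can stop...
  min_machines_running = 2          # ...but keep two up in the primary region
```

(min_machines_running counts only Machines in the primary_region, which is exactly the property being leaned on here.)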

Yeah, that was bad luck with arn and ams, now that I think about it.

Try looking for region-specific incidents for phx instead, to get a more balanced perspective.

Even with Amsterdam’s size bias, here’s what I see in terms of incidents that would materially affect the decision about whether to place LiteFS primaries exclusively there (versus splitting across 2+ regions):

| date | effect on existing primaries | summary |
|---|---|---|
| 2025-03-18 | nil | new deployments glitchy (due to registry woes) |
| 2025-03-17 | nil | new Machines cannot be created |
| 2025-02-27 | 1/3 | network outage affecting a third of Machines |
| 2024-10-11 | nil | new Machines cannot be created |
| 2024-05-08 | nil | new deployments glitchy (due to registry woes) |
| 2024-05-01 | nil | new deployments glitchy (due to SSO woes) |

(This skips global outages, of course, since region fine-tuning wouldn’t have helped.)

I.e., one materially relevant incident in the past 12 months, with 33% odds of actually having been hit by it.

People with 4-nines-style availability requirements (roughly 52 minutes of allowable downtime per year) would find it prudent to avoid a 33% chance of 2 hours of downtime over the course of a year, certainly, but for others it might not be worth it. Broader disaster-recovery measures (like you were suggesting) might be a better allocation of effort, overall.

They’ve used the term “availability zone” in a looser sense before, for volume hardware independence, but I don’t think entire redundant infrastructure blocks are in the works at all.

(Larger regions like London already span multiple data centers, though, from what I hear.)


†No one wants this many overall incidents, lest there be any doubt. The capacity side in particular is a recurring theme in Fly.io’s announcements, and even job postings, lately.

1 Like