Generally speaking, the litefs export / litefs import pair is the main escape hatch here, although I’m surprised that the first VACUUM attempt resulted in a low-level panic.
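For reference, the round trip looks roughly like this (the database name and paths are placeholders; check `litefs export --help` for the exact flags in your version):

```shell
# Export the database out of LiteFS into a plain SQLite file.
# Run this on the current primary node.
litefs export -name db.sqlite /tmp/backup.sqlite

# Later, after the cluster is healthy again, import it back in.
# Import replaces the database contents as a single transaction.
litefs import -name db.sqlite /tmp/backup.sqlite
```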
…
For clustered databases in general, a good recovery heuristic is to make a local backup of the database (to your own development laptop/desktop) and then reduce the cluster to a single node, preferably by removing the nodes that aren’t currently the primary. On the Fly.io platform in particular, you’ll also want to explicitly destroy the leftover volumes, since those are prone to carrying corrupt state over into your next deployment.
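On Fly.io, that recovery sequence might look something like the following, assuming flyctl and a typical LiteFS setup (database name, paths, and volume IDs are placeholders):

```shell
# 1. Pull a local copy of the database first, before touching anything.
fly ssh console -C "litefs export -name db.sqlite /tmp/backup.sqlite"
fly ssh sftp get /tmp/backup.sqlite ./backup.sqlite

# 2. Shrink the cluster down to just the primary.
fly scale count 1

# 3. List the volumes and destroy any that are no longer attached
#    to a machine, so corrupt state can't resurface later.
fly volumes list
fly volumes destroy <volume-id>
```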
As far as I know, vacuuming on /litefs/data (the FUSE mount) is intended to be fully supported; the test suite has a dedicated test for it, if I’m reading it correctly. Perhaps this is a bug that only triggers in WAL mode?
https://github.com/superfly/litefs/issues/445
A couple of other thoughts/notes…
To be extra conservative, you can pass --with-new-volumes when scaling horizontally, which ensures you don’t accidentally pick up a corrupt volume next time. (By default, fly scale count will reuse any existing volumes it finds lying around.) As a further safety net, you can request an immediate snapshot of the primary’s volume instead of waiting for its 24h cycle to roll around.
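Concretely, that would be something like this (flag and subcommand names as I remember them; double-check `fly scale count --help` and `fly volumes snapshots --help`):

```shell
# Scale out with brand-new volumes rather than reusing detached ones.
fly scale count 3 --with-new-volumes

# Take an on-demand snapshot of the primary's volume right away.
fly volumes list                          # find the primary's volume ID
fly volumes snapshots create <volume-id>
```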
Running out of disk space is a difficult situation for any database… Legacy Postgres had a lot of problems with that as well, from what I’ve heard, and a guardrail was added that drops the cluster into read-only mode once disk usage reaches roughly 90%.