## The Problem

We discovered a dangerous edge case when scaling a LiteFS-backed application horizontally. When `promote: true` is set in `litefs.yml`, adding a new machine can result in complete data loss if the new machine (with an empty database) wins the primary election before syncing from the existing primary.
## Our Setup

```yaml
# litefs.yml
lease:
  type: 'consul'
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: true
  advertise-url: 'http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202'
  consul:
    url: '${FLY_CONSUL_URL}'
    key: 'myapp-litefs/${FLY_APP_NAME}'

exec:
  - cmd: npx prisma migrate deploy
    if-candidate: true
  - cmd: npm start
```
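Note that the `candidate` expression makes candidacy purely region-based: any machine whose `FLY_REGION` matches `PRIMARY_REGION` may run for primary. A rough shell paraphrase of that interpolation, with illustrative values (the env var names are the ones Fly injects; the values here are made up):

```shell
#!/bin/sh
# Shell paraphrase of the litefs.yml candidate expression.
# Values are illustrative, not taken from a real machine.
FLY_REGION=sjc
PRIMARY_REGION=sjc

if [ "$FLY_REGION" = "$PRIMARY_REGION" ]; then
  candidate=true    # every machine in the primary region is a candidate
else
  candidate=false
fi
echo "candidate: $candidate"   # → candidate: true
```

This is why adding a second machine in the primary region adds a second *candidate*, not merely a second replica.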
## What Happened

1. We had a single machine running with production data
2. Ran `fly scale count 2 --region sjc` to add a replica
3. The new machine became primary before syncing from the existing node
4. Its empty database replicated to the original machine
5. All production data was lost
## The Root Cause

From the LiteFS documentation:

> `promote`: If true and the node is a candidate, it will automatically try to become primary after connecting to the cluster.

With `promote: true`, the new machine immediately tries to become primary. If it wins the election before receiving replica data, it pushes its empty database to all other nodes.
## The Workaround

We found a workaround by toggling the `promote` setting:

| Operation | `promote` Setting | Result |
|---|---|---|
| Normal deployment | `true` | Migrations run on primary |
| Horizontal scaling | `false` | New nodes join as replicas safely |
To scale safely:

1. Set `promote: false` in `litefs.yml`
2. Deploy: `fly deploy`
3. Scale: `fly scale count 2 --region sjc`
4. Set `promote: true` back in `litefs.yml`
5. Deploy again: `fly deploy`
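Since forgetting step 4 would leave migrations broken, the whole cycle could be scripted. A minimal sketch, assuming `litefs.yml` contains a single `promote: <bool>` line, GNU/BSD `sed` semantics, and that the region/count values are illustrative (this is not an official LiteFS or Fly tool):

```shell
#!/bin/sh
# Hypothetical scale-safely helper sketch.
# Flips `promote` off so a new machine joins as a replica, scales,
# then flips it back on so future deploys run migrations on the primary.
set -eu

CONFIG="${CONFIG:-litefs.yml}"

# Rewrite the promote line in place, preserving its indentation.
set_promote() {
  sed -i.bak "s/^\([[:space:]]*\)promote: .*/\1promote: $1/" "$CONFIG"
}

# Print each fly command; execute it only when the CLI is installed,
# so the script can be dry-run locally.
run() { echo "+ $*"; command -v fly >/dev/null 2>&1 && "$@" || true; }

scale_safely() {
  set_promote false                      # 1. new nodes join as replicas only
  run fly deploy                         # 2. roll out the change
  run fly scale count 2 --region sjc     # 3. add the machine safely
  set_promote true                       # 4. restore promote
  run fly deploy                         # 5. roll out again
}
```

Even scripted, this still has a window where a normal deploy between steps 2 and 4 would skip migrations, which is part of why we are asking for a better pattern below.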
## The Trade-off

With `promote: false`, migrations fail on replicas because:

- `if-candidate: true` checks if a node *can become* primary, not if it *is* primary
- Both machines in the same region are candidates
- Replicas cannot write, so `npx prisma migrate deploy` fails with: `fuse: write(): wal error: read only replica`
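One pattern that may sidestep the `if-candidate` limitation is gating the migration at runtime instead of in `litefs.yml`: LiteFS exposes a `.primary` file inside the FUSE mount on replicas (it names the current primary and does not exist on the primary itself). A sketch of such a wrapper, assuming the mount point is `/litefs`; note the file is also absent when no primary has been elected yet, so a production version would need a wait/retry:

```shell
#!/bin/sh
# Hypothetical wrapper sketch: gate migrations on actually *being*
# primary, rather than on being a candidate. LiteFS writes a ".primary"
# file into the FUSE mount on replicas only; its absence suggests this
# node currently holds the primary lease.
set -eu

migrate_if_primary() {
  dir="${LITEFS_DIR:-/litefs}"   # assumed mount point
  if [ -f "$dir/.primary" ]; then
    echo "replica (primary is $(cat "$dir/.primary")); skipping migrations"
  else
    echo "primary; running migrations"
    npx prisma migrate deploy
  fi
}
```

The `exec` entry in `litefs.yml` would then invoke this wrapper in place of calling `npx prisma migrate deploy` directly, and the `if-candidate` flag could be dropped since the check moves into the script; the exact `.primary` semantics are worth confirming against the LiteFS docs for your version.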
## Questions for the Community

1. **Is there a better approach?** Toggling `promote` for every scale operation seems error-prone.
2. **Should there be an `if-primary` option?** The current `if-candidate` doesn't distinguish between the actual primary and other candidates.
3. **Is there a way to ensure new nodes sync before attempting primary election?** Something like a "sync-first" mode?
4. **Are others using a different pattern for horizontal scaling with LiteFS?**
## Environment
- LiteFS with Consul lease
- SQLite + Prisma
- Fly.io deployment
- Multiple machines in same region
Any guidance would be appreciated. We’re happy to contribute to documentation if there’s an established best practice we’re missing.