LiteFS `promote: true` can cause data loss when scaling horizontally - seeking guidance

The Problem

We discovered a dangerous edge case when scaling a LiteFS-backed application horizontally. When promote: true is set in litefs.yml, adding a new machine can result in complete data loss if the new machine (with an empty database) wins the primary election before syncing from the existing primary.

Our Setup

# litefs.yml
lease:
  type: 'consul'
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: true
  advertise-url: 'http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202'

  consul:
    url: '${FLY_CONSUL_URL}'
    key: 'myapp-litefs/${FLY_APP_NAME}'

exec:
  - cmd: npx prisma migrate deploy
    if-candidate: true
  - cmd: npm start

What Happened

  1. We had a single machine running with production data
  2. Ran fly scale count 2 --region sjc to add a replica
  3. The new machine became primary before syncing from the existing node
  4. Its empty database replicated to the original machine
  5. All production data was lost

The Root Cause

From the LiteFS documentation:

promote: If true and the node is a candidate, it will automatically try to become primary after connecting to the cluster.

With promote: true, the new machine immediately tries to become primary. If it wins the election before receiving replica data, it pushes its empty database to all other nodes.

The Workaround

We found a workaround by toggling the promote setting:

Operation            promote setting   Result
Normal deployment    true              Migrations run on the primary
Horizontal scaling   false             New nodes join safely as replicas

To scale safely:

  1. Set promote: false in litefs.yml
  2. Deploy: fly deploy
  3. Scale: fly scale count 2 --region sjc
  4. Set promote: true back in litefs.yml
  5. Deploy again: fly deploy

The Trade-off

With promote: false, migrations fail on replicas because:

  • if-candidate: true checks if a node can become primary, not if it is primary
  • Both machines in the same region are candidates
  • Replicas cannot write, so npx prisma migrate deploy fails with: fuse: write(): wal error: read only replica
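A stopgap we've sketched (not a built-in LiteFS option, so treat it as an untested idea): LiteFS maintains a `.primary` file in the FUSE mount on replica nodes, containing the primary's hostname; the file is absent on the primary itself. A wrapper could gate migrations on that file instead of on candidacy:

```shell
# Hypothetical wrapper: run a command only on the node that currently holds
# the primary lease. LiteFS writes a ".primary" file into the FUSE mount on
# replicas (containing the primary's hostname); it is absent on the primary.
run_if_primary() {
  mount_dir="${LITEFS_DIR:-/litefs/data}"
  if [ -f "$mount_dir/.primary" ]; then
    echo "replica (primary: $(cat "$mount_dir/.primary")); skipping: $*" >&2
  else
    "$@"   # no .primary file: this node holds (or is acquiring) the lease
  fi
}

# Example usage, e.g. from the exec section:
#   run_if_primary npx prisma migrate deploy
```

Caveat: during startup the `.primary` file may not exist yet on either node, so the timing against lease acquisition would need real testing before relying on this.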

Questions for the Community

  1. Is there a better approach? Toggling promote for every scale operation seems error-prone.

  2. Should there be an if-primary option? The current if-candidate doesn’t distinguish between the actual primary and other candidates.

  3. Is there a way to ensure new nodes sync before attempting primary election? Something like a “sync-first” mode?

  4. Are others using a different pattern for horizontal scaling with LiteFS?

Environment

  • LiteFS with Consul lease
  • SQLite + Prisma
  • Fly.io deployment
  • Multiple machines in same region

Any guidance would be appreciated. We’re happy to contribute to documentation if there’s an established best practice we’re missing.


Yikes… Sorry to hear that. If you guys didn’t know about the volume snapshots already, I’d suggest pouncing on those as early as possible. They age-off (i.e., auto-delete) relatively quickly.

My understanding is that this shouldn’t happen. From another part of the official docs:

The recommended way to run migrations when you deploy is to have candidate nodes automatically promote themselves after they’ve connected to the cluster and are sync’d up.

The context implies that promote: true is sufficient. (I.e., that it already includes the sync-first you were proposing.)

A production LiteFS cluster should generally have ≥2 primary-candidate Machines always running, but it should be safe to add a new one at any time, without going through contortions.

I’d suggest posting the full fly.toml, litefs.yml, litefs mount invocation, LiteFS version, and as much of the logs as you can scrounge from Grafana, etc. LiteFS is very chatty during startup, and it should say when it was contacting the primary, when it was replicating, …

(When I tried to reproduce the issue here, I saw “snapshot received” on the new node strictly before it transitioned to “node is a candidate, automatically promoting to primary”. I.e., it did synch first. It looks like it doesn’t really even consider itself to be “connected to the cluster” until it gets that snapshot—which does stand to reason.)


Thanks for the reply and for sharing your findings when testing this.

First, yes - we did recover from snapshots. Lesson learned the hard way!

You’re right that based on the docs, promote: true should sync first. That’s what makes this confusing. Here are the full configs as requested:

fly.toml

app = "launchfast-pro-7369"
primary_region = "sjc"
kill_signal = "SIGINT"
kill_timeout = 5
processes = [ ]
swap_size_mb = 512

[experimental]
allowed_public_ports = [ ]
auto_rollback = true

[mounts]
source = "data"
destination = "/data"

[[services]]
internal_port = 8080
processes = [ "app" ]
protocol = "tcp"
script_checks = [ ]

  [services.concurrency]
  hard_limit = 100
  soft_limit = 80
  type = "requests"

  [[services.ports]]
  handlers = [ "http" ]
  port = 80
  force_https = true

  [[services.ports]]
  handlers = [ "tls", "http" ]
  port = 443

  [[services.tcp_checks]]
  grace_period = "1s"
  interval = "15s"
  restart_limit = 0
  timeout = "2s"

  [[services.http_checks]]
  interval = "10s"
  grace_period = "5s"
  method = "get"
  path = "/resources/healthcheck"
  protocol = "http"
  timeout = "2s"
  tls_skip_verify = false
  headers = { }

  [[services.http_checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  timeout = "5s"
  path = "/litefs/health"

litefs.yml

# Documented example: https://github.com/superfly/litefs/blob/dec5a7353292068b830001bd2df4830e646f6a2f/cmd/litefs/etc/litefs.yml
fuse:
  # Required. This is the mount directory that applications will
  # use to access their SQLite databases.
  dir: '${LITEFS_DIR}'

data:
  # Path to internal data storage.
  dir: '/data/litefs'

proxy:
  # matches the internal_port in fly.toml
  addr: ':${INTERNAL_PORT}'
  target: 'localhost:${PORT}'
  db: '${DATABASE_FILENAME}'

# The lease section specifies how the cluster will be managed. We're using the
# "consul" lease type so that our application can dynamically change the primary.
#
# These environment variables will be available in your Fly.io application.
lease:
  type: 'consul'
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: true
  advertise-url: 'http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202'

  consul:
    url: '${FLY_CONSUL_URL}'
    key: 'launchfast-litefs/${FLY_APP_NAME}'

exec:
  - cmd: npx prisma migrate deploy
    if-candidate: true

  # Set the journal mode for the database to WAL. This reduces concurrency deadlock issues
  - cmd: sqlite3 $DATABASE_PATH "PRAGMA journal_mode = WAL;"
    if-candidate: true

  # Set the journal mode for the cache to WAL. This reduces concurrency deadlock issues
  - cmd: sqlite3 $CACHE_DATABASE_PATH "PRAGMA journal_mode = WAL;"
    if-candidate: true

  - cmd: npm start

Dockerfile (relevant parts)

FROM node:20-bookworm-slim as base

# ... build stages ...

# Final stage
ENV LITEFS_DIR="/litefs/data"
ENV DATABASE_FILENAME="sqlite.db"
ENV DATABASE_PATH="$LITEFS_DIR/$DATABASE_FILENAME"
ENV DATABASE_URL="file:$DATABASE_PATH"
ENV INTERNAL_PORT="8080"
ENV PORT="8081"

# LiteFS setup
COPY --from=flyio/litefs:0.5.11 /usr/local/bin/litefs /usr/local/bin/litefs
ADD other/litefs.yml /etc/litefs.yml
RUN mkdir -p /data ${LITEFS_DIR}

CMD ["litefs", "mount"]

LiteFS Version

LiteFS v0.5.11, commit=63eab529dc3353e8d159e097ffc4caa7badb8cb3

Logs

I managed to recover the logs from the incident. Here’s the critical sequence that shows what happened:

Context: We had two machines:

  • d891265a6e4738 - Started with data, later lost it
  • e82d390b70e258 - Was the primary with data

We had some LiteFS checksum issues after running VACUUM directly on the raw database file (bypassing LiteFS). To fix it, we deleted the LTX files and the sqlite.db directory on d891265a6e4738. That’s when things went wrong.

The Data Loss Event (Jan 7, 13:39):

Machine d891265a6e4738 started with an empty database directory and became primary:

13:39:39 level=INFO msg="initializing consul: key=epic-stack-litefs/launchfast-pro-7369 url=https://:xxx@consul-iad-9.fly-shared.net/launchfast-pro-7369-vxemq2wmowmqgz63/ hostname=d891265a6e4738 advertise-url=http://d891265a6e4738.vm.launchfast-pro-7369.internal:20202"
13:39:39 level=INFO msg="wal-sync: short wal file exists on \"cache.db\", skipping sync with ltx"
13:39:39 level=INFO msg="database file is zero length on initialization: /data/litefs/dbs/sqlite.db/database"
13:39:39 level=INFO msg="using existing cluster id: \"LFSCC41BBA6865B68D4C\""
13:39:39 level=INFO msg="LiteFS mounted to: /litefs/data"
13:39:39 level=INFO msg="http server listening on: http://localhost:20202"
13:39:39 level=INFO msg="waiting to connect to cluster"
13:39:39 level=INFO msg="583B11148A88112A: existing primary found (e82d390b70e258), connecting as replica to \"http://e82d390b70e258.vm.launchfast-pro-7369.internal:20202\""
13:39:39 level=INFO msg="connected to cluster, ready"
13:39:39 level=INFO msg="node is a candidate, automatically promoting to primary"
13:39:39 level=INFO msg="583B11148A88112A: disconnected from primary, retrying"
13:39:39 level=INFO msg="583B11148A88112A: acquiring existing lease from handoff"
13:39:40 level=INFO msg="583B11148A88112A: primary lease acquired, advertising as http://d891265a6e4738.vm.launchfast-pro-7369.internal:20202"
13:39:40 level=INFO msg="proxy server listening on: http://localhost:8080"
13:39:40 level=INFO msg="executing command: npx [prisma migrate deploy]"

Key observations from the logs:

  1. database file is zero length on initialization - LiteFS knew the database was empty
  2. existing primary found (e82d390b70e258), connecting as replica - It found the real primary with data
  3. connected to cluster, ready - It said “ready” but there’s no “snapshot received” message
  4. node is a candidate, automatically promoting to primary - Immediately tried to promote
  5. disconnected from primary, retrying - Brief disconnection
  6. acquiring existing lease from handoff - Got the lease via handoff (not by waiting!)
  7. primary lease acquired - Now it’s primary with an empty database

Then it ran all migrations on the empty database:

13:39:46 level=INFO msg="starting from txid 0000000000000001, writing snapshot"
13:39:46 level=INFO msg="writing snapshot \"sqlite.db\" @ 0000000000000001"
13:39:46 Applying migration `20230914194400_init`
13:39:47 Applying migration `20240718074729_add_payments`
[... all 12 migrations applied to empty DB ...]
13:39:47 All migrations have been successfully applied.

And the other machine synced the empty database:

13:39:44 level=INFO msg="583B11148A88112A: stream connected ([fdaa:2:2f43:a7b:16b:ccd:65d7:2]:44188)"

Confirmation of data loss on the former primary:

root@e82d390b70e258:/myapp# sqlite3 /data/litefs/dbs/sqlite.db/database "SELECT COUNT(*) FROM User;"
0

What Went Wrong

Looking at the logs, the critical issue is that there’s no “snapshot received” message before “connected to cluster, ready”. In your test, you saw:

“snapshot received” on the new node strictly before it transitioned to “node is a candidate, automatically promoting to primary”

But in our logs, the sequence was:

  1. “connected to cluster, ready”
  2. “node is a candidate, automatically promoting to primary”
  3. “disconnected from primary, retrying”
  4. “acquiring existing lease from handoff”

It seems like the empty node connected, immediately tried to promote, got disconnected briefly, then acquired the lease via “handoff” from the existing primary - all without ever receiving a snapshot.

Possible Contributing Factors

  1. Zero-length database file - The node started with /data/litefs/dbs/sqlite.db/database being zero bytes (we had deleted the directory). Maybe LiteFS handles this differently than a node with no database at all?

  2. Deleted LTX files - Before this, we had deleted the LTX files on both machines to fix checksum errors. This may have broken the replication state.

  3. “Handoff” behavior - The log shows acquiring existing lease from handoff. This suggests the existing primary (e82d390b70e258) actively handed off the lease to the empty node. Why would it do that?

  4. No snapshot sync before promotion - The key missing piece is “snapshot received”. The node marked itself as “connected to cluster, ready” without actually having data.

Questions

  1. Is the “handoff” behavior expected? Should the existing primary hand off the lease to a node that hasn’t received a snapshot?

  2. Does starting with a zero-length database file bypass the normal sync-first behavior?

  3. Should promote: true check that the local database is in sync before attempting promotion?

Happy to provide any additional logs or run diagnostic commands if that helps track this down.


Thanks for the additional details… Yeah, manually intervening on /data/litefs is mentioned multiple times in the forum archives as causing serious (and confusing) problems. Combined with the “short WAL file”, i.e., premature EOF, message in the logs, I think that makes it the current prime suspect (as it were).

Ideally there would be better guardrails around things like this, and maybe the official docs should name the example low-level mountpoint /var/lib/never-touch-this-directly/ or similar.

I’ll try a “smoking gun” reproduction over the next few days, using this new information, but my guess is that it’s not a state that you can reach during normal operations…


Aside: for extra peace of mind, though, you might want to look into LiteFS Backup, which is a streaming-backups system that is compatible with Consul leasing, generously made available under an open source license by a fellow user.

(It provides point-in-time restores, and not just daily snapshots.)

Thanks for looking into this, and for the tip about LiteFS Backup - I’ll definitely check that out.

You’re right that manually touching /data/litefs was the root cause. Here’s the full sequence of events that led us there:

1. Volume filled up (Jan 6, 19:20)

We had a bug that created ~3.9 million rows in an ABTestUser table, bloating the database to ~753MB. The 1GB volume filled up:

19:20:00 level=INFO msg="fuse: write(): wal error: wal frame data: write /data/litefs/dbs/sqlite.db/wal: no space left on device"

The machine started restart-looping because it couldn’t write.

2. Extended volumes on both machines (Jan 7)

We extended both volumes from 1GB to 8GB using fly volumes extend. This allowed the machines to start again.

3. Dropped the ABTestUser table

To free up space, we dropped the ABTestUser table (the one with 3.9 million rows causing the bloat). This removed the data but SQLite doesn’t reclaim disk space automatically - you need to VACUUM for that.
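This behavior is easy to demonstrate locally with the stock sqlite3 CLI, entirely outside LiteFS (the table and row counts here are illustrative): dropping a table frees pages inside the file, but the file itself only shrinks after VACUUM rewrites it.

```shell
# Local demonstration (stock sqlite3 CLI, no LiteFS involved):
# DROP TABLE marks pages free inside the file; only VACUUM rewrites the
# file and returns the space to the filesystem.
db=$(mktemp)
sqlite3 "$db" "
  CREATE TABLE bloat(x);
  WITH RECURSIVE c(i) AS (SELECT 1 UNION ALL SELECT i+1 FROM c WHERE i < 1000)
  INSERT INTO bloat SELECT randomblob(1024) FROM c;"
before=$(wc -c < "$db")
sqlite3 "$db" "DROP TABLE bloat;"
after_drop=$(wc -c < "$db")     # unchanged: freed pages stay on the freelist
sqlite3 "$db" "VACUUM;"
after_vacuum=$(wc -c < "$db")   # now the file has actually shrunk
echo "before=$before after_drop=$after_drop after_vacuum=$after_vacuum"
```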

4. Attempted VACUUM through LiteFS (Jan 7, 12:50)

We tried to reclaim space by running VACUUM through the LiteFS FUSE mount (via litefs.yml exec command):

12:50:21 level=INFO msg="executing command: sqlite3 [/litefs/data/sqlite.db VACUUM;]"

This caused LiteFS panics - index out of range errors in the checksum code:

12:51:10 level=INFO msg="fuse: panic in handler for Unlock [...]: runtime error: index out of range [52] with length 1
[...]
github.com/superfly/litefs.(*DB).checksum(0xc000185600, 0x4a, 0x2f0d9?)
        /src/litefs/db.go:3227 +0x42c

5. Direct VACUUM on raw database file (bypassing LiteFS)

Since VACUUM through LiteFS was panicking, we ran VACUUM directly on the raw database file:

sqlite3 /data/litefs/dbs/sqlite.db/database "VACUUM;"

This worked - the database went from 753MB to 300KB. Data was intact when queried.

6. Checksum mismatch on restart (Jan 7, 13:20)

After restarting, LiteFS refused to start because the database checksum no longer matched the LTX files:

13:20:28 ERROR: cannot open store: open databases: open database("sqlite.db"): recover ltx: database checksum 9041e1378c03326c on TXID 00000000003bf5ac does not match LTX post-apply checksum da1085844b44090a

Machine hit max restart count (10) and stopped.

7. Deleted LTX files to fix checksum error

To get the machine running again, we deleted the LTX files:

rm /data/litefs/dbs/sqlite.db/ltx/*

8. Replication failed with EOF errors (Jan 7, 13:31)

After clearing LTX, the machine tried to sync as a replica but kept failing:

13:31:21 level=INFO msg="1A56FFE272DE2A60: disconnected from primary with error, retrying: process ltx stream frame: write ltx file: unexpected EOF"

This repeated continuously.

9. Deleted sqlite.db directory on one machine

To fix the EOF errors, we deleted the entire sqlite.db directory on d891265a6e4738:

rm -rf /data/litefs/dbs/sqlite.db

10. Deleted LTX files on the primary too

At this point the primary (e82d390b70e258) still had the data (verified with SELECT COUNT(*) FROM User = 8), but we deleted its LTX files too to try to get a clean sync:

rm /data/litefs/dbs/sqlite.db/ltx/*

11. Data loss (Jan 7, 13:39)

When d891265a6e4738 restarted with an empty database directory, it:

  1. Found the primary
  2. Connected but never received a snapshot
  3. Immediately promoted itself via handoff
  4. Ran migrations on an empty database
  5. Replicated the empty database to the former primary

Summary

The chain was:

  1. Volume full → restart loops
  2. Volume extended → machines can start
  3. Dropped ABTestUser table → data removed, but disk space not reclaimed
  4. VACUUM through LiteFS → panics
  5. VACUUM on raw file → works, but breaks checksums
  6. Delete LTX to fix checksums → breaks replication
  7. Delete sqlite.db to fix replication → creates empty node
  8. Delete LTX on primary → probably broke the handoff safety
  9. Empty node promotes → data loss

The Real Question

You mentioned this isn’t a state reachable during “normal operations” - but what is the recommended way to recover from a full volume?

We couldn’t VACUUM through LiteFS (panics), and VACUUMing directly broke checksums. Extending the volume bought us time, but the database was still bloated.

Is there a supported way to:

  1. VACUUM a LiteFS-managed database safely?
  2. Recover from a checksum mismatch without manual intervention?
  3. Handle a full volume without getting into this state?

Thanks again for investigating. I agree better guardrails would help - even a warning like “database modified outside LiteFS, refusing to start” would have prompted us to restore from snapshot earlier rather than digging deeper into manual fixes.


Generally speaking, the litefs export / litefs import pair is the main escape hatch, although I’m surprised that the first VACUUM attempt resulted in a low-level panic, :thinking:…

For clustered databases, in general, a good recovery heuristic is to make a local backup of the database (to your own development laptop/desktop) and then reduce the cluster down to a single node. Preferably, remove the ones that aren’t currently the primary. On the Fly.io platform, in particular, you’ll also want to explicitly destroy the remnant volumes, since those are prone to carry corrupt state over into your next deployment.

As far as I know, vacuuming on /litefs/data (the FUSE mount) is intended to be fully supported. The test suite has a dedicated test of that, if I’m reading it correctly. Perhaps this is a bug triggered only in WAL mode?

https://github.com/superfly/litefs/issues/445


A couple other thoughts/notes…

To be extra conservative, --with-new-volumes can be specified when scaling horizontally. This ensures that you don’t accidentally pick up a corrupt one next time. (By default, fly scale count will reuse any existing volumes that it finds lying around.) As a safety net, you can, moreover, request an immediate snapshot of the primary’s volume, instead of waiting for its 24h cycle to roll around.

Running out of disk space is a difficult situation for all databases… Legacy Postgres had a lot of problems with that as well, from what I’ve heard, and a guardrail was added that drops the cluster into read-only mode when it gets to 90% used (or thereabouts).
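A rough app-level version of that guardrail could be bolted on today (a sketch only; the 90% threshold and the `/data` path are illustrative, and nothing here is LiteFS- or Fly-specific):

```shell
# Sketch of an app-level disk guardrail: refuse to proceed when the
# filesystem holding the data volume is nearly full, instead of letting
# writes fail mid-transaction. Uses GNU df's --output option.
check_disk() {
  # $1 = a path on the filesystem to check, $2 = max allowed used percentage
  pct=$(df --output=pcent "$1" | tail -n 1 | tr -dc '0-9')
  [ "$pct" -lt "$2" ]
}

# Example: bail out of startup if /data is >= 90% used.
#   check_disk /data 90 || { echo "volume nearly full, refusing to start" >&2; exit 1; }
```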

Thanks for the detailed guidance! This is really helpful.

VACUUM panic

Yes, the panic during VACUUM through the FUSE mount (/litefs/data) was surprising to us too. We’re running WAL mode, so that might be related:

exec:
  - cmd: sqlite3 $DATABASE_PATH "PRAGMA journal_mode = WAL;"
    if-candidate: true

The panic was specifically in the checksum code:

runtime error: index out of range [52] with length 1
github.com/superfly/litefs.(*DB).checksum(0xc000185600, 0x4a, 0x2f0d9?)
    /src/litefs/db.go:3227 +0x42c
github.com/superfly/litefs.(*DB).CommitWAL(0xc000185600, {0xbeb4e8, 0xc0002f1a40})
    /src/litefs/db.go:1681 +0x1bf7

I believe the issue you linked is exactly what happened to us. I'll post the full details there - it could help others who encounter a full volume.

litefs export/import

Good to know this is the official escape hatch. However, I want to flag that the export/import workflow feels like a workaround for what seems like a bug.

We expected VACUUM through the FUSE mount to work - it’s a standard SQLite operation, and LiteFS is designed to transparently handle SQLite operations. The fact that it panicked with an index out of range error in the checksum code suggests something went wrong that shouldn’t have.

The challenge is that export/import as a VACUUM alternative is a bit of a chicken-and-egg problem. We’d only know to use it if we already knew VACUUM through LiteFS would panic. In our case:

  1. We ran VACUUM through LiteFS (expecting it to work)
  2. It panicked
  3. At this point, we didn’t know about export/import, so we tried VACUUMing the raw file
  4. That broke checksums, which led to the cascade of issues

If we had known VACUUM would fail, yes, the safe workflow would have been:

  1. litefs export -name sqlite.db /tmp/backup.db - extract database
  2. fly sftp get /tmp/backup.db ./local-backup.db - download locally
  3. sqlite3 ./local-backup.db "VACUUM;" - VACUUM locally
  4. fly sftp shell → put ./local-backup.db /tmp/backup.db - upload back
  5. rm -rf /data/litefs/dbs/sqlite.db - delete bloated LiteFS state
  6. fly machine restart <machine-id> - restart (LiteFS starts fresh)
  7. litefs import -name sqlite.db /tmp/backup.db - reimport the vacuumed database

But the real question is: when VACUUM through LiteFS panics, what’s the expected recovery path?

At that point, LiteFS is still running (it caught the panic), but the VACUUM failed. Is there guidance on what to do next? The export/import workflow would work, but it requires knowing that VACUUM will fail before you try it.

Thanks again for the detailed response.

Almost all those steps are reasonable, but not this one, :dragon:. I understand the temptation, but don’t ever modify that directory, no matter what the provocation.

I tried a non-interventionist export → vacuum → import cycle over here, without a manual rm, and it returned to reasonable disk usage on its own after the default 10 minute LTX retention period.

There isn’t general-purpose official guidance on what to do when you see a low-level panic in the logs, as far as I know. The old official disaster recovery doc would be the closest, I think, although obviously that was written for a different context.

Personally, I would be conservative and restore from backups at that point, after making efforts to litefs export, fork fallback volumes (using a name other than data, to prevent them from getting auto-attached subsequently), etc. My own recommendations are not at all authoritative, though.

Yeah, the colloquialism “escape hatch” means exactly that. It’s like the emergency doors on an airplane that can be popped open after a crash landing.

If you guys have time someday to contribute to that GitHub thread, that would probably help get the original panic bug fixed…

@mayailurus Have you applied to work at Fly yet? :squinting_face_with_tongue: :relieved_face:


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.