Hi everyone,
With the sunsetting of LiteFS Cloud on the horizon, I’m in the process of migrating to Litestream + Tigris backup storage while continuing to use LiteFS with static leasing. However, I’ve run into a few roadblocks. Currently I can successfully replicate the database using only a single primary node, but I’m aiming to set up read-only replicas in different regions. Unfortunately, any replica in a region other than the primary fails with the following errors:
```
time=2024-10-09T22:11:14.423Z level=ERROR msg="failed to run" error="ensure wal exists: disk I/O error"
time=2024-10-09T22:11:14.423Z level=ERROR msg="error closing db" db=/litefs/my-db.db error="ensure wal exists: disk I/O error"
level=INFO msg="fuse: write(): wal error: read only replica"
```
These errors appear after a new replica starts successfully, and they repeat indefinitely. The last line suggests that something is trying to write to the WAL on a node where LiteFS has mounted the database read-only, which would explain the disk I/O errors.
Thinking through the issue, I suspect the problem lies in my `litefs.yml` configuration. In the `exec:` section, I’m currently using the following command, taken from the Fly.io LiteFS documentation:

```yaml
- cmd: "litestream replicate -exec run-server"
```

But since `litestream replicate` should only run on the primary node, I haven’t found a proper way to start my app on the replica nodes. I attempted something like this:
```yaml
# If candidate, start server with replication using Litestream to S3
- cmd: "litestream replicate -exec run-server"
  if-candidate: true

# If not candidate, just start the server
- cmd: "run-server"
  if-candidate: false
```
But I don’t think `if-candidate: false` is correct, because when deployed, the replica node always fails. One workaround I’ve been considering is a wrapper script, sketched below.
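This is untested, and the script name `entry.sh` is just a placeholder, but the idea is to drop the conditional `exec` entries and branch inside a shell script instead, reusing the same region comparison my `lease:` section already relies on:

```sh
#!/bin/sh
# entry.sh -- hypothetical wrapper that LiteFS would start via `- cmd: "/entry.sh"`.
# Only the primary candidate runs Litestream (which needs write access to the WAL);
# replicas start the app directly so nothing writes to the read-only mount.
if [ "$FLY_REGION" = "$PRIMARY_REGION" ]; then
  exec litestream replicate -exec run-server
else
  exec run-server
fi
```

That said, I’d prefer a pure `litefs.yml` solution if one exists.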
Is there a way to use the same LiteFS config for replicas, similar to how it was possible with dynamic leasing? Or, if not, what is the correct way to solve this? Below is my full `litefs.yml` configuration:
```yaml
fuse:
  dir: "${LITEFS_DIR}"

data:
  dir: "/data/litefs"

exit-on-error: false

proxy:
  addr: ":${INTERNAL_PORT}"
  target: "localhost:${PORT}"
  db: "${DB_URL}"

exec:
  # Run migrations
  - cmd: "goose -dir ${SCHEMAS_DIR} sqlite3 ${DB_URL} up"
    if-candidate: true

  # Set the journal mode for the database to WAL. This reduces concurrency deadlock issues
  - cmd: "sqlite3 ${DB_FILE_URL} 'PRAGMA journal_mode = WAL;'"
    if-candidate: true

  # If candidate, start server with replication using Litestream to S3
  - cmd: "litestream replicate -exec run-server"

lease:
  type: "static"
  advertise-url: "http://${PRIMARY_REGION}.${FLY_APP_NAME}.internal:20202"
  hostname: "${PRIMARY_REGION}.${FLY_APP_NAME}.internal"
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: ${FLY_REGION == PRIMARY_REGION}
```
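In case it’s relevant, my `litestream.yml` looks roughly like this (the bucket name and replica path below are placeholders; the endpoint is Tigris’s standard S3-compatible endpoint):

```yaml
dbs:
  - path: /litefs/my-db.db        # same file shown in the error logs above
    replicas:
      - type: s3
        bucket: my-backup-bucket  # placeholder
        path: my-db
        endpoint: https://fly.storage.tigris.dev
        region: auto
```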