Error with Read-Only Replica Configuration in LiteFS Cloud Migration to Litestream + Tigris

Hi everyone,

With the sunsetting of LiteFS Cloud on the horizon :pensive:, I’m in the process of migrating to Litestream + Tigris backup storage while continuing to use LiteFS with static leasing. However, I’ve encountered a few roadblocks. Currently, I can successfully replicate the database using only a single primary node, but I’m aiming to set up read-only replicas in different regions. Unfortunately, I’m running into the following error on any replica in regions other than the primary:

time=2024-10-09T22:11:14.423Z level=ERROR msg="failed to run" error="ensure wal exists: disk I/O error"
time=2024-10-09T22:11:14.423Z level=ERROR msg="error closing db" db=/litefs/my-db.db error="ensure wal exists: disk I/O error"
level=INFO msg="fuse: write(): wal error: read only replica"

These errors appear right after a new replica starts successfully, and they repeat in the logs indefinitely.

Thinking through the issue, I suspect the problem lies in my litefs.yml configuration. In the exec: section, I’m currently using the following command:

  - cmd: "litestream replicate -exec run-server"

which is based on the Fly.io LiteFS documentation. But since Litestream replication should only run on the primary node, I haven't found a proper way to start my app on replica nodes. I attempted something like this:

 # If candidate, start server with replication using Litestream to S3
  - cmd: "litestream replicate -exec run-server"
    if-candidate: true

  # If not candidate, just start the server
  - cmd: "run-server"
    if-candidate: false

But I don’t think if-candidate: false is correct, because when deployed, the node always fails.

Is there a way to use the same LiteFS config for replicas, similar to how it was possible with dynamic leasing? Or how can I correctly solve this issue? Below is my full litefs.yml configuration:

fuse:
  dir: "${LITEFS_DIR}"

data:
  dir: "/data/litefs"

exit-on-error: false

proxy:
  addr: ":${INTERNAL_PORT}"
  target: "localhost:${PORT}"
  db: "${DB_URL}"

exec:
  # Run migrations
  - cmd: "goose -dir ${SCHEMAS_DIR} sqlite3 ${DB_URL} up"
    if-candidate: true

  # Set the journal mode for the database to WAL. This reduces concurrency deadlock issues
  - cmd: "sqlite3 ${DB_FILE_URL} 'PRAGMA journal_mode = WAL;'"
    if-candidate: true

  # If candidate, start server with replication using Litestream to S3
  - cmd: "litestream replicate -exec run-server"

lease:
  type: "static"
  advertise-url: "http://${PRIMARY_REGION}.${FLY_APP_NAME}.internal:20202"
  hostname: "${PRIMARY_REGION}.${FLY_APP_NAME}.internal"
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: ${FLY_REGION == PRIMARY_REGION}

UPDATE: Another Roadblock

I initially managed to resolve my issue by creating the following bash script (note that run-server is my server’s executable binary):

#!/bin/bash

if [ "$FLY_REGION" == "$PRIMARY_REGION" ]; then
  echo "Running primary node: starting replication with Litestream"
  litestream replicate -exec run-server
else
  echo "Running replica node: starting server without replication"
  run-server
fi
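
One small refinement I'd consider (my own tweak, same logic as above): using exec so the long-running process replaces the shell, which means shutdown signals from LiteFS go straight to Litestream or the server instead of stopping at the wrapper script:

#!/bin/bash
set -euo pipefail

if [ "$FLY_REGION" == "$PRIMARY_REGION" ]; then
  echo "Running primary node: starting replication with Litestream"
  # exec replaces the shell, so SIGTERM reaches Litestream directly
  exec litestream replicate -exec run-server
else
  echo "Running replica node: starting server without replication"
  exec run-server
fi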

I then referenced this script in the exec section like so:

# Start the server via the wrapper script (primary replicates with Litestream; replicas run the server directly)
- cmd: "/usr/local/bin/fly-run-server.sh"

Everything ran smoothly until I sent a POST request to my app, at which point I encountered the following error in the proxy logs:

[proxy[90801623c32d28] bog [error] [PR03] could not find a good candidate within 21 attempts at load balancing. last error: [PR01] no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)

Both my primary node (dfw) and replica (mia) are up. However, when I SSH into the mia node, I cannot find the .primary file. Additionally, when I execute curl -i -X POST http://localhost:8080/endpoint, I get the following response:

HTTP/1.1 200 OK
Fly-Replay: instance=dfw.my-server.internal
Date: Thu, 10 Oct 2024 17:28:30 GMT
Content-Length: 0
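
For reference, my understanding is that LiteFS only exposes .primary on replica nodes, inside the FUSE mount directory (fuse.dir, i.e. ${LITEFS_DIR} in my config), and that it contains the primary's hostname as configured under lease.hostname. A quick way to check on the mia machine (paths follow my config above):

# .primary should only exist on replicas and should contain the primary's hostname
ls -la "$LITEFS_DIR"
cat "$LITEFS_DIR/.primary"   # expected: dfw.my-server.internal (my lease.hostname)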

The request is being replayed to the hostname I set in litefs.yml, but that seems to be causing a new issue: based on the Fly.io docs, the instance field of the Fly-Replay header should contain the primary machine's ID, not a hostname. I'm stuck here, since there's no environment variable that exposes the primary machine's ID at runtime.

I believe having a PRIMARY_MACHINE_ID environment variable would simplify this configuration significantly. Alternatively, it would be helpful if the internal DNS name ("${PRIMARY_REGION}.${FLY_APP_NAME}.internal") were accepted as a Fly-Replay target, so replicas could route writes to the primary node.

Unless something like this exists and I'm missing it, the only workaround I can think of is hardcoding the primary machine's ID as the lease hostname and redeploying, although I'd prefer not to do that.
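
The closest I can get to automating that workaround is looking the machine ID up at deploy time and passing it in as an environment variable that litefs.yml can expand (e.g. hostname: "${PRIMARY_MACHINE_ID}"). A rough sketch, with the caveat that this is my own idea and assumes flyctl and jq are available and that fly machines list --json reports id and region fields the way I expect:

#!/bin/bash
set -euo pipefail

APP="my-server"          # Fly app name
PRIMARY_REGION="dfw"     # region of the static-lease primary

# Look up the machine ID of the (single) machine running in the primary region.
PRIMARY_MACHINE_ID=$(fly machines list -a "$APP" --json \
  | jq -r --arg region "$PRIMARY_REGION" '.[] | select(.region == $region) | .id' \
  | head -n 1)

echo "Primary machine ID: $PRIMARY_MACHINE_ID"

# Make it available at runtime so litefs.yml can reference ${PRIMARY_MACHINE_ID}
# in lease.hostname instead of the region-based DNS name.
fly secrets set -a "$APP" PRIMARY_MACHINE_ID="$PRIMARY_MACHINE_ID"

Setting the secret restarts the machines, so the new hostname should be picked up on the next boot, but a first-class PRIMARY_MACHINE_ID variable would still be much cleaner.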
