LiteFS "wal error: read only replica" on a POST Request

I have two Machines in the same region, but when I make a POST request that lands on a non-primary Machine, it fails with the error “wal error: read only replica”. I saw that you need to use fly-replay for GET requests, but I thought that a POST request would be automatically redirected to the primary Machine. Do I have to check whether I'm on the primary Machine every time I receive a POST?

Here are my configuration files. First, fly.toml:

app = ''
primary_region = 'cdg'
kill_signal = 'SIGINT'
kill_timeout = 5
swap_size_mb = 512

[experimental]
auto_rollback = true

[mounts]
source = 'data'
destination = '/data'

[[services]]
  internal_port = 8080
  processes = [ "app" ]
  protocol = "tcp"
  script_checks = [ ]

  [services.concurrency]
  hard_limit = 100
  soft_limit = 80
  type = "requests"

  [[services.ports]]
  handlers = [ "http" ]
  port = 80
  force_https = true

  [[services.ports]]
  handlers = [ "tls", "http" ]
  port = 443

  [[services.tcp_checks]]
  grace_period = "1s"
  interval = "15s"
  restart_limit = 0
  timeout = "2s"

  [[services.http_checks]]
  interval = "10s"
  grace_period = "5s"
  method = "get"
  path = "/healthcheck"
  protocol = "http"
  timeout = "2s"
  tls_skip_verify = false
  headers = { }

  [[services.http_checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  timeout = "5s"
  path = "/litefs/health"

[[vm]]
  memory = '512mb'
  cpu_kind = 'shared'
  cpus = 1

And here's litefs.yml:

# Documented example: https://github.com/superfly/litefs/blob/dec5a7353292068b830001bd2df4830e646f6a2f/cmd/litefs/etc/litefs.yml
fuse:
  # Required. This is the mount directory that applications will
  # use to access their SQLite databases.
  dir: '${LITEFS_DIR}'

data:
  # Path to internal data storage.
  dir: '/data/litefs'

# This flag ensures that LiteFS continues to run if there is an issue on startup.
# It makes it easy to ssh in and debug any issues you might be having rather
# than continually restarting on initialization failure.
exit-on-error: false

proxy:
  # matches the internal_port in fly.toml
  addr: ':8080'
  target: 'localhost:3000'
  db: '${DATABASE_FILENAME}'

# The lease section specifies how the cluster will be managed. We're using the
# "consul" lease type so that our application can dynamically change the primary.
#
# These environment variables will be available in your Fly.io application.
lease:
  type: 'consul'
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: true
  advertise-url: 'http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202'

  consul:
    url: '${FLY_CONSUL_URL}'
    key: 'nicolas-besnard-litefs/${FLY_APP_NAME}'

exec:
  - cmd: npx prisma migrate deploy
    if-candidate: true

  # Set the journal mode for the database to WAL. This reduces concurrency deadlock issues
  - cmd: sqlite3 $DATABASE_PATH "PRAGMA journal_mode = WAL;"
    if-candidate: true

  - cmd: npm start

No… The LiteFS proxy, which runs within your Machine, should be doing that itself.

The screenshot from your earlier post does show it listening on the correct port, so I think this might be a case of the outer Fly edge proxy having partially stale metadata.

Could you try changing proxy.target in litefs.yml back to 8081? I know you switched it to 3000 to avoid (seemingly) spurious warnings in the logs, but it looks like the mismatch might be more substantive than that…
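
For reference, that would make the proxy stanza look something like this (with the Node app itself configured to listen on 8081 again, however you currently set its port):

proxy:
  # matches the internal_port in fly.toml
  addr: ':8080'
  # back to the port the Node app listens on
  target: 'localhost:8081'
  db: '${DATABASE_FILENAME}'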


Aside: You can see a redirection in detail via:

$ fly ssh console --select  # select non-primary
# cat /litefs/.primary
432189ff765432
# curl -i -X POST http://localhost:8080/endpoint
HTTP/1.1 200 OK
Fly-Replay: instance=432189ff765432
Date: Sun, 25 Aug 2024 21:13:48 GMT
Content-Length: 0

The Fly edge proxy will notice the Fly-Replay header and then send the client’s original request (the one that came in via app-name.fly.dev) to 432189ff765432 instead.
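
To answer the original question directly: you shouldn’t need to do this check in your own code, because the LiteFS proxy is doing it for you, but at the application level it would amount to roughly the following. This is a hypothetical Express-style sketch (assuming an Express app and that LITEFS_DIR points at the FUSE mount), not code from your app:

// Hypothetical sketch: replicate what the LiteFS proxy does for write requests.
// Assumes Express and that LITEFS_DIR points at the FUSE mount (e.g. /litefs).
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";
import type { NextFunction, Request, Response } from "express";

const PRIMARY_FILE = join(process.env.LITEFS_DIR ?? "/litefs", ".primary");

export function replayWritesToPrimary(req: Request, res: Response, next: NextFunction) {
  // The .primary file only exists on replicas; it contains the primary's hostname.
  if (req.method !== "GET" && req.method !== "HEAD" && existsSync(PRIMARY_FILE)) {
    const primary = readFileSync(PRIMARY_FILE, "utf8").trim();
    // Hand the request back to the Fly edge proxy to be replayed on the primary,
    // mirroring the empty 200 + fly-replay header in the curl transcript above.
    res.set("fly-replay", `instance=${primary}`);
    return res.status(200).send();
  }
  next();
}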

It’s working :thinking: Not sure what I’m supposed to do.

LiteFS and SQLite are promising, but the documentation is too light, and I would have wasted less time and money using Postgres.

Why not use a managed SQLite service like Turso? (There are probably a few other providers.)

In my view, this indicates a bug in Fly’s networking infrastructure—and not in LiteFS.


It might help to recap a bit…

Two proxies are involved: the LiteFS proxy and the Fly edge proxy. The LiteFS proxy you can fully observe; the other you mostly cannot. It looks like it’s the latter that is malfunctioning sporadically.

                         LiteFS proxy    Fly edge proxy
Can be observed?         Yes             Mostly not
Can be bypassed?         Yes             No
Appears to be working?   Yes             Sporadic failures (?)
Part of LiteFS?          Yes             No
Configuration file       litefs.yml      fly.toml

The hypothesis is that the Fly edge proxy has two different “lobes” (so to speak), one thinking internal_port = 8080 and another thinking internal_port = 3000.

(These situations are the bane of distributed systems everywhere.)

An incoming request to app-name.fly.dev from a client hits one of those two lobes basically at random. In the first case, it correctly gets forwarded to the LiteFS proxy, which then applies the necessary POST redirection. In the second case, it bypasses the LiteFS proxy (which, unlike the other proxy layer, can be bypassed) entirely and immediately hits Node. This causes an attempted write on the read-only replica.

[Diagram (with a mildly Bauhaus feel) showing two different paths from the client to the Node process within the Machine: in black, client -> Fly edge proxy (first "lobe") -> LiteFS proxy -> Node; in orange, client -> Fly edge proxy (errant second "lobe", with stale metadata) -> Node, with the last arc helpfully labeled "whoops!".]
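
If you want to catch the errant path in the act, one option is to log the peer address of incoming write requests: the LiteFS proxy targets localhost:3000, so requests it relays should arrive from loopback, whereas requests that bypassed it and came straight from the Fly edge proxy shouldn't. A hypothetical Express-style sketch (again assuming Express; do double-check against the addresses you actually see in your logs):

// Hypothetical diagnostic: did a write request come via the LiteFS proxy (loopback)
// or straight from the Fly edge proxy (non-loopback peer address)?
import type { NextFunction, Request, Response } from "express";

const LOOPBACK = new Set(["127.0.0.1", "::1", "::ffff:127.0.0.1"]);

export function logWritePath(req: Request, _res: Response, next: NextFunction) {
  if (req.method !== "GET" && req.method !== "HEAD") {
    const peer = req.socket.remoteAddress ?? "unknown";
    const viaLiteFSProxy = LOOPBACK.has(peer);
    console.log(`[write] ${req.method} ${req.originalUrl} peer=${peer} viaLiteFSProxy=${viaLiteFSProxy}`);
  }
  next();
}

Writes logged with a non-loopback peer would be taking the orange path in the diagram.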

If you want to pursue this further (and you might not want to), a simple thing you can try next is creating an entirely new application and repeating your experiments there.

(This metadata is associated with the application and the Machines, as I understand it. Fresh instances would hopefully shake things loose.)

I do agree to some extent…

LiteFS is v0.5 software and is generally not a good choice when there is a low budget for development and tinkering time.

That said, you have run into way more trouble than is typical, :adhesive_bandage:.

Apparently, these extra problems have their real root elsewhere, not within LiteFS.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.