sqlite error when attempting to update records with multiple machines

I’m regularly seeing the following error come through when I scale beyond one machine:

Error occurred during query execution:
ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(SqliteError { extended_code: 778, message: Some("disk I/O error") }), transient: false })

I deployed my Remix (Epic Stack) app following the instructions here: epic-stack/docs/deployment.md at main · epicweb-dev/epic-stack · GitHub, and saw no issues.

But after scaling up to 2 machines (following the instructions here: epic-stack/docs/database.md at main · epicweb-dev/epic-stack · GitHub), I’ve started to see these errors come through, and I see the same thing when connecting Prisma Studio to the primary machine and attempting to update records there.

Is there some configuration I’m missing? I currently have 2 machines in the lhr region (with one of them currently the primary) and another couple in ewr.

Any help very much appreciated!

Hi… Given that it occurs during writes…

If the local node is not the primary then SQLite […] will return a disk I/O error

https://fly.io/docs/litefs/primary/

Perhaps you were unknowingly connecting to the other machine in the primary region, rather than to the actual primary machine? (They do hand off, in some circumstances.)

fly logs should help narrow this down, either way…
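
If you want to check from inside the app which node a request landed on, LiteFS exposes a .primary file in its mount directory on replicas (it contains the primary’s hostname), and no such file on the primary itself. A rough sketch, assuming LITEFS_DIR points at the LiteFS mount as it does in the Epic Stack (the helper name is made up):

import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: reports whether this node is the LiteFS primary, and the
// primary's hostname if not. Assumes LITEFS_DIR is the LiteFS FUSE mount directory.
export function getLiteFSPrimaryInfo() {
  const litefsDir = process.env.LITEFS_DIR;
  if (!litefsDir) {
    return { isPrimary: true, primaryHostname: null }; // e.g. local dev without LiteFS
  }
  const primaryFile = join(litefsDir, ".primary");
  if (!existsSync(primaryFile)) {
    return { isPrimary: true, primaryHostname: null }; // no .primary file => this node holds the lease
  }
  return { isPrimary: false, primaryHostname: readFileSync(primaryFile, "utf8").trim() };
}

Logging the result of that in the failing request handler would tell you definitively whether the write is being attempted on a replica.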

Thank you for the prompt response and guidance :raised_hands: I must admit that I’m not very qualified in this space; I’m more of a frontend engineer.

It does look like it’s trying to update on a non-primary machine; I just don’t know why, or how to prevent it. Is additional configuration required?

For context, this is happening when a Remix loader attempts to update a user via Prisma:

await prisma.user.update({
  where: { id },
  data: { ...user },
})

PrismaClientUnknownRequestError: Invalid prisma.user.update() invocation: Error occurred during query execution: ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(SqliteError { extended_code: 778, message: Some("disk I/O error") }), transient: false })

How would I ensure that this write is always attempted on the primary?

I’m also pretty confused as to why I get the same error when connecting Prisma Studio to what is definitely the primary machine.

Following up: I have scaled back up to just two machines (fly scale count 2 --region lhr) in the primary region and am still seeing the issue.

I set up the log shipper to see if it surfaced anything else useful:

2024-03-20 13:14:35.577  [INFO] lhr ...3d708 GET /account 302 - - 90.359 ms
2024-03-20 13:14:35.577  [INFO] lhr ...3d708 GET /account 302 - - 90.359 ms
2024-03-20 13:14:36.302  [INFO] lhr ...3d708 level=INFO msg="fuse: write(): wal error: read only replica
2024-03-20 13:14:36.302  [INFO] lhr ...3d708 level=INFO msg="fuse: write(): wal error: read only replica

2024-03-20 13:14:36.312  [INFO] lhr ...3d708 ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(SqliteError { extended_code: 778, message: Some("disk I/O error") }), transient: false })
2024-03-20 13:14:36.312  [INFO] lhr ...3d708 ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(SqliteError { extended_code: 778, message: Some("disk I/O error") }), transient: false })
2024-03-20 13:14:37.133  [INFO] lhr ...3d708 GET /resources/stripe/create-customer 500 - - 1091.505 ms
2024-03-20 13:14:37.133  [INFO] lhr ...3d708 GET /resources/stripe/create-customer 500 - - 1091.505 ms
2024-03-20 13:14:37.181  [INFO] lhr ...336d8 HEAD / 200 - - 54.568 ms
2024-03-20 13:14:37.181  [INFO] lhr ...336d8 HEAD / 200 - - 54.568 ms

So the writes are happening on the replica (…3d708) and each request fires twice?

The LiteFS proxy was added to make that much easier, although it does have several limitations.

One of them is that you can’t perform writes within a GET request. (From the logs, it looks like you might be doing exactly that.)
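
So one straightforward fix is to move the write out of the loader and into an action: the proxy forwards non-GET requests to the primary, where the write can succeed. A rough sketch, with the route params, form field, and import path made up for illustration:

import type { ActionFunctionArgs } from "@remix-run/node";
import { json } from "@remix-run/node";
import { prisma } from "#app/utils/db.server.ts"; // wherever your Prisma client lives

// Hypothetical route module: the page submits a form (POST) here, the LiteFS
// proxy forwards the POST to the primary machine, and the update runs there.
export async function action({ request, params }: ActionFunctionArgs) {
  const { id } = params;
  if (!id) throw new Response("Not Found", { status: 404 });

  const formData = await request.formData();
  const name = formData.get("name");
  if (typeof name !== "string") {
    return json({ error: "name is required" }, { status: 400 });
  }

  await prisma.user.update({
    where: { id },
    data: { name },
  });
  return json({ ok: true });
}

(The GET /resources/stripe/create-customer request returning a 500 in your logs looks like the same pattern: whatever write that route performs would need to happen in an action, or otherwise be routed to the primary.)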

I’m surprised by this part, too.

How about trying with the sqlite3 command line—from within fly ssh console—to reduce moving parts?

$ fly ssh console --machine ...336d8  # primary
# # note:  this script is a little defensive about env vars not being defined, etc.
# test -d "${LITEFS_DIR:-not-there}" || echo litefs dir at unexpected location.
# test -f "${LITEFS_DIR}"/.primary && echo not primary.  # should not exist
# sqlite3 "${DATABASE_PATH:-/not-there/missing.db}"
> create table TST (X int);
> insert into  TST values (7);
> .q

Scanning the logs for recent "primary lease acquired" messages, to make sure they still match the machine ID, might also be prudent.

(The test -f way can give false positives.)
