I’ve been running a small Rails app in production using SQLite and LiteFS for a couple of weeks. About 4 days ago I cloned the app to a second region (so it now runs in yyz + cdg), and since then I have periodically been getting the following error:
SQLite3::SQLException: cannot rollback - no transaction is active.
Digging deeper into the root cause on BugSnag, I can see that the underlying error is:
SQLite3::IOException · disk I/O error
It doesn’t seem to be causing any major issues, but I’d love to get to the bottom of why this is happening. From my research, it seems most people report seeing this error when the host is low on disk space, which matches up with the error above. The volumes have tons of space (only 75MB used of a 1GB volume), so the only place that could be running out is the machines themselves (or the LiteFS proxy?). I think this only started after deploying to the second region, but I can’t be sure, because BugSnag wasn’t set up before that.
Curious if anyone has seen or reported this before and has any tips on how to mitigate.
Hi… Typically, this (very non-intuitively) means that you’re trying to write to a read-only replica:
If the local node is not the primary then SQLite will return […] a disk I/O error when using the write-ahead log. Unfortunately, there’s not a better error that LiteFS can return to SQLite when using the WAL.
If you haven’t encountered it already, the LiteFS Proxy can simplify the (mandatory) task of rerouting such writes to the primary, although it does have its limitations.
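To confirm from inside the app whether a given machine is currently a replica, one option is to look for the `.primary` file that LiteFS keeps in its mount directory on replica nodes. The following is a minimal sketch, not an official API: the `LitefsNode` module and its method names are made up for illustration, and the mount directory (often `/litefs`) is whatever your `litefs.yml` configures.

```ruby
# Sketch only: LitefsNode and its methods are hypothetical helper names.
module LitefsNode
  # Returns the primary's hostname as recorded by LiteFS, or nil when the
  # .primary file is absent (which is the case on the primary itself).
  def self.primary_hostname(mount_dir)
    File.read(File.join(mount_dir, ".primary")).strip
  rescue Errno::ENOENT
    nil
  end

  # True when this node should be treated as a read-only replica.
  def self.replica?(mount_dir)
    !primary_hostname(mount_dir).nil?
  end
end
```

Logging `LitefsNode.replica?("/litefs")` alongside the failing request should show whether the disk I/O errors line up with writes attempted on a replica. Note that the primary can change over time, so the check is only valid at the moment it runs.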
I started this app with LiteFS and missed a lot of the nuance in the documentation around the proxy. My app was already using the proxy, but there were two GET endpoints that had write side-effects, and the proxy treats GET requests as reads, so those writes ran on the replica and caused this error.
For one of them, the solution seems to be to move the write logic to a background job rather than doing it within the request.
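For GET endpoints like these, one possible approach is a small Rack middleware that asks Fly's proxy to replay the request on the primary by returning a `fly-replay` response header. This is a rough sketch under several assumptions: the class name, the `mount_dir:` and `paths:` options, and the 409 status are arbitrary choices of mine, and it assumes the contents of LiteFS's `.primary` file can be used as the Fly instance ID.

```ruby
# Sketch only: ReplayWritesToPrimary is a hypothetical middleware, not part
# of LiteFS or Rails. It assumes .primary holds the primary's machine ID.
class ReplayWritesToPrimary
  def initialize(app, mount_dir:, paths:)
    @app = app
    @primary_file = File.join(mount_dir, ".primary")
    @paths = paths
  end

  def call(env)
    # Only the listed paths are known to write; everything else passes through.
    return @app.call(env) unless @paths.include?(env["PATH_INFO"])

    primary = read_primary
    # No .primary file means this node is the primary, so handle it locally.
    return @app.call(env) if primary.nil?

    # Status code is an arbitrary choice; the header is what requests the replay.
    [409, { "fly-replay" => "instance=#{primary}" }, []]
  end

  private

  def read_primary
    File.read(@primary_file).strip
  rescue Errno::ENOENT
    nil
  end
end
```

In a Rails app this could be registered with something like `config.middleware.use ReplayWritesToPrimary, mount_dir: "/litefs", paths: ["/some/path"]`, where both values are placeholders for your own setup.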
This raises another question: how are jobs handled when they need to write? Is that something that I need to explicitly take care of in my app code? I’m using Solid Queue and having Puma manage the supervisor via the plugin. The implication (as far as I understand it) is that the Solid Queue supervisor will run on each machine, so writes from the job could potentially be pointed at the replica rather than the primary.
I’m going to deploy with this new Solid Queue setup and see what happens, but would love any insight you may have about this setup.