After running without problems for a year, a site of mine with a read-only sqlite database stopped working. I narrowed down the problem to the connection to sqlite timing out.
Nothing has changed on my side so not sure what the root cause is.
Oddly, it starts working briefly if I ssh into each machine and fire up a Python repl and open a sqlite connection.
But then it stops working the moment that machine is stopped.
Are you using LiteFS? I suppose with a r/o database, there’d be no point, but equally SQLite doesn’t have a listener in the traditional TCP sense, so I’d wonder what actually would time out.
If you SSH into the machine, does the web app start working, or does SQLite only respond via Repl?
As it’s read only, I’m just using a sqlite file in the container.
The web app starts working once I’ve successfully made the connection via the repl. One theory I have there is that the timeout is different for the web request versus the repl over ssh.
But I’m not sure why this suddenly became a problem after a year.
Well, that is most odd. Does it all work if you run it under Docker on your development machine?
I don’t do Python myself, but maybe if you put your connection code here, someone might spot something. Also, do Python db drivers generally have a debug or logging mode that can be tried?
Finally, is your database file sitting on a volume, or is it part of your Docker image?
As long as a connection to the database has been made, all subsequent attempts (until machine restart) are fast.
I don’t think the first connection can be successful via the web page since it times out (since today). But if I shell in and either run a standalone script that connects or connect via the repl then all subsequent db connections (including via the web app) are fast.
OK, mystifying. Good detail on the question now though.
Can you make the project available on GitHub, or a cut-down version of the same? I am wondering if there’s something really small that would make a difference, like a minor upgrade to Python, and it might take someone tinkering to see what the issue is. Of course, it might be that there is a weirdness in Fly that upsets Python in particular.
Note: I just timed running my standalone script (that is just doing a sqlite3.connect) and it takes just over a minute to run the first time so I bet that delay is what’s timing out in the web app.
As the site outage is a huge problem for users, I’ve temporarily turned off auto_stop_machines and set min_machines_running = 1. It will cost a lot more but will hopefully avoid the issue until the root cause is found.