sqlite connection started hanging

After running without problems for a year, a site of mine with a read-only sqlite database stopped working. I narrowed down the problem to the connection to sqlite timing out.

Nothing has changed on my side so not sure what the root cause is.

Oddly, it starts working briefly if I ssh into each machine and fire up a Python repl and open a sqlite connection.

But then it stops working the moment that machine is stopped.

1 Like

Are you using LiteFS? I suppose with a r/o database, there’d be no point, but equally SQLite doesn’t have a listener in the traditional TCP sense, so I’d wonder what actually would time out.

If you SSH into the machine, does the web app start working, or does SQLite only respond via Repl?

As it’s read only, I’m just using a sqlite file in the container.

The web app starts working once I’ve successfully made the connection via the repl. One theory I have there is that the timeout is different for the web request versus the repl over ssh.

But I’m not sure why this suddenly became a problem after a year.

Literally it’s the sqlite3.connect function in my Flask app.

If I do the same sqlite3.connect in the repl (or in a standalone Python script run on the ssh console) it works.

Well, that is most odd. Does it all work if you run it under Docker on your development machine?

I don’t do Python myself, but maybe if you put your connection code here, someone might spot something. Also, do Python db drivers generally have a debug or logging mode that can be tried?

Finally, is your database file sitting on a volume, or is it part of your Docker image?

The database file is part of the Docker image.

If I ssh into my fly machine, fire up a Python repl and do

import sqlite3
sqlite3.connect(“my_database.db”)

It takes a long time to respond the first time. Subsequent attempts are fast.

Unfortunately the moment the machine restarts, the problem starts all over again.

And as mentioned, this hasn’t been a problem for the last year and I hadn’t deployed for nine months.

Righto. To clarify here, do you mean a restart that happens as a result of a (re)deployment?

This is when you’re shelled in, right?

Or just the machine stopping after inactivity.

As long as a connection to the database has been made, all subsequent attempts (until machine restart) are fast.

I don’t think the first connection can be successful via the web page since it times out (since today). But if I shell in and either run a standalone script that connects or connect via the repl then all subsequent db connections (including via the web app) are fast.

OK, mystifying. Good detail on the question now though.

Can you make the project available on GitHub, or a cut-down version of the same? I am wondering if there’s something really small that would make a difference, like a minor upgrade to Python, and it might take someone tinkering to see what the issue is. Of course, it might be that there is a weirdness in Fly that upsets Python in particular.

Let me put the relevant files up as a gist.

Note: I just timed running my standalone script (that is just doing a sqlite3.connect) and it takes just over a minute to run the first time so I bet that delay is what’s timing out in the web app.

1 Like

Lovely, I might take a peek tomorrow evening :star_struck:

Here’s the gist: Dockerfile · GitHub

Great stuff. Does that Gist need a copy of requirements.txt too?

No, the requirements.txt file doesn’t have anything relevant to the test case.

1 Like

As the site outage is a huge problem for users, I’ve temporarily turned off auto_stop_machines and set min_machines_running = 1. It will cost a lot more but will hopefully avoid the issue until the root cause is found.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.