PostgreSQL "Connection is closed" error after a few minutes of activity

Hello there!

I’ve successfully deployed my Python app with a PostgreSQL cluster. The app manages the DB connection using SQLAlchemy with the asyncpg driver.

However, I regularly get a server error because the connection between the app and PostgreSQL has been unexpectedly closed. It usually happens after a few minutes (~
15-30 minutes) of inactivity. Here is the error in details:

InterfaceError: connection is closed
  File "sqlalchemy/dialects/postgresql/asyncpg.py", line 709, in _start_transaction
    self._transaction = self._connection.transaction(
  File "asyncpg/connection.py", line 275, in transaction
    self._check_open()
  File "asyncpg/connection.py", line 1400, in _check_open
    raise exceptions.InterfaceError('connection is closed')

Under the hood, SQLAlchemy maintains a pool of connections and grab one of them when it needs to perform queries. So my guess is that Fly.io cuts the connection “hardly” after a moment, which provokes the error when we try to reuse it after.

  • Am I correct to assume this has to do with a “long-lived connection” problem?
  • Is there a documented time limit for a connection to remain open on Fly? If so, it would help me to configure SQLAlchemy so that it recycles connection after this delay.

My app is fief-server and my PostgreSQL cluster is fief-server-db.

Thank you for your help :slightly_smiling_face:

Best!

@fvoron Hey there,

I have not used SQLAlchemy before, but you could try setting keepalive_idle. Our proxy will kill idle client connections after 30 minutes or so, so that’s probably what you’re running into.

Documentation:

1 Like

Hi @shaun!

Thank you for the suggestion! Unfortunately, it seems that asyncpg doesn’t handle this parameter.

After digging up a bit, I found that there is a similar parameter server-side, tcp_keepalives_idle. If I’m looking at the value on the PostgreSQL server, it’s set at 7200 seconds (the default):

postgres=# SELECT name, setting, unit FROM pg_settings WHERE name='tcp_keepalives_idle';
        name         | setting | unit 
---------------------+---------+------
 tcp_keepalives_idle | 7200    | s

From what I understand, PostgreSQL allows the client to override some of its settings. So, this parameter can be set in the client code through the server_settings parameter. Like this:

engine = create_async_engine(
    database_url,
    connect_args={"server_settings": {"tcp_keepalives_idle": "600"}},
)

This is not very well documented in SQLAlchemy, here is the only reference that gave me an hint about this: PostgreSQL — SQLAlchemy 1.4 Documentation

I’ve set it to 10 minutes (600 seconds)… But this doesn’t solve the issue. I’m not 100% sure the parameter is actually taken into account by PostgreSQL.

I also tried to set it directly on the cluster configuration thanks to the guidance found in this topic, by using the stolonctl tool. Unfortunately, it doesn’t solve the issue neither :cry:

I’ve tried something else: use the pool_recycle argument of SQLAlchemy. Basically, it discards connections that are older than a defined threshold.

engine = create_async_engine(
    database_url, pool_recycle=1800
)

I’ve set it at 30 minutes (1800 seconds) to match the limit of the proxy and it seems to solve the issue.

However, it sounds a bit sub-optimal as we may not be able to keep a connection live for more than 30 minutes. It should do it for now, but I would be curious to explore other alternatives if someone has ideas.

I’ve set it to 10 minutes (600 seconds)… But this doesn’t solve the issue. I’m not 100% sure the parameter is actually taken into account by PostgreSQL.

Yeah, in this case, it’s Stolon who manages the PG configuration. Any PG settings that are set directly will be overwritten by Stolon. In any case, you can tweak any of the server-side settings using the following strategy:

Note: We are in the process of making this easier.

# SSH into your Postgres app.  
fly ssh console --app <app-name>

# Export the some stolonctl specific environment variables
export $(cat /data/.env | xargs)

# Update stolon spec
stolonctl update --patch '{"pgParameters": { "tcp_keepalives_idle": "<seconds>"}}'

1 Like

Yes, that’s what I did ultimately. The parameter was correctly set, but it didn’t solve the issue. Maybe the OS doesn’t send the TCP message or the proxy ignores it, I’m not sure how I could debug that.

Hi! Sorry to hijack this thread, but I believe I’m having the same issue, with a slightly different tech set up. I’m running the good_job for background processing on a Rails application, and I’m having a hard connection cut off from the server every 30 minutes too.

PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

I reported it to the gem issue tracker (ActiveRecord::StatementInvalid: PG::ConnectionBad: PQsocket() can't get socket descriptor every 30 minutes aprox. · Issue #579 · bensheldon/good_job · GitHub) but it seems that this problem would be on the server side from what I’m reading here! We’re seeing this exactly every 30 minutes or so

ActiveRecord’s configuration will by default set an idle_timeout of 5 minutes("number of seconds that a connection will be kept unused in the pool before it is automatically disconnected (default 300 seconds). Set this to zero to keep connections forever." ActiveRecord::ConnectionAdapters::ConnectionPool), which is set in this app:

irb(main):006:0> ActiveRecord::Base.connection_pool.db_config.idle_timeout
=> 300.0

but I’m still getting this error message every 30 mins or so. Is there any other setting in postgres that I could tweak to fix this?

In my case, the equivalent parameter, pool_recycle solved the issue by making the client proactively close the connection before the proxy does.

I’m not a Ruby expert but a first step could be to make sure that this parameter is correctly taken into account by your worker.