Postgres client causing deadlock (Too many open files) - SOLVED

Hi!

I run an application based on FastAPI + SQLAlchemy (Postgres with asyncpg). Over a certain amount of
time, about 2 days, the application soft locks due to the OS blocking new connections to the Postgres server. I’ve tried to remedy this without success. Current fix is to restart the application every 2 days.

This does not happen locally or on our last cloud provider. I expect there are some subtle differences in
how you have ulimit configured compared to my desktop or e.g. Google/DO.

Here’s the entire stacktrace of when this error happens. After this, all we get is Error 1002: No suitable (healthy) instance found to handle request

Thank you!

 2022-01-16T06:15:44.079 app[e5f6ca6d] fra [info] ERROR:uvicorn.error:Exception in ASGI application

2022-01-16T06:15:44.079 app[e5f6ca6d] fra [info] Traceback (most recent call last):

2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/code/./api/event.py", line 35, in get_all_events
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/code/./crud/event.py", line 32, in get_all_events
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/ext/asyncio/session.py", line 143, in execute
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 127, in greenlet_spawn
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1688, in execute
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1529, in _connection_for_bind
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/future/engine.py", line 419, in connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3166, in connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3245, in raw_connection
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3212, in _wrap_pool_connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 307, in connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 767, in _checkout
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 430, in checkout
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 427, in checkout
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 580, in get_connection
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 611, in __connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 605, in __connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 578, in connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 584, in connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 747, in connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 69, in await_only
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 122, in greenlet_spawn
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/asyncpg/connection.py", line 2045, in connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/asyncpg/connect_utils.py", line 790, in _connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/asyncpg/connect_utils.py", line 776, in _connect
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/asyncpg/connect_utils.py", line 676, in _connect_addr
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/asyncpg/connect_utils.py", line 720, in __connect_addr
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/asyncpg/compat.py", line 66, in wait_for
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "/usr/local/lib/python3.9/site-packages/asyncpg/connect_utils.py", line 586, in _create_ssl_connection
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] File "uvloop/loop.pyx", line 1956, in create_connection
2022-01-16T06:15:44.080 app[e5f6ca6d] fra [info] OSError: [Errno 24] Too many open files 

Are you using a Dockerfile? This thread talks about the default setting and how you can overcome it with a custom script at startup.

If you’re not using Docker, you could do something similar with a wrapper for your start command.

2 Likes

I’ll try this and get back to you. Thanks!

This has helped. It seems there’s something in my backend not closing file descriptors, as I don’t expect to be at 5000 without much traffic.
I’ll keep an eye out if this is not enough and see if I’m leaking somewhere.

root@9dd41811:/# cat /proc/510/limits | grep 'Max open files'
Max open files            65536                65536                files     
root@9dd41811:/# ls -l /proc/510/fd | wc -l
5605

EDIT:
Actually!
It does seem like uvicorn or FastAPI is not closing connections properly. I have a few thousand requests stuck in CLOSE-WAIT…

root@9dd41811:/# ss | grep CLOSE-WAIT | wc -l
4558
1 Like

Which address is the target of the close-wait connections?

I figured out the issue. The discord.py bot was not able to keep the WS connection open, so it would die for unknown reasons and reconnect with a bad back-off strategy. This meant it would keep the connections open forever, waiting for Discord to return data on that connection. Error:

INFO:discord.gateway:Websocket closed with 1000, cannot reconnect.
ERROR:discord.client:Attempting a reconnect in 2.43s

Address: 162.159.136.234:https

Turning off the bot made the open sockets more reasonable, aka less than 10 ports incl. postgres and SSH. Any suggestions to why these connections would fail? That IP belongs to Cloudflare it seems.

One step closer to resolution: This only seems to be happen if I use https://www.uvicorn.org/ instead of Hyperloop.

Resolution: Regression in uvicorn 0.14.

Reason this might happen is because Fly’s health checks might send an empty TCP request and ends with a CLOSE_WAIT state. + the problem with uvicorn.

SOLVED
If running with TCP checks in Fly, the uvicorn + uvloop bug linked above triggers and never closes the connections. This does not happen with HTTP checks as they are handled properly.
Solution is to use HTTP checks, or to use the h11 setting with uvicorn + uvloop.