Silent TCP connection drops to managed Postgres cause indefinite hangs

We’re running a Rails app with SolidQueue (PostgreSQL-backed job queue) on Fly.io in the Sydney region. Our worker process maintains a persistent connection to managed Postgres (direct endpoint, not PgBouncer) and polls every 0.1s with SELECT … FOR UPDATE SKIP LOCKED.

Every 12-24 hours, one of these queries hangs forever. The TCP connection silently dies — no RST, no FIN — and our app blocks on recv() indefinitely.

How we know the connection is silently dead:

  1. statement_timeout: 10s is set and verified on the connection (SHOW statement_timeout returns 10s). Yet the query hangs for 10+ minutes. This means the query never reached PostgreSQL — PG doesn’t know there’s a query to time out.
  2. TCP keepalives (keepalives_idle=10, keepalives_interval=5, keepalives_count=3) didn’t detect it, most likely because the socket is active (mid-query), not idle: while sent data is still unacknowledged, Linux keeps retransmitting it and keepalive probes don’t fire.
  3. Other connections in the same process to the same Postgres instance work fine simultaneously — it’s a single socket that dies.
  4. Manual ReadyExecution.claim(…) from a Rails console (new connection) works immediately — Postgres is healthy.
  5. Strongly correlated with query frequency: 0.1s polling (10 queries/sec) hangs within hours; 1s polling is stable for days. Higher frequency = more chances to be mid-query when the connection drops.
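
To put rough numbers on points 2 and 5, here is a minimal Ruby sketch. It assumes Linux-style keepalive behaviour, a poller that sleeps for the polling interval between queries, and a hypothetical 5 ms query round trip (ROUND_TRIP_S is illustrative, not a measured value from this app). Both helper names are made up for this example.

```ruby
# Approximate time for TCP keepalives to declare an *idle* connection dead:
# wait `idle` seconds, then send `count` probes spaced `interval` seconds apart.
def keepalive_detection_seconds(idle:, interval:, count:)
  idle + interval * count
end

# Fraction of wall-clock time a poller has a query in flight, assuming it
# sleeps poll_interval_s between queries that each take round_trip_s.
def in_flight_fraction(poll_interval_s:, round_trip_s:)
  round_trip_s / (poll_interval_s + round_trip_s)
end

ROUND_TRIP_S = 0.005 # hypothetical round trip, for illustration only

detection = keepalive_detection_seconds(idle: 10, interval: 5, count: 3)
fast = in_flight_fraction(poll_interval_s: 0.1, round_trip_s: ROUND_TRIP_S)
slow = in_flight_fraction(poll_interval_s: 1.0, round_trip_s: ROUND_TRIP_S)

puts "idle-connection detection window: ~#{detection}s"
puts "in-flight exposure, 0.1s vs 1s polling: #{(fast / slow).round(1)}x"
```

Under these assumptions, an idle dead connection would be noticed in roughly 25 seconds, while a connection that dies mid-query gets no such bound from keepalives alone. The exposure ratio is only a crude model of why faster polling hits the hang sooner.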

What we’ve deployed as a workaround: tcp_user_timeout: 15000 on the connection, which should cap the hang at 15s. But the underlying issue — Fly’s network layer silently killing TCP connections to managed Postgres — remains.
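
For anyone wanting to reproduce the workaround, here is a sketch of how these settings map onto libpq connection parameters. The parameter names (keepalives, keepalives_idle, keepalives_interval, keepalives_count, tcp_user_timeout) are real libpq options; the helper names and the host and dbname values are hypothetical. tcp_user_timeout is in milliseconds and needs libpq 12 or newer on a platform that supports TCP_USER_TIMEOUT (such as Linux).

```ruby
# Timeout-related libpq parameters from this post.
LIBPQ_TIMEOUT_PARAMS = {
  keepalives: 1,
  keepalives_idle: 10,      # seconds of idleness before the first probe
  keepalives_interval: 5,   # seconds between probes
  keepalives_count: 3,      # unanswered probes before giving up
  tcp_user_timeout: 15_000  # milliseconds unacknowledged data may stay in flight
}.freeze

# Hypothetical helper: quote a value for a libpq keyword/value string,
# escaping backslashes and single quotes as libpq requires.
def quote_conninfo_value(value)
  escaped = value.to_s.gsub(/[\\']/) { |ch| "\\#{ch}" }
  "'#{escaped}'"
end

# Hypothetical helper: merge base settings with the timeout parameters and
# render them as a single conninfo string.
def conninfo(base, extra = LIBPQ_TIMEOUT_PARAMS)
  base.merge(extra).map { |key, value| "#{key}=#{quote_conninfo_value(value)}" }.join(" ")
end

# Placeholder host and database name, not the real endpoint.
puts conninfo({ host: "db.example.internal", dbname: "app" })
```

The resulting string can be passed to PG.connect from the pg gem. In a Rails app the same keys can instead go into the database configuration, which, as far as I know, the PostgreSQL adapter forwards to libpq.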

Environment: Sydney region, managed Postgres (direct endpoint direct.*.flympg.net), single-machine app + worker topology.

Is this a known issue with Fly’s internal networking? Is there anything we can configure on the Postgres or machine side to prevent these silent drops?

(an LLM wrote this but I’ve verified it)

Hey @ghiculescu

Is this a known issue with Fly’s internal networking?

No.

Can you share your app name and a few example timestamps when a connection got stuck? I’m gonna take a closer look.

If you don’t want to share your app name publicly, feel free to send via email to pavel fly.io.