Intermittent client-side hangs on Managed Postgres

Hi all,

Anyone running a Node.js app on Fly Machines against Managed Postgres (Crunchy, PG 17.4) with the pg library and queries returning large jsonb payloads (~600 KB+ rows)?

We see intermittent hangs at modest load (3 websites sites with large traffic, “Launch” MPG at 60/200 connections, 3% CPU): roughly 10% of requests on queries returning large payloads never resolve, and the client-side timeout fires after 10s. At that moment, the TCP socket is alive, ~950 KB of bytes did arrive on it, but pg.activeQuery stays true forever (no ReadyForQuery parsed). A fresh SELECT 1 from the same pool replies in 3 ms, and PG-side the query completed normally.

Our guess: TCP chunking on Fly 6PN (smaller MTU than AWS VPC) → many more small segments per response → exposes a slow path in the pure-JS pg-protocol parser. But it’s a guess.

Anyone seen this, or know of a workaround beyond “retry on a fresh conn”? Happy to share more diagnostic details if useful.

Thanks!

Quick update for anyone landing here with the same problem.

Spent today reproducing this in a controlled way. TL;DR: it’s not node-postgres or knex.

Reproduced cleanly with raw psql (libpq, C client) on a Fly Machine talking to MPG over 6PN. Pattern: a TCP connection that’s served many queries fine eventually hits the bug on one query and is permanently stuck, never delivers another response. Socket stays “alive” at the kernel level (readable, writable, no FIN/RST). Outer timeout has to kill psql to recover.

Numbers from a controlled run, 3 parallel single-connection psql sessions × 2000 queries each (heavy ~700 KB jsonb response):

AWS RDS (control, public internet): 6000/6000 OK, no hangs
MPG via direct..flympg.net: 1135/6000 — all 3 conns dead at 61, 142, and 773 queries
MPG via pgbouncer..flympg.net: ~3600/6000, conns last ~5x longer before stalling

Things ruled out along the way:

  • node-postgres / knex / Node (libpq psql repros it)
  • tcp_user_timeout=5s (no measurable effect)
  • statement_timeout (never fires; query is “done”
    from PG’s POV, the response
    delivery is what stalls)
  • Payload <100 KB (rarely hits)

Strong correlation with response size, small queries basically never trigger it. Same query against RDS over the public internet from the same Fly Machine: zero failures.

Filed a support ticket with the same data. Will update here if I get anything substantive back.

Anyone else hitting this on heavy-payload queries against MPG? Especially curious if pgbouncer endpoint is enough for your case or if you’ve found something better.

We have been seeing something that sounds similar, and i’m so glad you took the time to create a clean reproduction.

We are trying to import data in a way that was writing ~300k of json per insert and in the middle of the import a connection would throw a 30s timeout, this happened something like 6 out of 7 attempts. We thought ‘maybe 300k is too big’ and halved the payload size. writing about ~150k per insert it failed 1 out of 3 attempts.

We are also trying to write in parallel. We were first using a 4x box, and saw the same results on a 16x box. We are using the direct.flympg.net url because we noticed a major loss in speed when trying to firehose the import through pgbouncer (Ecto, the defacto Elixir db library requires that we use transaction mode, which may be the speed difference)

Happy to help with any more information, but it sounds like you were able to do more spelunking than we have.

Hello, glad to see i’m not the only one :sweat_smile:
I’ve made a quick scripts that queries the database with different sizes, and here are the results :
RDS direct access over internet :

===================================================================
SUMMARY

size(KB) total ok hang err hang% avg-ok-ms
1 300 300 0 0 0.00% 132
10 300 300 0 0 0.00% 135
100 300 300 0 0 0.00% 139
500 300 300 0 0 0.00% 158
1000 300 300 0 0 0.00% 183
2000 300 300 0 0 0.00% 908
5000 300 300 0 0 0.00% 2319

MPG local direct access 6PN

===================================================================
SUMMARY

size(KB) total ok hang err hang% avg-ok-ms
1 300 300 0 0 0.00% 445
10 300 300 0 0 0.00% 1439
100 300 300 0 0 0.00% 1663
500 300 299 1 0 0.33% 1526
1000 300 297 3 0 1.00% 1357
2000 300 297 3 0 1.00% 1585
5000 300 297 3 0 1.00% 2118

Clearly all queries with more than 500kb start failing.

In real life, connections are stuck and permafail once they hang, so those results are not relevent for production. In production, connections are kept, and once they hangs, all following queries in the same opened connection hang too.

I wanted to mitigate this with client timeouts and retry, but the thing is that the client-server socket is bugged so the client itself (node pg, shell pg, etc) is not aware of the query failure (or more precisely, non-completion) and keep waiting indefinitly. So it’s impossible without a timeout, and then our connection pool gets quickly filled with poisoned connections.

I’m very curious about what the support will do about it, cause i’m having heavy bandwith cost from AWS (RDS-FLY) that are not sustainable. So either we find a solution for MPG, or we have to go back to AWS :frowning:

Greetings !

Hi folks,

Thank you for reporting this. We’re currently investigating the issue. (:

For other users affected: it would be helpful if you could open a ticket and provide your cluster ID so we can gather data from PgBouncer and Postgres side.

Are you able to share your reproduction script with sensitive details masked if any? I have been trying to reproduce this with ~1MB records but so far not having any luck.

Hello, you can find the scripts here : GitHub - quiliumio/fly-debug-mpg · GitHub

I think the more relevant is the singleconn.sh

On MPG :

Target : *******
Payload : 700 KB (SELECT repeat(‘x’, 700*1024))
Iter/conn : 2000
Conns : 3 in parallel (each on its own psql session)
PG stmt to: 10000ms (server-side)
Wall cap : 600s per session
Out dir : /tmp/mpg-singleconn-oODqx7

… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)
… conn1=61 conn2=42 conn3=115 (total=218/6000)

Aborted here, all 3 conn are hanging.

On RDS :

Payload : 700 KB (SELECT repeat(‘x’, 700*1024))
Iter/conn : 2000
Conns : 3 in parallel (each on its own psql session)
PG stmt to: 10000ms (server-side)
Wall cap : 600s per session
Out dir : /tmp/mpg-singleconn-genhF1

… conn1=253 conn2=277 conn3=264 (total=794/6000)

… conn1=536 conn2=585 conn3=554 (total=1675/6000)
… conn1=844 conn2=898 conn3=889 (total=2631/6000)
… conn1=1136 conn2=1202 conn3=1193 (total=3531/6000)
… conn1=1443 conn2=1505 conn3=1501 (total=4449/6000)
… conn1=1741 conn2=1821 conn3=1800 (total=5362/6000)

Per-conn results:
{“conn”:2,“iter”:2000,“completed”:2003,“completedPct”:100.2,“errors”:0,“walltimeS”:33,“reason”:“finished”}
{“conn”:3,“iter”:2000,“completed”:2003,“completedPct”:100.2,“errors”:0,“walltimeS”:33,“reason”:“finished”}
{“conn”:1,“iter”:2000,“completed”:2003,“completedPct”:100.2,“errors”:0,“walltimeS”:34,“reason”:“finished”}

Greetings,

Thank you! I did reproduce a hang, currently looking into this.

@Gyorn @dewjer I just rolled out a possible fix for this, can you check if this improved things on your side? I can no longer reproduce this hang myself using my test MPG clusters.

So far so good! Over time we have been tuning various knobs down (cut chunk size in half, halved number of Flow stages) just to reliably import a decade+ of data. I just doubled the number of flow stages and the chunk size (back to more reasonable numbers) and we’re not seeing errors we would have absolutely seen before. More tests coming and I’ll keep turning the pressure up, but if this is really fixed it’s hard to understate how important this fix will be for us. All told, three of us have worked for probably two full weeks just trying to make these data imports stable.

Hello,

I confirm the original issue seems to be fixed, i can no longer reproduce the hangs.

Thank you for your support ! I’ll try it to production env with real trafic and keep you updated.

Greetings,

Hello, unfortunately it worked only a couple hours. Now the hang is back (see my script logs below).

The weird thing is that it seems to happen only from one machine. If i run the script from other machines, it doesn’t hang.

When I « flip » the db pool in production from RDS to MPG, it immediately triggers the « hang bug » and fallback to RDS connection, but only from a specific machine.


Payload : 700 KB (SELECT repeat(‘x’, 700*1024))

Iter/conn : 2000

Conns : 3 in parallel (each on its own psql session)

PG stmt to: 10000ms (server-side)

Wall cap : 600s per session

-–

… conn1=282 conn2=85 conn3=257 (total=624/6000)

… conn1=555 conn2=85 conn3=257 (total=897/6000)

… conn1=555 conn2=85 conn3=257 (total=897/6000)

… conn1=555 conn2=85 conn3=257 (total=897/6000)

… conn1=555 conn2=85 conn3=257 (total=897/6000)


I provided more details in the support ticket.

Greetings,

Just adding ideas to the pot: does specifying a (different) charset at connection time make any difference?

Hi, thanks for the idea! Tried it just now (client_encoding=LATIN1), no difference. Makes sense in retrospect: the payload is SELECT repeat('x', 700*1024), pure ASCII either way.

We’ve narrowed it down quite a bit since my last message:

  • Reproduces 100% on one VM, even with a single connection. Two other VMs same app/region/image complete the same test cleanly.

  • Network path checks pass: SELECT 1 returns in 35 ms, ICMPv6 RTT 0.15 ms, MTU clean, TCP connect OK.

  • Bug only triggers under sustained large-response load. statement_timeout (server-side) does not fire.

  • At hang: ESTAB, Recv-Q=0, Send-Q=0, no FIN/RST, lastsnd/lastack stop advancing.

  • Smoking gun: every stalled session shows snd_wnd=69632 (peer-advertised receive window clamped to exactly 64 KiB + headers), while idle sessions on the same path show 1–2 MB. 100% correlation between the clamp and the eventual hang. That very specific value across sessions points to something per-session on your end rather than client-side.

Greetings,

The single machine was unfortunately landed on a host that missed my deployment last week; this should now be fixed as well. Sorry about that!

Hello, everything is working fine now. Thank you !
Is this a definitive fix for this issue ?

Yes, the fix was definitive last week, just that there were a few physical hosts that missed the rollout. You were unfortunate to hit one of those with one of the machines.

Ok thank you very much for the follow up.