Fly Proxy Dropping Rapid Connection: close Requests With Empty 400s

This started happening around April 22 for us.

Summary

Fly Proxy intermittently returns an empty 400 Bad Request before the request ever reaches the app.

This reproduces on both prod and staging Fly apps when rapid Node/undici requests use Connection: close. The same requests succeed with keep-alive or with a 5-second delay between requests.

Apps

prod app: findgood-work-f4fb
staging app: findgood-work-f4fb-staging

Endpoint

GET /hq/service-titan-proxy/jpm/v2/tenant/1053411235/jobs/316870122/notes?page=1&pageSize=5

Prod host:

https://hq.hepisontheway.com

Staging host:

https://hep.staging-findgood.work

Client Conditions

Fails with Node fetch/undici when sending:

Connection: close
flyio-debug: doit
Authorization: Bearer <token>
Accept: application/json

Does not fail with:

Connection: keep-alive / default Node fetch pooling

Does not fail when adding a 5-second delay between close-style requests.
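
For reference, here is a condensed sketch of the repro loop. It uses node:https rather than our actual fetch/undici script so the Connection: close header goes out exactly as written; TOKEN is a placeholder env var and the loop is run as an ES module.

// Condensed repro sketch: rapid GETs with Connection: close via node:https.
// Run as an ES module (e.g. repro.mjs). Set DELAY_MS=5000 for the delayed case.
import https from 'node:https';

const HOST = 'hep.staging-findgood.work';
const PATH = '/hq/service-titan-proxy/jpm/v2/tenant/1053411235/jobs/316870122/notes?page=1&pageSize=5';
const DELAY_MS = Number(process.env.DELAY_MS ?? 0);
const ATTEMPTS = 50;

const attempt = () =>
  new Promise((resolve, reject) => {
    const req = https.request(
      {
        host: HOST,
        path: PATH,
        method: 'GET',
        agent: false, // no connection reuse: a fresh socket per request
        headers: {
          Connection: 'close',
          'flyio-debug': 'doit',
          Authorization: `Bearer ${process.env.TOKEN}`,
          Accept: 'application/json',
        },
      },
      (res) => {
        let body = '';
        res.on('data', (chunk) => (body += chunk));
        res.on('end', () => resolve({ status: res.statusCode, bodyLength: body.length }));
      }
    );
    req.on('error', reject);
    req.end();
  });

for (let i = 0; i < ATTEMPTS; i++) {
  console.log(await attempt());
  if (DELAY_MS) await new Promise((r) => setTimeout(r, DELAY_MS));
}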

Observed Staging Result

50 rapid attempts against staging:

200, 400, 200, 400, 200, 400, 200, 400, 200, 400,
200, 400, 200, 400, 200, 400, 200, 400, 200, 400,
200, 400, 200, 400, 200, 400, 200, 400, 200, 400,
200, 400, 200, 400, 200, 400, 200, 400, 200, 400,
200, 400, 200, 400, 200, 400, 200, 400, 200, 400

Same staging test with DELAY_MS=5000:

200 x 12

400 Response

status: 400
statusText: Bad Request
bodyLength: 0
content-type: null
server: Fly/9f7e98291c (2026-04-30)
via: 1.1 fly.io
duration: ~90-130ms

200 Response

status: 200
bodyLength: 4429
content-type: application/json; charset=utf-8
duration: ~250-600ms

Staging Failing Request IDs

01KQT0R3N8A7CHKKBRVRNK53Q1-dfw
01KQT0R453YQ70V7BZ3RRFRWRB-dfw
01KQT0R4P9TABQ9TASE3V7QRZ6-dfw
01KQT0R565W6F0VT1Y68V5DQFJ-dfw
01KQT0R5KRCFYNVPWTR62333SR-dfw
01KQT0RDYKY8DGREJ248TJTF7H-dfw

Prod Failing Request IDs

01KQT0AD4DSSJY750QMK085FRD-dfw
01KQT0ADFXETNW9WFZV3FPT5RA-dfw
01KQT0ADTRW7Z3H827ZDZ354P5-dfw
01KQT0AEE25S4JR10V250EGVM1-dfw
01KQT0AEGS143QZG9T018FF9C2-dfw
01KQT0AEWB607MD0HZ4EB03CNT-dfw
01KQT0AHHGFJSJNMGZ7M8E8DG6-dfw
01KQT0AJDCSAZ5RCVG8EBQQX7C-dfw

Example Staging flyio-debug for Failing 400

{
  "n": "edge-cf-dfw1-2432",
  "nr": "dfw",
  "ra": "162.81.188.54",
  "rf": "Verbatim",
  "sr": "dfw",
  "sdc": "dfw1",
  "sid": "32872dd3f70438",
  "st": 0,
  "nrtt": 0,
  "bn": "worker-lsh-dfw1-f574",
  "mhn": null,
  "mrtt": null
}

Example Prod flyio-debug for Failing 400

{
  "n": "edge-cf-dfw1-9a88",
  "nr": "dfw",
  "ra": "162.81.188.54",
  "rf": "Verbatim",
  "sr": "dfw",
  "sdc": "dfw1",
  "sid": "2872470a143078",
  "st": 0,
  "nrtt": 0,
  "bn": "worker-dp-dfw1-fe80",
  "mhn": null,
  "mrtt": null
}

App Log Evidence

fly logs -a findgood-work-f4fb --no-tail shows only 200 morgan entries for the endpoint during the repro window. The 400 attempts do not appear in app logs.

This suggests Fly selects a machine (sid present) but returns 400 before the request reaches Express/morgan.

Non-Fly Comparisons

Local app proxy:

http://hep.localhost.localhost:3000/hq/service-titan-proxy/...

Results:

100/100 requests returned 200 with Connection: close

Direct requests to the upstream ServiceTitan API also did not reproduce the issue.

Question for Fly

Why does Fly Proxy return an empty 400 for rapid HTTP/1.1 close-style Node/undici requests after selecting a machine, while the app never logs the failed request?

Is this a known Fly Proxy connection teardown/reuse issue around Connection: close?

Hey, not a Fly engineer, but the mhn: null in your debug output is interesting: the proxy selected a machine but never routed to it. It looks like Connection: close is creating a teardown window that rapid requests keep falling into, which would explain the perfect alternating pattern.

Forcing keep-alive via undici's Pool should stop it in prod for now, something like the sketch below. And with those request IDs you shared, it's worth emailing support@fly.io directly; they should be able to trace exactly what the proxy did at each of those points and whether something changed around April 22.
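
Something like this as a starting point (rough sketch, not tested against your app; the connection count and keep-alive timeout are just guesses to tune):

// Rough sketch: send everything through one undici Pool so sockets are
// reused instead of being closed after each request.
import { Pool } from 'undici';

const pool = new Pool('https://hq.hepisontheway.com', {
  connections: 4,           // cap concurrent sockets to the app
  keepAliveTimeout: 30_000, // keep idle sockets alive between bursts
});

async function getJobNotes(path, token) {
  const { statusCode, body } = await pool.request({
    path,
    method: 'GET',
    headers: { Authorization: `Bearer ${token}`, Accept: 'application/json' },
  });
  if (statusCode !== 200) throw new Error(`upstream returned ${statusCode}`);
  return body.json();
}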

Thanks, yeah, that's our current workaround. I'll email support too, didn't realize that's the proper channel. Thanks again.

One note for the future: mhn: null is fine. This just means there was no multihop node the proxy routed through, as the edge and worker hosts were in the same region.