Fly.io connectivity issues

Users reported intermittent 502 errors and slow page loads in our application. The frontend runs on Vercel (region fra1), the backend API on Fly.io. To isolate the issue, we added a diagnostic endpoint that tests the connection directly from within the Vercel runtime.

Test method

Endpoint: GET /api/conn-test — runs as a Next.js route handler on Vercel (runtime: nodejs, region: fra1).

Each test run makes 6 consecutive fetch requests to https://brandhub-api.fly.dev/health on each of three paths:

  1. default — standard Node.js fetch (shares undici connection pool with the proxy)
  2. ipv4 — fresh undici Agent({ connect: { family: 4 } }), no cached connections
  3. ipv6 — fresh undici Agent({ connect: { family: 6 } })

Each request has an AbortSignal.timeout of 8000ms.
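
For readers who want to reproduce the method, here is a minimal sketch of the probe loop, not the actual route handler. The names probe, ProbeResult and FetchLike are hypothetical, and the error-code extraction assumes Node's fetch wraps connection failures in a TypeError whose cause carries the code (for example UND_ERR_CONNECT_TIMEOUT), while an AbortSignal.timeout abort surfaces as a DOMException with legacy code 23.

```typescript
// Hypothetical result shape, modeled on the JSON shown in this post.
type ProbeResult = { ok: boolean; status?: number; err?: string | number; ms: number };

// Minimal fetch-like signature so a pooled or a fresh-Agent fetch can be injected.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ ok: boolean; status: number }>;

async function probe(
  fetchFn: FetchLike,
  url: string,
  attempts = 6,
  timeoutMs = 8000,
): Promise<ProbeResult[]> {
  const results: ProbeResult[] = [];
  for (let i = 0; i < attempts; i++) {
    const started = Date.now();
    try {
      // Requests run one after another, each with its own timeout signal.
      const res = await fetchFn(url, { signal: AbortSignal.timeout(timeoutMs) });
      results.push({ ok: res.ok, status: res.status, ms: Date.now() - started });
    } catch (error) {
      // Assumption: prefer the low-level cause code, then the error's own code, then its name.
      const e = error as { name?: string; code?: string | number; cause?: { code?: string | number } };
      const err = e.cause?.code ?? e.code ?? e.name ?? "unknown";
      results.push({ ok: false, err, ms: Date.now() - started });
    }
  }
  return results;
}

// Usage for the "default" path (shared pool):
//   await probe((u, init) => fetch(u, init), "https://brandhub-api.fly.dev/health");
```

For the ipv4 and ipv6 paths, the injected function would pass a fresh undici Agent({ connect: { family: 4 } }) or Agent({ connect: { family: 6 } }) as the dispatcher option of undici's fetch.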

Test results

Measurement 1 — baseline (machine in region ams)

"default": [
  { "ok": true,  "status": 200, "ms": 21 },
  { "ok": false, "err": "UND_ERR_CONNECT_TIMEOUT", "ms": 5471 },
  { "ok": true,  "status": 200, "ms": 825 },
  { "ok": true,  "status": 200, "ms": 45 },
  { "ok": true,  "status": 200, "ms": 15 },
  { "ok": true,  "status": 200, "ms": 13 }
],
"ipv4": [
  { "ok": false, "err": 23, "ms": 8001 },
  { "ok": false, "err": 23, "ms": 8001 },
  { "ok": false, "err": 23, "ms": 8000 },
  { "ok": true,  "status": 200, "ms": 49 },
  { "ok": true,  "status": 200, "ms": 19 },
  { "ok": true,  "status": 200, "ms": 14 }
]

Measurement 2 — after tightening health checks (interval 15s, unhealthy_threshold=1)

"ipv4": [
  { "ok": true,  "status": 200, "ms": 28 },
  { "ok": false, "err": 23, "ms": 8001 },
  { "ok": true,  "status": 200, "ms": 27 },
  { "ok": false, "err": 23, "ms": 8001 },
  { "ok": true,  "status": 200, "ms": 36 },
  { "ok": false, "err": 23, "ms": 8001 }
]

Pattern: strictly alternating ok/timeout. Confirms that 2 proxy nodes are in rotation, one of which cannot reach the machine.
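
As an illustration of what "strictly alternating" means here, the check below is a small sketch (the helper name isStrictlyAlternating is hypothetical) applied to the ok flags of the ipv4 runs in measurements 2 and 3.

```typescript
type Outcome = { ok: boolean };

// True when every result differs from the one before it (ok, fail, ok, fail, ...).
function isStrictlyAlternating(results: Outcome[]): boolean {
  if (results.length < 2) return false;
  return results.every((r, i) => i === 0 || r.ok !== results[i - 1].ok);
}

// ok flags copied from the ipv4 runs of measurement 2 and measurement 3.
const measurement2: Outcome[] = [true, false, true, false, true, false].map((ok) => ({ ok }));
const measurement3: Outcome[] = [true, false, true, false, true, true].map((ok) => ({ ok }));

console.log(isStrictlyAlternating(measurement2)); // true
console.log(isStrictlyAlternating(measurement3)); // false
```

Only measurement 2 passes this check; the later ipv4 runs still fail intermittently but not in a strict ok/timeout sequence.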

Measurement 3 — after machine migration to region fra (same region as Vercel fra1)

"default": [
  { "ok": false, "err": "UND_ERR_CONNECT_TIMEOUT", "ms": 5494 },
  { "ok": true,  "status": 200, "ms": 50 },
  { "ok": true,  "status": 200, "ms": 12 },
  { "ok": true,  "status": 200, "ms": 7 },
  { "ok": true,  "status": 200, "ms": 7 },
  { "ok": true,  "status": 200, "ms": 6 }
],
"ipv4": [
  { "ok": true,  "status": 200, "ms": 17 },
  { "ok": false, "err": 23, "ms": 8002 },
  { "ok": true,  "status": 200, "ms": 7 },
  { "ok": false, "err": 23, "ms": 8002 },
  { "ok": true,  "status": 200, "ms": 7 },
  { "ok": true,  "status": 200, "ms": 76 }
]

Measurement 4 — follow-up (same setup)

"default": [
  { "ok": false, "err": "UND_ERR_CONNECT_TIMEOUT", "ms": 5508 },
  { "ok": false, "err": "UND_ERR_CONNECT_TIMEOUT", "ms": 5497 },
  { "ok": true,  "status": 200, "ms": 75 },
  { "ok": true,  "status": 200, "ms": 19 },
  { "ok": true,  "status": 200, "ms": 7 },
  { "ok": true,  "status": 200, "ms": 14 }
],
"ipv4": [
  { "ok": false, "err": 23, "ms": 8001 },
  { "ok": false, "err": 23, "ms": 8001 },
  { "ok": true,  "status": 200, "ms": 16 },
  { "ok": false, "err": 23, "ms": 7999 },
  { "ok": true,  "status": 200, "ms": 7 },
  { "ok": true,  "status": 200, "ms": 14 }
]

Analysis

DNS

  • A: 168.220.92.32 (dedicated Fly IPv4, global anycast)
  • AAAA: ENODATA — no public IPv6 address

IPv6

All attempts fail immediately with EBUSY (<5ms). Vercel fra1 has no working IPv6 stack towards Fly. Not a factor in this issue.

Pattern

The ipv4 test uses a fresh undici Agent on every run with no cached connections. This simulates cold-start serverless function invocations, where no persistent connection pool exists between invocations.

The strictly alternating ok/timeout pattern (measurement 2) is diagnostic: it confirms that Fly’s Anycast proxy infrastructure has two nodes handling requests for this IP in round-robin rotation. Node A can reach the machine; node B cannot. Every new TCP connection is routed to one of the two nodes in turn.

The default test shares the undici connection pool and reuses existing TCP connections, which makes the failure rate appear lower (~1/6 instead of ~2-3/6). However, in a serverless environment (Vercel) there is no persistent connection pool between function invocations — every cold start creates new connections, each with the same ~50% chance of hitting the broken proxy node.
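
The rates quoted above can be checked by pooling the failure counts from the measurements in this post. This is a small worked sketch; failureRate is a hypothetical helper and the counts are read off the JSON results (default was recorded in measurements 1, 3 and 4; ipv4 in measurements 1 to 4).

```typescript
// Failed requests per run, 6 requests per run.
const failedPerRun = {
  default: [1, 1, 2], // measurements 1, 3, 4
  ipv4: [3, 3, 2, 3], // measurements 1, 2, 3, 4
};

// Pooled failure rate across all runs of one path.
function failureRate(failed: number[], requestsPerRun = 6): number {
  const totalFailed = failed.reduce((sum, n) => sum + n, 0);
  return totalFailed / (failed.length * requestsPerRun);
}

console.log(failureRate(failedPerRun.default).toFixed(2)); // "0.22" (4 of 18)
console.log(failureRate(failedPerRun.ipv4).toFixed(2)); // "0.46" (11 of 24)
```

With so few samples these numbers are only indicative, but they match the rough 1/6 versus 2-3/6 split described above.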

Effect of the ams → fra region migration

Successful connections are significantly faster (6-12ms vs. 13-45ms). The intermittent timeout pattern persists after migration, which confirms the problem lies in Fly’s proxy layer and is not related to physical distance to the machine.

Hi there!

Is there any way you could increase that 8000ms timeout to over 10 seconds (say 15000ms to be on the safe side) and repeat your test? Let me know whether that timeout was hiding some other kind of error.

Also, is there any way to launch an arbitrary request from the affected frontend to https://debug.fly.dev and share the full output of that? That’ll tell us which Fly.io edge your requests are going through. debug.fly.dev output is non-sensitive so it’s OK to share it in full here - the possible exception is the fly-client-ip, feel free to scrub that if necessary.

Hi,

Here are the results with the 15-second timeout and the debug.fly.dev output.

debug.fly.dev output

=== Headers ===
Host: debug.fly.dev
Fly-Request-Id: 01KVB36PX53QQBW4MRKHFQTWVC-fra
Fly-Client-Ip: [scrubbed]
X-Forwarded-For: [scrubbed], 37.16.21.10
Via: 1.1 fly.io
Fly-Region: fra

Requests are going through the Frankfurt (fra) edge, as expected.
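
In case it is useful for repeating this check, here is a small sketch that pulls the edge-related fields out of a debug.fly.dev header dump like the one above. The helper names are hypothetical, and it assumes the suffix after the last hyphen in Fly-Request-Id is a region code, as the "-fra" in this output suggests.

```typescript
const dump = `=== Headers ===
Host: debug.fly.dev
Fly-Request-Id: 01KVB36PX53QQBW4MRKHFQTWVC-fra
Fly-Client-Ip: [scrubbed]
X-Forwarded-For: [scrubbed], 37.16.21.10
Via: 1.1 fly.io
Fly-Region: fra`;

// Parse "Name: value" lines into a lowercase-keyed record; other lines are skipped.
function parseHeaderDump(text: string): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const idx = line.indexOf(":");
    if (idx <= 0) continue; // e.g. the "=== Headers ===" banner
    headers[line.slice(0, idx).trim().toLowerCase()] = line.slice(idx + 1).trim();
  }
  return headers;
}

function edgeSummary(text: string) {
  const h = parseHeaderDump(text);
  const requestId = h["fly-request-id"] ?? "";
  const hops = (h["x-forwarded-for"] ?? "").split(",").map((s) => s.trim());
  return {
    region: h["fly-region"],
    requestIdSuffix: requestId.slice(requestId.lastIndexOf("-") + 1),
    lastHop: hops[hops.length - 1],
  };
}

console.log(edgeSummary(dump));
// { region: "fra", requestIdSuffix: "fra", lastHop: "37.16.21.10" }
```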

Conn-test results (15s timeout)

"default": [
  { "ok": true,  "status": 200, "ms": 48 },
  { "ok": true,  "status": 200, "ms": 20 },
  { "ok": true,  "status": 200, "ms": 8 },
  { "ok": true,  "status": 200, "ms": 8 },
  { "ok": true,  "status": 200, "ms": 11 },
  { "ok": true,  "status": 200, "ms": 38 }
],
"ipv4": [
  { "ok": false, "err": "ECONNRESET", "ms": 10055 },
  { "ok": false, "err": "ECONNRESET", "ms": 10057 },
  { "ok": true,  "status": 200, "ms": 21 },
  { "ok": true,  "status": 200, "ms": 14 },
  { "ok": true,  "status": 200, "ms": 6 },
  { "ok": true,  "status": 200, "ms": 6 }
]

What the higher timeout revealed

The error was previously masked by our 8s timeout. With 15s, the actual error surfaces: ECONNRESET at ~10 seconds. The connection is being established but then reset before the TLS handshake completes, exactly matching the pattern in community thread #28120 (Intermittent "Client network socket disconnected before secure TLS connection was established" (ECONNRESET) on inbound HTTPS).
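
To spell out why the shorter timeout hid the reset, here is a tiny sketch of the race between the client-side abort and the reset observed at roughly 10 seconds. The function name and the labels "client-timeout" and "server-reset" are hypothetical; the 10000 ms figure is rounded from the 10055 ms and 10057 ms samples above.

```typescript
type Surfaced = { err: "client-timeout" | "server-reset"; ms: number };

// Whichever deadline fires first decides which error the client sees.
function surfacedError(clientTimeoutMs: number, serverResetAfterMs: number): Surfaced {
  return clientTimeoutMs < serverResetAfterMs
    ? { err: "client-timeout", ms: clientTimeoutMs }
    : { err: "server-reset", ms: serverResetAfterMs };
}

console.log(surfacedError(8000, 10000)); // { err: "client-timeout", ms: 8000 }
console.log(surfacedError(15000, 10000)); // { err: "server-reset", ms: 10000 }
```

This is consistent with the earlier runs reporting err 23 at about 8000 ms and the 15s run reporting ECONNRESET at about 10000 ms.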

The default fetch (which reuses the undici connection pool) is now 6/6 successful, while the ipv4 test (fresh Agent, no pool reuse, simulating cold-start serverless invocations) shows 2 ECONNRESET errors consistently on the first two attempts before recovering.

The second IP in X-Forwarded-For is 37.16.21.10, which may help identify the specific edge node involved.

Hope this helps narrow it down. Let us know if you need anything else.

Hi!

Thanks so much for this. There’s likely an edge in fra having trouble fetching certificates and ultimately timing out. The cert fetch timeout is 10 seconds which matches the ECONNRESET times you’re seeing.

I’ll try to update here when it’s fixed but it should also be apparent as you’ll see errors drop off once we fix it. Shouldn’t take long!

Great, thanks again for the quick reply. Looking forward to your update!

Hi, this should be fixed now.