Users reported intermittent 502 errors and slow page loads in our application. The frontend runs on Vercel (region fra1), the backend API on Fly.io. To isolate the issue, a diagnostic endpoint was added that tests the connection directly from within the Vercel runtime.
Test method
Endpoint: GET /api/conn-test — runs as a Next.js route handler on Vercel (runtime: nodejs, region: fra1).
Each test run makes 6 consecutive fetch requests to https://brandhub-api.fly.dev/health on each of three paths:
- default: standard Node.js fetch (shares the undici connection pool with the proxy)
- ipv4: fresh undici Agent({ connect: { family: 4 } }), no cached connections
- ipv6: fresh undici Agent({ connect: { family: 6 } })
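Each request has an AbortSignal.timeout of 8000ms. The "err": 23 entries in the results correspond to this timeout firing: 23 is the legacy DOMException code for TimeoutError, and those requests all end at ~8000ms.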
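The probe loop described above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the actual endpoint code: the names runProbes, ProbeResult and FetchLike are hypothetical, and the way the error code is extracted is a guess that happens to reproduce the two error shapes seen in the results. The fetch function is injected so that the default path could pass the global fetch, while the ipv4 and ipv6 paths could pass a wrapper that supplies a fresh undici Agent as dispatcher.

```typescript
// Hypothetical result shape, modelled on the JSON in the test results.
type ProbeResult = { ok: boolean; status?: number; err?: string | number; ms: number };

// Minimal subset of fetch that the probe needs (assumption: supplied by the caller).
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ ok: boolean; status: number }>;

async function runProbes(
  fetchFn: FetchLike,
  url: string,
  count = 6,
  timeoutMs = 8000,
): Promise<ProbeResult[]> {
  const results: ProbeResult[] = [];
  // Requests run one after another, never in parallel.
  for (let i = 0; i < count; i++) {
    const start = Date.now();
    try {
      const res = await fetchFn(url, { signal: AbortSignal.timeout(timeoutMs) });
      results.push({ ok: res.ok, status: res.status, ms: Date.now() - start });
    } catch (e) {
      // Assumption: a connect failure carries its code on `cause`,
      // while an abort timeout is a DOMException with a numeric `code`.
      const err = e as { code?: string | number; name?: string; cause?: { code?: string } };
      results.push({
        ok: false,
        err: err.cause?.code ?? err.code ?? err.name ?? "unknown",
        ms: Date.now() - start,
      });
    }
  }
  return results;
}
```

Injecting the fetch function keeps the loop identical across the three paths, so differences in the results can be attributed to the connection setup rather than to the measurement code.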
Test results
Measurement 1 — baseline (machine in region ams)
"default": [
{ "ok": true, "status": 200, "ms": 21 },
{ "ok": false, "err": "UND_ERR_CONNECT_TIMEOUT", "ms": 5471 },
{ "ok": true, "status": 200, "ms": 825 },
{ "ok": true, "status": 200, "ms": 45 },
{ "ok": true, "status": 200, "ms": 15 },
{ "ok": true, "status": 200, "ms": 13 }
],
"ipv4": [
{ "ok": false, "err": 23, "ms": 8001 },
{ "ok": false, "err": 23, "ms": 8001 },
{ "ok": false, "err": 23, "ms": 8000 },
{ "ok": true, "status": 200, "ms": 49 },
{ "ok": true, "status": 200, "ms": 19 },
{ "ok": true, "status": 200, "ms": 14 }
]
Measurement 2 — after tightening health checks (interval 15s, unhealthy_threshold=1)
"ipv4": [
{ "ok": true, "status": 200, "ms": 28 },
{ "ok": false, "err": 23, "ms": 8001 },
{ "ok": true, "status": 200, "ms": 27 },
{ "ok": false, "err": 23, "ms": 8001 },
{ "ok": true, "status": 200, "ms": 36 },
{ "ok": false, "err": 23, "ms": 8001 }
]
Pattern: strictly alternating ok/timeout. Confirms that 2 proxy nodes are in rotation, one of which cannot reach the machine.
Measurement 3 — after machine migration to region fra (same region as Vercel fra1)
"default": [
{ "ok": false, "err": "UND_ERR_CONNECT_TIMEOUT", "ms": 5494 },
{ "ok": true, "status": 200, "ms": 50 },
{ "ok": true, "status": 200, "ms": 12 },
{ "ok": true, "status": 200, "ms": 7 },
{ "ok": true, "status": 200, "ms": 7 },
{ "ok": true, "status": 200, "ms": 6 }
],
"ipv4": [
{ "ok": true, "status": 200, "ms": 17 },
{ "ok": false, "err": 23, "ms": 8002 },
{ "ok": true, "status": 200, "ms": 7 },
{ "ok": false, "err": 23, "ms": 8002 },
{ "ok": true, "status": 200, "ms": 7 },
{ "ok": true, "status": 200, "ms": 76 }
]
Measurement 4 — follow-up (same setup)
"default": [
{ "ok": false, "err": "UND_ERR_CONNECT_TIMEOUT", "ms": 5508 },
{ "ok": false, "err": "UND_ERR_CONNECT_TIMEOUT", "ms": 5497 },
{ "ok": true, "status": 200, "ms": 75 },
{ "ok": true, "status": 200, "ms": 19 },
{ "ok": true, "status": 200, "ms": 7 },
{ "ok": true, "status": 200, "ms": 14 }
],
"ipv4": [
{ "ok": false, "err": 23, "ms": 8001 },
{ "ok": false, "err": 23, "ms": 8001 },
{ "ok": true, "status": 200, "ms": 16 },
{ "ok": false, "err": 23, "ms": 7999 },
{ "ok": true, "status": 200, "ms": 7 },
{ "ok": true, "status": 200, "ms": 14 }
]
Analysis
DNS
A: 168.220.92.32 (dedicated Fly IPv4, global anycast)
AAAA: ENODATA (no public IPv6 address)
IPv6
All attempts fail immediately with EBUSY (<5ms). Vercel fra1 has no working IPv6 stack towards Fly. Not a factor in this issue.
Pattern
The ipv4 test uses a fresh undici Agent on every run with no cached connections. This simulates cold-start serverless function invocations, where no persistent connection pool exists between invocations.
The strictly alternating ok/timeout pattern (measurement 2) is diagnostic: it confirms that Fly’s Anycast proxy infrastructure has two nodes handling requests for this IP in round-robin rotation. Node A can reach the machine; node B cannot. Every new TCP connection is routed to one of the two nodes in turn.
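The alternation check can be made explicit with a small helper. This is an illustrative sketch; isStrictlyAlternating is a hypothetical name, and the inputs are only the ok flags of the ipv4 results from measurements 2 and 3. As a rough sanity check, if each request instead failed independently with probability 0.5, a strictly alternating run of 6 would occur with probability 2 × 0.5^6 = 1/32, so the pattern is unlikely to be coincidence, although 6 samples remain a small basis.

```typescript
// Hypothetical helper: true if every result flips ok/failed relative to the previous one.
function isStrictlyAlternating(results: { ok: boolean }[]): boolean {
  if (results.length < 2) return false;
  return results.every((r, i) => i === 0 || r.ok !== results[i - 1].ok);
}

// ok flags of the ipv4 results from measurement 2.
const measurement2 = [true, false, true, false, true, false].map((ok) => ({ ok }));
// ok flags of the ipv4 results from measurement 3: the last two requests both succeed.
const measurement3 = [true, false, true, false, true, true].map((ok) => ({ ok }));

console.log(isStrictlyAlternating(measurement2)); // true
console.log(isStrictlyAlternating(measurement3)); // false
```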
The default test shares the undici connection pool and reuses existing TCP connections, which makes the failure rate appear lower (1-2 of 6 requests instead of 2-3 of 6). However, in a serverless environment (Vercel) there is no persistent connection pool between function invocations: every cold start creates new connections, each with the same ~50% chance of hitting the broken proxy node.
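To get a feel for what a ~50% per-connection failure chance means for users, the arithmetic can be sketched as below. The function name pAtLeastOneTimeout is hypothetical, and the model assumes each new connection independently hits the broken node with probability 0.5. That is a simplification: under strict round-robin, connections from a single client are not independent, but traffic from other clients interleaves in practice.

```typescript
// Probability that at least one of `newConnections` fresh connections
// lands on the broken node, assuming independent attempts (simplification).
function pAtLeastOneTimeout(newConnections: number, pBroken = 0.5): number {
  return 1 - Math.pow(1 - pBroken, newConnections);
}

console.log(pAtLeastOneTimeout(1)); // 0.5
console.log(pAtLeastOneTimeout(3)); // 0.875
```

Under this assumption, a page load that opens three fresh backend connections would see at least one timeout in roughly 7 out of 8 cases.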
Effect of the ams → fra region migration
Successful connections are significantly faster (6-12ms vs. 13-45ms). The intermittent timeout pattern persists after migration, which confirms the problem lies in Fly’s proxy layer and is not related to physical distance to the machine.
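The latency improvement can be checked against the raw data with a simple median over successful requests. The sketch below uses the default-path results from measurement 1 (ams) and measurement 3 (fra); medianOkMs is a hypothetical helper name. The median is used so that the single 825ms outlier in measurement 1 does not dominate the comparison.

```typescript
type Result = { ok: boolean; ms: number };

// Median latency of successful requests only; NaN if none succeeded.
function medianOkMs(results: Result[]): number {
  const ms = results.filter((r) => r.ok).map((r) => r.ms).sort((a, b) => a - b);
  if (ms.length === 0) return NaN;
  const mid = Math.floor(ms.length / 2);
  return ms.length % 2 === 1 ? ms[mid] : (ms[mid - 1] + ms[mid]) / 2;
}

// "default" path, measurement 1 (machine in ams).
const amsDefault: Result[] = [
  { ok: true, ms: 21 }, { ok: false, ms: 5471 }, { ok: true, ms: 825 },
  { ok: true, ms: 45 }, { ok: true, ms: 15 }, { ok: true, ms: 13 },
];
// "default" path, measurement 3 (machine in fra).
const fraDefault: Result[] = [
  { ok: false, ms: 5494 }, { ok: true, ms: 50 }, { ok: true, ms: 12 },
  { ok: true, ms: 7 }, { ok: true, ms: 7 }, { ok: true, ms: 6 },
];

console.log(medianOkMs(amsDefault)); // 21
console.log(medianOkMs(fraDefault)); // 7
```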