Problems with running APIs that call APIs also hosted in the same region of Fly.dev

Hi Fly.io Support,

I’m running into a persistent routing issue between two apps in the same organisation (webhouse) and would appreciate
your help diagnosing it.

Setup

  • webhouse-cronjobs — a cron scheduler app, region: arn
  • webhouse-whop — a monitoring app, region: arn, custom domain whop.webhouse.net

Both apps are in the webhouse org, both deployed to Stockholm (arn).

The problem

When webhouse-cronjobs makes outbound HTTP POST requests to https://whop.webhouse.net, approximately 50% of requests
fail with TypeError: fetch failed in 94–221ms. The other 50% succeed normally.
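
A note for anyone debugging a similar `fetch failed`: Node's built-in fetch generally attaches the underlying network error (DNS failure, connection reset, TLS error) as the `cause` of the `TypeError`, so logging only the message hides the useful part. Below is a minimal sketch, assuming Node 18+ with the global `fetch`. The names `describeError` and `postJob` are hypothetical and not taken from the apps above.

```typescript
// Hypothetical helper: flatten an error and its nested `cause` chain into one line.
function describeError(err: unknown): string {
  const parts: string[] = [];
  let current: unknown = err;
  // Walk at most a few levels so a cyclic cause chain cannot loop forever.
  for (let depth = 0; depth < 5 && current instanceof Error; depth++) {
    const code = (current as { code?: unknown }).code;
    parts.push(
      typeof code === "string" ? `${current.message} [${code}]` : current.message,
    );
    current = (current as { cause?: unknown }).cause;
  }
  return parts.length > 0 ? parts.join(" <- ") : String(err);
}

// Hypothetical wrapper around the scheduler's POST call.
async function postJob(url: string, body: unknown): Promise<Response> {
  try {
    return await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
    });
  } catch (err) {
    // Log the whole cause chain, then rethrow so existing retry logic still runs.
    console.error(`POST ${url} failed: ${describeError(err)}`);
    throw err;
  }
}
```

An error code such as `ENOTFOUND` in the logged chain would point at name resolution rather than at proxy routing.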

In webhouse-whop’s logs I can see the corresponding error on the proxy side:

error.message="could not complete HTTP request to instance: client error (SendRequest)"
proxy fra request.url="/"

The FRA (Frankfurt) edge proxy is handling the request but failing to forward it to the ARN machine. Since both apps
are in the same org and same region, I would expect traffic to route ARN→ARN directly.

What we tried

  1. fly-prefer-region: arn header on all outbound requests — no reliable improvement, still ~50% failure rate
  2. Dedicated IPv4 (37.16.16.161) replacing the shared IP, plus direct DNS A-records (no CNAME chain) — same failure
    rate
  3. Retry logic (3 retries, 5s fixed delay) — all 3 retries fail on the same request, so the underlying routing issue
    persists across retries
  4. Flycast (http://webhouse-whop.flycast) — allocated a private IPv6 (fdaa:2c:438c:0:1::3), DNS resolves correctly
    from the cronjobs machine, but the connection fails with an unexpected EOF/SSL error. Our app uses [http_service]
    with force_https = true — we suspect this may interfere with flycast even on the private network
  5. .internal hostname on port 3000 — connection refused, likely because Next.js binds to IPv4 only and the .internal
    address is IPv6
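
On item 5: if the server really does listen on an IPv4 address only, connections to the IPv6 `.internal` address will be refused. A minimal sketch of a plain Node HTTP server bound to the IPv6 wildcard `::` follows, which on most systems also accepts IPv4. This is an illustration with hypothetical names (`listenOptions`, `startServer`), not the Next.js setup from the post; how the bind host is passed to Next.js depends on how the app is started.

```typescript
import { createServer } from "node:http";
import { isIPv6 } from "node:net";

// "::" is the IPv6 wildcard address. With ipv6Only left false, most systems
// also accept IPv4 connections on the same socket (dual-stack).
function listenOptions(port: number): { port: number; host: string; ipv6Only: boolean } {
  return { port, host: "::", ipv6Only: false };
}

// True when the address a client will dial is IPv6, i.e. when an
// IPv4-only listener (0.0.0.0 or 127.0.0.1) would refuse the connection.
function needsIpv6Listener(targetAddress: string): boolean {
  return isIPv6(targetAddress);
}

// Hypothetical server factory; not invoked here so the snippet has no side effects.
function startServer(port: number) {
  const server = createServer((_req, res) => {
    res.writeHead(200, { "content-type": "text/plain" });
    res.end("ok\n");
  });
  server.listen(listenOptions(port), () => {
    console.log(`listening on [::]:${port}`);
  });
  return server;
}
```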

Current workaround

We moved from an external scheduler to node-cron running inside webhouse-whop itself, calling http://127.0.0.1:3000
via loopback. This works, but it means we can no longer use webhouse-cronjobs to trigger jobs on webhouse-whop, which
defeats the purpose of having a dedicated scheduler.

We also had to stop HTTP health-probing our other Fly apps (e.g., webhouse-whapi) from webhouse-whop because those
probes intermittently fail with the same FRA routing issue — even though the apps are accessible from outside Fly
without any problems.

Questions

  1. Why does traffic between two apps in the same org and same region (arn) route through the FRA edge proxy at all?
  2. Is there a supported way to make Fly apps in the same org communicate reliably — either via flycast or another
    private networking mechanism — when using [http_service] with force_https = true?
  3. Is the flycast SSL issue a known limitation with [http_service]? Would switching to [[services]] resolve it?

Happy to provide full logs, app names, or any other details. This is affecting our production monitoring
infrastructure.

Thanks,
Christian Broberg
WebHouse

$ dig whop.webhouse.net
;; QUESTION SECTION:
;whop.webhouse.net.		IN	A
;; ANSWER SECTION:
whop.webhouse.net.	251	IN	CNAME	webhouse-whop.fly.dev.webhouse.net.

So whop.webhouse.net is a CNAME. Let’s see what it’s pointing to.

$ dig webhouse-whop.fly.dev.webhouse.net

;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;webhouse-whop.fly.dev.webhouse.net. IN	A

It doesn’t resolve at all. So the real question is: how does this work 50% of the time? It should not work at all.
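
The same check can be scripted from the machine that makes the requests. This is a rough sketch, assuming Node's `resolveCname` from `node:dns/promises`. `followCnames` and `printChain` are hypothetical names, and the resolver is passed in so the logic can be exercised without network access.

```typescript
import { resolveCname } from "node:dns/promises";

type CnameResolver = (name: string) => Promise<string[]>;

// Follow CNAME records starting at `name` until a name has no CNAME
// record (or the lookup fails), or until `maxHops` is reached.
async function followCnames(
  name: string,
  resolve: CnameResolver,
  maxHops = 8,
): Promise<string[]> {
  const chain = [name];
  let current = name;
  for (let hop = 0; hop < maxHops; hop++) {
    // A failed lookup (e.g. no CNAME data) is treated as the end of the chain.
    const targets = await resolve(current).catch(() => [] as string[]);
    if (targets.length === 0) break;
    current = targets[0];
    chain.push(current);
  }
  return chain;
}

// Hypothetical entry point using the real resolver; not invoked here.
async function printChain(name: string): Promise<void> {
  const chain = await followCnames(name, resolveCname);
  console.log(chain.join(" -> "));
}
```

A chain whose last hop ends in the zone's own name, as in the dig output above, is a hint that the CNAME target was entered as a relative name.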

Fairly sure whop.webhouse.net should point to webhouse-whop.fly.dev, not to webhouse-whop.fly.dev.webhouse.net. If this were an old-school BIND zone file, I’d say you’re missing a period at the end of the CNAME record.

whop IN CNAME webhouse-whop.fly.dev.   ; Terminating period needed here
                                       ; Otherwise the name will be relative to the domain
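
To make the trailing-period rule concrete, here is a small sketch of how a zone-file name is expanded against the zone origin. `qualify` is a hypothetical helper written for illustration. It mirrors the zone-file convention that a name without a final period is relative to the origin, and it is not code from any DNS server.

```typescript
// Expand a zone-file name to a fully qualified domain name.
// A name ending in "." is already absolute; "@" stands for the origin itself;
// anything else is relative and gets the origin appended.
function qualify(name: string, origin: string): string {
  const absoluteOrigin = origin.endsWith(".") ? origin : `${origin}.`;
  if (name === "@") return absoluteOrigin;
  return name.endsWith(".") ? name : `${name}.${absoluteOrigin}`;
}

// Without the trailing period the target lands inside the zone:
console.log(qualify("webhouse-whop.fly.dev", "webhouse.net."));
// With it, the target is left untouched:
console.log(qualify("webhouse-whop.fly.dev.", "webhouse.net."));
```

The first call yields the non-resolving name from the dig output, the second the intended target.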

BTW, you can also get this to work via .flycast, but you’d need to disable force_https. All force_https does is redirect HTTP requests to the HTTPS URL.
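
One way to confirm that the redirect is what breaks plain-HTTP Flycast calls is to request the URL without following redirects and inspect the response. This is a sketch under assumptions: it relies on Node's fetch exposing the 3xx status and `location` header when `redirect: "manual"` is set, and the names `isHttpsUpgrade` and `probe` are hypothetical.

```typescript
// True when a response is a redirect from an http:// URL to the same host over https://.
function isHttpsUpgrade(
  requestUrl: string,
  status: number,
  location: string | null,
): boolean {
  if (status < 300 || status >= 400 || location === null) return false;
  const from = new URL(requestUrl);
  const to = new URL(location, requestUrl); // resolves relative Location values
  return from.protocol === "http:" && to.protocol === "https:" && from.host === to.host;
}

// Hypothetical probe; not invoked here. Meant to be run from a machine on the
// private network, e.g. probe("http://webhouse-whop.flycast/").
async function probe(url: string): Promise<void> {
  const res = await fetch(url, { redirect: "manual" });
  const location = res.headers.get("location");
  if (isHttpsUpgrade(url, res.status, location)) {
    console.log(`${url} is redirected to HTTPS (${location})`);
  } else {
    console.log(`${url} answered with status ${res.status}`);
  }
}
```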

Cheers,

Daniel

I just took a quick look at this – these errors from fra are unrelated to your app making requests in arn. I think they’re due to some external clients whose requests ended up being blocked by the proxy. We should definitely improve the experience there though.

I was going to ask for more details here but @roadmr has pointed out this is likely a DNS issue above.

The recommended solution here is to not host external and internal services in the same app. A Flycast address acts just like a regular IP address and there is currently no way to route them differently (e.g. only allow a subset of services on them) other than to separate them into different machines (process groups) / apps.
