I have an app which runs in IAD and FRA Fly regions. In order to ensure that both instances are healthy and reachable, I have external monitoring which runs from places close to those regions, which should hit the local Fly Edge and then the App in the same region. It compares the x-fly-edge-region and x-fly-app-region headers to check that this is the case. This shows up issues with Fly Edges, or when my instances get shuffled to a backup region etc. I’m happy to say that both are very rare!
For the past month or so (since the FRA outage, but that could be coincidental) the monitoring has been telling me that occasionally a request might hit the Fly Edge in IAD but then get forwarded to my app running in FRA, or vice-versa. There is nothing wrong with the apps in either location as far as I can tell, flyctl says they are happy and handling requests. The latency of the response matches up with the traffic having been tunneled across the Atlantic and back.
This happens in bursts of maybe an hour or two, and its never all requests which get forwarded to the other region, some are still handled locally. I’m wondering if it’s some form of load shedding - if CPU in one region is running hot, are requests sent to apps in other regions? Or is it a bug in the Edge to App routing?
If it helps, these are headers from one such cross-continental response just now:
server Fly/5369a69b (2022-02-23)
date Fri, 25 Feb 2022 12:47:24 GMT
x-fly-edge-region iad
x-fly-app-region fra
fly-request-id 01FWREQXJY00CMQ5RA38V7CR6A-iad
I’m resurrecting this old thread because I’m seeing this happen again. It’s been solid since this fix was put in place, with only very occasional inter-region routing which I am putting down to a “disturbance in force”.
But my monitoring starting alerting again about 12 hours ago, with all checks which run from Germany correctly hitting the FRA Edge node, but about 50% of those routing to the App in IAD instead of FRA.
flyctl status looks fine, the app is healthy in both locations with no restarts. These are the headers from one response:
server Fly/253cbbff (2022-03-23)
date Tue, 29 Mar 2022 07:36:38 GMT
x-fly-edge-region fra
x-fly-app-region iad
fly-request-id 01FZA9NW2AMWK9W96CNANFNHG8-fra
It’s intermittent too. In all cases, the alert resolves itself on the next test run, only to happen again a few minutes later.
I’m still seeing this - it’s been happening for 36 hours now, and it looks like approximately a third of the requests from my monitoring which hit the FRA edge location are reaching my app in IAD and not FRA, and the latency figures match up with this.
The headers of the most recent occurrence:
server Fly/253cbbff (2022-03-23)
date Wed, 30 Mar 2022 08:01:38 GMT
x-fly-edge-region fra
x-fly-app-region iad
fly-request-id 01FZCXGBPBJQAKWZ8D24DHF6CV-fra
Thanks for the update. I haven’t seen any change, it’s still happening, most recently:
server Fly/253cbbff (2022-03-23)
date Wed, 30 Mar 2022 17:01:39 GMT
x-fly-edge-region fra
x-fly-app-region iad
fly-request-id 01FZDWD5DXY0GE3H2TMT2CS7AT-fra
Going forward, is it useful for me to post in this thread if I see it happen in the future? Or are you already aware when it happens and just need time to investigate the permanent fix?