Unexpected routing from an Edge to an App in a different region

I have an app which runs in IAD and FRA Fly regions. In order to ensure that both instances are healthy and reachable, I have external monitoring which runs from places close to those regions, which should hit the local Fly Edge and then the App in the same region. It compares the x-fly-edge-region and x-fly-app-region headers to check that this is the case. This shows up issues with Fly Edges, or when my instances get shuffled to a backup region etc. I’m happy to say that both are very rare!
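
A check along these lines might look like the following minimal Python sketch. Only the two header names come from the post; the function names `region_mismatch` and `check_url`, the use of a HEAD request, and the timeout are assumptions for illustration, not the actual monitoring code.

```python
from typing import Mapping, Optional
from urllib.request import Request, urlopen


def region_mismatch(headers: Mapping[str, str], expected_region: str) -> Optional[str]:
    """Return a description of the routing problem, or None if all is well.

    Hypothetical helper: compares the x-fly-edge-region and
    x-fly-app-region response headers against the region the
    monitoring probe is expected to hit.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    edge = lowered.get("x-fly-edge-region")
    app = lowered.get("x-fly-app-region")
    if edge is None or app is None:
        return "missing region header(s)"
    if edge != expected_region.lower():
        return f"hit edge {edge}, expected {expected_region.lower()}"
    if app != edge:
        return f"edge {edge} forwarded to app in {app}"
    return None


def check_url(url: str, expected_region: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch the URL and run the header comparison on the response.

    Assumes the app answers HEAD requests; switch to GET if it does not.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return region_mismatch(dict(response.headers.items()), expected_region)
```

Keeping the comparison in a pure function separate from the HTTP call means it can be exercised against captured headers, such as the ones posted further down this thread, without touching the network.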

For the past month or so (since the FRA outage, but that could be coincidental) the monitoring has been telling me that occasionally a request might hit the Fly Edge in IAD but then get forwarded to my app running in FRA, or vice-versa. There is nothing wrong with the apps in either location as far as I can tell, flyctl says they are happy and handling requests. The latency of the response matches up with the traffic having been tunneled across the Atlantic and back.

This happens in bursts of maybe an hour or two, and it’s never all requests which get forwarded to the other region; some are still handled locally. I’m wondering if it’s some form of load shedding: if CPU in one region is running hot, are requests sent to apps in other regions? Or is it a bug in the Edge to App routing?

If it helps, these are headers from one such cross-continental response just now:

server	Fly/5369a69b (2022-02-23)
date	Fri, 25 Feb 2022 12:47:24 GMT
x-fly-edge-region	iad
x-fly-app-region	fra
fly-request-id	01FWREQXJY00CMQ5RA38V7CR6A-iad

This is unexpected; it looks like a regression in our load balancing. Thank you for bringing it up! We’re going to see if we can fix that.

This should now be fixed.

We’ll monitor the situation to see if it happens again though.

Thanks for the quick fix!

I’ll keep an eye on my monitoring and will post in this thread if I see it happen again.

I’m resurrecting this old thread because I’m seeing this happen again. It’s been solid since the fix was put in place, with only very occasional inter-region routing, which I am putting down to a “disturbance in the Force”.

But my monitoring started alerting again about 12 hours ago: all checks which run from Germany correctly hit the FRA Edge node, but about 50% of them are routed to the App in IAD instead of FRA.

flyctl status looks fine, the app is healthy in both locations with no restarts. These are the headers from one response:

server	Fly/253cbbff (2022-03-23)
date	Tue, 29 Mar 2022 07:36:38 GMT
x-fly-edge-region	fra
x-fly-app-region	iad
fly-request-id	01FZA9NW2AMWK9W96CNANFNHG8-fra

It’s intermittent too. In all cases, the alert resolves itself on the next test run, only to happen again a few minutes later.

I’m still seeing this - it’s been happening for 36 hours now, and it looks like approximately a third of the requests from my monitoring which hit the FRA edge location are reaching my app in IAD and not FRA, and the latency figures match up with this.

The headers of the most recent occurrence:

server	Fly/253cbbff (2022-03-23)
date	Wed, 30 Mar 2022 08:01:38 GMT
x-fly-edge-region	fra
x-fly-app-region	iad
fly-request-id	01FZCXGBPBJQAKWZ8D24DHF6CV-fra

There are some issues with how we measure the round-trip time between our nodes and determine the best path to choose.

It’s fixed for now, but it might rear its head again. We’re looking into a permanent fix.

Thanks for the update. I haven’t seen any change, it’s still happening, most recently:

server	Fly/253cbbff (2022-03-23)
date	Wed, 30 Mar 2022 17:01:39 GMT
x-fly-edge-region	fra
x-fly-app-region	iad
fly-request-id	01FZDWD5DXY0GE3H2TMT2CS7AT-fra

Going forward, is it useful for me to post in this thread if I see it happen in the future? Or are you already aware when it happens and just need time to investigate the permanent fix?