Anyone using fly.io from India / Sri Lanka / Pakistan regions - beware of poor routing

I have been observing very poor routing from these regions for the past month or so. For instance, on the Airtel network in India, traffic gets routed to fly’s cdg and fra ingress proxies. On the Reliance Jio network the situation is slightly better: about 50% of the time the traffic is routed via the sin or bom ingress proxy.

I have been in touch with Fly’s support team throughout, but unfortunately there is no resolution so far.

I find this to be CRAZY! How can anycast go so wrong that it’s routing traffic via regions so far away when at least 3 closer regions are available?

I am just making this post to check if people hosting services that have traffic coming in from India and neighbouring regions have noticed this behaviour.

To check this, hit any of your app endpoints on fly with the following request:

curl -I https://[your-app-name].fly.dev -H "flyio-debug: doit"

You would get a response that looks something like this (I don’t have any API / resource on the base URL, so I get a 403 from my app server, but that’s not the important bit):

HTTP/2 403
date: Wed, 17 Jan 2024 23:49:45 GMT
...
...
server: Fly/f9c163a6 (2024-01-16)
via: 2 fly.io
flyio-debug: {"n":"edge-nac-fra1-3f1d","nr":"fra","ra":"106.222.202.60","rf":"Verbatim","sr":"sin","sdc":"sg1","sid":"3d8ddd5f721989","st":0,"nrtt":244,"bn":"worker-pkt-sg1-96d6"}
fly-request-id: 01HMCX9B7Z97QEN2SX2DZJNEHW-fra

The edge through which your traffic is routed is the “n” param, and its region is “nr”. For me, this means the traffic for this request went via the “fra” region. I am based in Bangalore, India, and in this particular case my server was running in “sin”. So the traffic went all the way to Frankfurt and then back to Singapore. I get either “cdg” (Paris) or “fra”, roughly 50/50, for my requests.
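For anyone who would rather script this check than read the header by eye, here is a minimal Python sketch. It assumes the `flyio-debug` response header is JSON shaped like the sample above. Reading "sr" as the region of the serving machine is my assumption from that sample; only "n" and "nr" are explained in this post. The function names are made up for illustration.

```python
import json
import urllib.error
import urllib.request


def fetch_debug_header(url: str, timeout: float = 10.0) -> str:
    """Send a HEAD request with 'flyio-debug: doit' and return the raw
    'flyio-debug' response header (empty string if it is missing)."""
    req = urllib.request.Request(url, method="HEAD", headers={"flyio-debug": "doit"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            headers = resp.headers
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as the 403 above) still carry response headers.
        headers = err.headers
    return headers.get("flyio-debug", "")


def summarize_route(debug_header: str) -> dict:
    """Pull the edge node, edge region and (assumed) server region out of the header."""
    info = json.loads(debug_header)
    edge_region = info.get("nr")
    server_region = info.get("sr")  # assumption: "sr" is the serving machine's region
    return {
        "edge_node": info.get("n"),
        "edge_region": edge_region,
        "server_region": server_region,
        "same_region": edge_region == server_region,
    }
```

Calling `summarize_route(fetch_debug_header("https://[your-app-name].fly.dev"))` with your own app name should give you the edge region for that one request. A single request says little, so it is worth repeating it a few times and from different networks.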

The whole value prop of fly to me was that I would be able to run things on the edge. But routing this poor has the opposite effect. This has been going on since Jan 3rd 2024. Before that, the routing used to work fine. Initially fly’s support team was actively investigating this, but they could not find a fix, and they have now told me that there is no timeline for one.

If anyone at fly other than the support team reads this, I would request that you address this with urgency. India is not a small geo. And if you want to win potential startup clients here, of which there are thousands upon thousands, this should not be ignored.

My request to the readers: if you are running services that serve India and the surrounding regions, please check your routing using the above request and post your findings here.

[Edit]
Some supporting data:

  • Reliance Jio and Airtel are the two largest ISPs in India with a combined market share of anywhere from 70% to 80%
    source - List of internet service providers in India - Wikipedia, Market Share Of ISP's
  • My personal tests using proxy networks suggest 100% incorrect routing on Airtel, and 90% on Reliance. These are not very large-scale though; the data points cover about 100 different city + ISP combos in India.

I recently had an email support thread with @eli (unless there’s a different one) about this.

Same situation here: I have roughly 100 GB of traffic split between the US and India. However, most of the IN traffic gets re-routed to FRA. It appears to me that the IN traffic gets split between FRA and SIN at peak hours. The BOM edge is completely unused.

image

image

Last April, in “This is only possible on fly...”, I posted this image, and MAA was working. Later, the MAA instance was moved to SIN as it was actually faster and served Asia better.

April 2023:

Email response to me:

The crux of the matter is: intermediate ISPs often don't support
anycast traffic well, so in this case, routing config changes won't
resolve the issue. Instead, we need to turn up more sites nearby to
announce from — without this in place, we would see significant
disruption to our global traffic pattern.

Since this involves new hardware, it's more time-consuming than
tweaking bgp settings, but it is something that we are working to do
as quickly as we can— this is a top priority for us. 

There is no clear timetable / fix for this atm as hardware will need to be deployed. It looks like they decommissioned the MAA site. I’m really hoping they get it working…and keep it working this time.

Thanks for corroborating. Where are these graphs from, btw? Are these your internal dashboards?

The map and bandwidth screenshots are from https://fly-metrics.net/, which is the Grafana instance that is included by default.

I would suggest you look at the Data Out and put machines where the edges are.
image

I just went through and broke down the “other” edge traffic and placed instances there.

The country breakdown screenshot is from the analytics software https://plausible.io/

Regarding ping times: in my ad hoc testing, I’ve always had faster times to non-Indian instances, even when MAA existed.

Currently, ping times are:

  • FRA - 200-230ms
  • ORD - 300ms
  • Las Vegas VM 300+ms
  • Europe VM 200-230ms

So it really isn’t all that great.

Looks like your testing is far more thorough. Can you share your proxy tests / scripts? Will BOM be fast? SIN was faster than MAA last year, so I’m not sure.

In my case, I have an API running in the sin region, and I am recording the edge proxy through which my server receives traffic. 20-25% of traffic to my app is from India. For the majority of these users, the traffic gets routed via the cdg or fra edge proxies and then reaches sin. I just got curious right now and pulled some info

So my traffic is getting routed, on average, via routes that are 3 to 9 times as long as the better options. The latency corroborates this as well: instead of getting latencies of ~30ms (for sin servers), I get somewhere around ~150ms.
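To get a feel for that kind of multiplier, here is a rough great-circle sketch in Python. It is not how the numbers above were derived. It only compares straight-line distances, which real network paths do not follow, and the city coordinates are approximate values assumed for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Approximate (latitude, longitude) in degrees; assumed values for illustration.
CITIES = {
    "bangalore": (12.97, 77.59),
    "singapore": (1.35, 103.82),
    "frankfurt": (50.11, 8.68),
}


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def detour_ratio(client, edge, server):
    """How many times longer client -> edge -> server is than client -> server."""
    direct = great_circle_km(client, server)
    via_edge = great_circle_km(client, edge) + great_circle_km(edge, server)
    return via_edge / direct
```

With these coordinates, `detour_ratio(CITIES["bangalore"], CITIES["frankfurt"], CITIES["singapore"])` comes out between five and six, which sits inside the 3 to 9 range mentioned above.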

The problem is, even if I do host a server in bom, the roundtrip happens via cdg or fra, so it does not help things at all.

For my tests using proxies, I bought some ‘residential’ proxy usage via https://geonode.com/ and hit a ping endpoint on my servers. That endpoint also looked up the ISP via http://ip-api.com, using the proxy’s IP address (from the Fly-Client-IP header).
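The tallying step of a test like this could look roughly like the Python sketch below. It is an illustration of the idea, not the actual benchmark script. It assumes each test request has already been reduced to an (ISP, edge region) pair, and the function name and the choice of which regions count as "nearby" are hypothetical.

```python
from collections import Counter


def misrouted_share_by_isp(samples, nearby_regions):
    """Return, per ISP, the fraction of samples whose edge region is not nearby.

    samples: iterable of (isp_name, edge_region) pairs, one per test request.
    nearby_regions: set of region codes considered acceptable (caller's choice).
    """
    totals = Counter()
    misrouted = Counter()
    for isp, edge_region in samples:
        totals[isp] += 1
        if edge_region not in nearby_regions:
            misrouted[isp] += 1
    return {isp: misrouted[isp] / totals[isp] for isp in totals}
```

For traffic from India you might pass something like `{"bom", "sin"}` as `nearby_regions`, and feed in the "nr" value from the `flyio-debug` header as the edge region.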

The test/benchmark code is a bit mixed up with my private code, but I can pull it out and put it in a public repo if you (or anyone else) want. Some sample data that I get using the script:

:rofl:

And now I’m going to Sydney…200ms.

flyio-debug: {"n":"edge-cf-syd1-777c","nr":"syd","ra":"106.208.152.149","rf":"Verbatim","sr":"syd","sdc":"sy4","sid":"3d8d741dc1d098","st":0,"nrtt":0,"bn":"worker-pkt-sy4-c72b"}

image

Mumbai has not even been an edge for the past few hours. :rofl:

I’m not kidding, my app felt slightly faster. TIL

:man_facepalming:

Yup, it is totally faster. It would be nice to just stop at SIN though. :slight_smile:

:wave: Thank you for posting the data here! This is awesome.

For what it’s worth, we’re aware that routing from India is currently a bit of a mess. You’re seeing traffic get bounced around because we’re actively bringing up a bunch of additional anycast edges all over the planet to help. We’re working closely with our upstream providers to get better visibility here and to tweak our BGP advertisements to pull your traffic to a closer POP.

Well, tons of traffic was going to SYD and now we’re back to going to FRA. Keep at it I guess. :slight_smile:

Past 24 hours:
image

Yup, back to the same situation on my networks as well.

Real-world Chrome usage data from https://pagespeed.web.dev. The TTFB has to be really bad for Indian users for it to average 1.1s over all countries. :cry:

image

Oh c’mon fly! This is NOT cool. Running servers close to users is literally your pitch :angry::angry::angry::angry:

I do not know how you measure your routing efficacy, but it ain’t doing the job if this is not ringing an alarm somewhere. Many weeks (soon to be months) have gone by with this issue.

TBH, at that point I’d just put a CDN in front of fly and call it a day. Routing is hard, and for anycast to work correctly, you need both a lot of locations and a lot of peering/transit contracts with local carriers. Traditional CDNs should be way better at managing routing, and you have more control over which DCs to connect to.

That doesn’t work for me as I’m not delivering static content. I’m hosting a multiplayer game app. I would suggest that if a CDN would solve someone’s needs, they should simply stay away from Fly.

A CDN delivers dynamic content just fine, and that is a valid use case as per the AWS docs (as long as it’s HTTP(S) based).

Fly does provide other advantages excluding region support, so I wouldn’t say

If your game uses WebSockets or SSE for communication, I suggest trying CloudFront with caching disabled. It can be quite surprising.

So the thing is, even after a CDN proxy is put in between fly and the user, how does it ensure that routing will happen any better? My Bangalore traffic might hit a Mumbai PoP from AWS/Cloudflare, but then when it forwards to fly, fly might still ingress it via cdg. :person_shrugging:

I’m not talking about a small edge case in some weird region. This is happening for hundreds of millions of users.

That’s all beside the point though. My point stands: fly pitches that they run apps close to users. They MUST find ways to send that traffic to those apps over routes that are close for the user as well.

Yes, I understand your point. No one can guarantee that AWS would route your users to a specific PoP, but by utilizing Origin Shield, you can test out multiple shield locations to see which PoP has the highest likelihood of having the lowest latency.

I agree with your point that it’s Fly’s job to route users to the correct location. I’m just providing a way to remedy it until fly’s team fixes the issue, as fixing routing issues can take a long time.

Yup… Appreciate the ideas.

I’m just making some noise here in hopes that people notice. I’m a solo dev and don’t want to run lots of infrastructure across multiple clouds.

On a side note:
If I read the rants about fly on the web (especially from people who have moved on to other providers), it’s not that people are complaining about not having enough features. It is usually that the promised infrastructure does not work as advertised, or is not stable enough. So I just hope that a product manager somewhere is paying attention to that. I really want fly to succeed. But I just grimace a little bit every time I see a “fresh produce” post. Anyhoo, here’s to hoping they can figure it out… :crossed_fingers:
