fly-proxy now routes around unstable network links

Anycast networking is great. At Fly, Anycast allows us to route requests from your users to one of our edge servers closest to them, and magic™️ happens before the requests reach one of the (hopefully) many machines you have on our platform.

BGP is flexible enough that we can transparently change how requests reach our edges if one of our upstream network providers experiences issues. Until now, nothing similar could be done easily once a request entered our internal network. We run an internal WireGuard mesh where each pair of servers is at most one “hop” apart, and once we decide which machine to route a request to, it is sent directly to the server hosting it. If that fails, we back off and retry, either with the same machine or a different one.

Direct routing is most likely optimal, but as we expand to more and more regions, short blips of network instability between two regions are sometimes unavoidable despite high-quality upstream networks. When these short blips happen, say, between Asia and Europe, our edges in Asia trying to reach nodes in Europe would simply try to wait it out, resulting in slow or even failed responses. Although rare, these do happen from time to time, and we have heard from y’all that seeing these slow or failed responses does not feel great, for you or your users.

To reduce these occurrences, we are now working to expand the routing capabilities of fly-proxy, our frontend proxy that handles all incoming HTTP(S) requests and TCP connections. The first step is to allow HTTP(S) requests to be routed through another hop in the middle when needed, so that during the aforementioned network blips, we can temporarily and automatically route around the problematic link and maintain some level of connectivity between the two regions in question. This fallback connectivity will not be optimal, but most HTTP(S) requests should continue to flow, even though they might be slightly slower. Once the link recovers, fly-proxy will automatically switch back to direct routing.
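To make the idea concrete, here is a toy sketch of the "direct first, one extra hop as fallback" decision. It is not fly-proxy's actual implementation: the function `pick_route`, the `healthy_links` set, and the candidate relay list are all hypothetical names invented for illustration, and real link-health detection is much more involved than a set lookup.

```python
from typing import FrozenSet, List, Optional, Set


def pick_route(
    src: str,
    dst: str,
    healthy_links: Set[FrozenSet[str]],
    relays: List[str],
) -> Optional[List[str]]:
    """Return the hops from src to dst, preferring the direct link.

    healthy_links is a hypothetical set of region pairs currently
    considered reachable; relays is a hypothetical list of candidate
    intermediate regions, tried in order.
    """

    def link_ok(a: str, b: str) -> bool:
        return frozenset((a, b)) in healthy_links

    # Direct routing is used whenever the link is healthy.
    if link_ok(src, dst):
        return [src, dst]

    # Otherwise, look for a single intermediate hop whose links to
    # both ends are healthy.
    for relay in relays:
        if relay not in (src, dst) and link_ok(src, relay) and link_ok(relay, dst):
            return [src, relay, dst]

    # No direct link and no usable relay: the caller has to back off and retry.
    return None
```

Because the direct link is checked first on every call, the route falls back to direct as soon as the link is marked healthy again, which mirrors the automatic switch-back described above.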

We are working to slowly roll out this feature across our server fleet. Currently, we have enabled fallback routing on edges in our Singapore region (sin) due to its recent history of network instability, and we are already seeing the new capability kicking into action when edges in sin fail to connect to regions that are far away. In the coming weeks, we are going to enable this in more and more regions, and we also plan to expand the capability to raw TCP connections and Flycast addresses. Stay tuned!

(For those curious: if you set the header flyio-debug: doit on an HTTP(S) request, we now include a new field called fbn in the flyio-debug response header. This records the extra fallback hop the request went through, or null if it didn’t go through one. It is exceedingly rare for any individual request to hit the fallback case, but if one does, you now also have a way to peek inside our routing decisions!)
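If you want to check this from a script, something like the following could work. Treat it as a sketch under assumptions: it assumes the flyio-debug response header carries a JSON object (the post only says fbn is a field that may be null), and the helper names `fetch_debug_header` and `fallback_hop` are made up for this example.

```python
import json
import urllib.request
from typing import Any, Optional


def fetch_debug_header(url: str) -> Optional[str]:
    """Send a GET with `flyio-debug: doit` and return the raw
    flyio-debug response header, or None if it is absent."""
    req = urllib.request.Request(url, headers={"flyio-debug": "doit"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.headers.get("flyio-debug")


def fallback_hop(header_value: Optional[str]) -> Any:
    """Extract the fbn field from a flyio-debug header value.

    ASSUMPTION: the header value is a JSON object. Returns None when
    the header is missing, or when fbn is null or not present.
    """
    if not header_value:
        return None
    debug = json.loads(header_value)
    return debug.get("fbn")
```

Calling `fallback_hop(fetch_debug_header(url))` with the URL of your own app should then return None for the vast majority of requests, since the fallback path is rarely taken.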
