If you have read our documentation and our many blog posts on the topic, you know that under the hood, all Fly Machines are connected by a mesh of Wireguard tunnels between our servers. This is what gives us features like seamless private networking via the 6PN address assigned to every Machine, regardless of where it runs. The mesh is also the default secure channel between our servers: Corrosion, our metrics and logs cluster, and fly-proxy, our Anycast load balancer, all send most of their traffic over Wireguard.
The Problem with Wireguard
The last of those components, fly-proxy, is by far the biggest user of the Wireguard mesh by traffic volume. When a connection hits one of our edge nodes, fly-proxy handles it and forwards it through the Wireguard mesh, sometimes via one or more hops through other nodes running fly-proxy, to a Machine chosen by the load-balancing algorithm. To balance traffic effectively, the edge proxy has to terminate TLS connections, which means the Wireguard mesh has been the only secure way to forward that traffic onward.
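To make that concrete, here is a toy sketch of the hot path in Go: terminate TLS at the edge, pick a backend, and relay bytes to it over the mesh. The listener address, certificate files, and single hard-coded backend are placeholders for illustration; fly-proxy's real code looks nothing like this, but the shape of the data flow is the same.

```go
// Toy sketch of the hot path: the edge terminates TLS, then relays the
// plaintext stream to a backend chosen by load balancing. In the real setup
// that backend is a Machine's 6PN address reached over the Wireguard mesh;
// every address and file name here is a placeholder.
package main

import (
	"crypto/tls"
	"io"
	"log"
	"net"
)

func main() {
	cert, err := tls.LoadX509KeyPair("edge.crt", "edge.key") // placeholder cert
	if err != nil {
		log.Fatal(err)
	}

	// Terminate TLS at the edge so the proxy can inspect the request and
	// pick a backend.
	ln, err := tls.Listen("tcp", ":443", &tls.Config{Certificates: []tls.Certificate{cert}})
	if err != nil {
		log.Fatal(err)
	}

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			// Hypothetical backend: a Machine's 6PN address, reached over
			// the Wireguard mesh before this change.
			backend, err := net.Dial("tcp", "[fdaa::dead:beef]:8080")
			if err != nil {
				return
			}
			defer backend.Close()
			go io.Copy(backend, c) // client -> machine
			io.Copy(c, backend)    // machine -> client
		}(conn)
	}
}
```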
This has worked exceptionally well for us: the proxy is kept simple, and traffic remains nicely protected by Wireguard’s cryptography. Over the years, though, it has begun to show strain as we receive more and more traffic. I will spare you the details of our investigations into the performance issues we have run into here; it mainly boils down to two related problems:
- The virtual interface created by Wireguard has only one rx (receive) queue and one tx (transmit) queue, which means it is bound by how fast a single thread can loop through all the packets.
- Between each pair of servers, all traffic through Wireguard can only land in one rx queue of the underlying NIC (network card). This is because the 4-tuple (src ip, src port, dst ip, dst port) of Wireguard traffic always stays the same for any given pair of servers, as each Wireguard tunnel only makes use of a single pair of source and destination ports.
A combination of these means that the Wireguard interface tends to get saturated long before the NICs on our edge servers are. The second problem can also be painful on its own from time to time, since a high-traffic app with a small number of Machines can saturate the Wireguard link between an edge and a “worker” (the server hosting a Fly Machine) without saturating the entire virtual interface.
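If you want a feel for why a fixed 4-tuple pins all of that traffic to one queue, here is a minimal sketch of how receive-side scaling picks a queue. It uses a plain FNV hash instead of the Toeplitz hash and indirection table real NICs implement, and the addresses, ports, and queue count are made up, but the property that bites us is the same: identical tuple, identical queue.

```go
// Minimal illustration (not the kernel's or any NIC's actual RSS code) of how
// hashing the flow tuple spreads packets across rx queues, and why a tunnel
// with a fixed tuple always lands on the same one.
package main

import (
	"fmt"
	"hash/fnv"
)

// flow is the 4-tuple the NIC hashes to choose an rx queue.
type flow struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
}

// pickQueue hashes the tuple and maps it onto one of nQueues rx queues.
func pickQueue(f flow, nQueues int) int {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s:%d->%s:%d", f.srcIP, f.srcPort, f.dstIP, f.dstPort)
	return int(h.Sum32() % uint32(nQueues))
}

func main() {
	// Every Wireguard packet between this pair of servers carries the same
	// tuple (addresses and ports are invented for the example)...
	wg := flow{"10.0.0.1", "10.0.0.2", 51820, 51820}
	fmt.Println("wireguard packets land on queue", pickQueue(wg, 16))

	// ...whereas plain TCP connections each get a fresh source port, so they
	// spread across all 16 queues.
	for port := uint16(40000); port < 40004; port++ {
		tcp := flow{"10.0.0.1", "10.0.0.2", port, 443}
		fmt.Println("tcp connection from port", port, "lands on queue", pickQueue(tcp, 16))
	}
}
```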
Over the past few months, there have been a number of incidents where this became the main limiting factor on edge performance, and it is the reason we introduced per-app bandwidth limiting: to ensure fairness between apps and to prevent one app from saturating a link shared with all the others. The initial limits we applied were quite harsh, since we knew where we started to run into problems, but we understood that this is not a great experience for users who expect to push a lot of bandwidth to their apps, and we have been working toward being able to raise that limit.
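The limiting itself is conceptually just a shared token bucket per app. Here is a rough sketch of that idea using golang.org/x/time/rate; the limit and burst numbers are invented for the example, and this is not the code we actually run.

```go
// Rough sketch of per-app bandwidth limiting with a shared token bucket.
// The library is golang.org/x/time/rate; the numbers are made up.
package bwlimit

import (
	"context"
	"io"
	"sync"

	"golang.org/x/time/rate"
)

// appLimiters hands out one shared token bucket per app, so every connection
// belonging to that app draws from the same bandwidth budget.
type appLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func (a *appLimiters) get(app string) *rate.Limiter {
	a.mu.Lock()
	defer a.mu.Unlock()
	if l, ok := a.limiters[app]; ok {
		return l
	}
	// Hypothetical limit: ~1 Gbit/s of payload with a 1 MiB burst.
	l := rate.NewLimiter(rate.Limit(125<<20), 1<<20)
	a.limiters[app] = l
	return l
}

// limitedCopy relays bytes from src to dst, blocking whenever the app's
// bucket is empty, so a single hot app can't saturate a shared link.
func limitedCopy(ctx context.Context, dst io.Writer, src io.Reader, l *rate.Limiter) error {
	buf := make([]byte, 32<<10)
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if werr := l.WaitN(ctx, n); werr != nil {
				return werr
			}
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err != nil {
			if err == io.EOF {
				return nil
			}
			return err
		}
	}
}
```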
Removing Wireguard from the Equation
There is no escape: if we want to raise the bandwidth limit, we have to handle a lot more traffic than we could before. We had only a few options:
- Figure out what was wrong with our Wireguard setup, and either fix our configuration or make improvements to Wireguard for our use case;
- Use a multi-tunnel setup of Wireguard, which essentially avoids both of the single-queue bottlenecks;
- Move the hot path (fly-proxy-to-fly-proxy forwarding) off Wireguard and free Wireguard up for other kinds of traffic (mainly 6PN, which is a much smaller fraction of traffic than what fly-proxy often needs to carry).
We burned some hours on (1) without much success. We did try switching to userspace Wireguard implementations (like wireguard-go, which the Tailscale folks have spent a lot of time optimizing) and got some improvements that way, but applying that across the fleet would require rolling downtime for every server (since, again, fly-proxy hard-depends on Wireguard to function). We also discussed (2), but ultimately decided that, given all the complexity it would add to fly-proxy or to our L3 network stack (we would need L3 bonding if we implemented this outside the proxy), we might as well fix the immediate problem with (3).
Around the end of last year, we started to gradually introduce a new kind of peer-to-peer connection between fly-proxy instances that bypasses the Wireguard mesh. These are TLS 1.3 connections authenticated on both sides: peers always verify each other's identity using their public keys, managed exactly like our Wireguard keys (and not by standing up a new internal CA), and they are restricted to a hard-coded set of cipher suites. In short, we want these connections to “feel” as close to the Wireguard interfaces as we can make them. But because we can now open multiple TCP connections outside the Wireguard mesh, we are no longer subject to the scaling limitations above, and both the NICs and Linux’s network stack are well optimized for these bog-standard TCP connections.
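In Go terms (our proxy is not written in Go, so treat this purely as a sketch of the idea, with names of our own invention), “verify peers by public key instead of a CA” looks roughly like pinning a hash of each peer’s SubjectPublicKeyInfo and checking it from a custom verification callback, with CA validation switched off on both sides:

```go
// Sketch of mutually authenticated TLS 1.3 with pinned public keys instead of
// a CA. The pinning scheme and helper names are assumptions for illustration;
// fly-proxy's actual implementation differs.
package peering

import (
	"crypto/sha256"
	"crypto/tls"
	"crypto/x509"
	"errors"
)

// pinnedPeers maps the SHA-256 of a peer's SubjectPublicKeyInfo to its name.
// The set would be distributed the same way Wireguard public keys already are.
type pinnedPeers map[[32]byte]string

// verify checks the presented leaf certificate against the pinned set and
// ignores chain/CA validation entirely.
func (p pinnedPeers) verify(rawCerts [][]byte, _ [][]*x509.Certificate) error {
	if len(rawCerts) == 0 {
		return errors.New("peer presented no certificate")
	}
	leaf, err := x509.ParseCertificate(rawCerts[0])
	if err != nil {
		return err
	}
	sum := sha256.Sum256(leaf.RawSubjectPublicKeyInfo)
	if _, ok := p[sum]; !ok {
		return errors.New("peer public key is not in the pinned set")
	}
	return nil
}

// config builds a TLS 1.3-only config that authenticates both directions: we
// present our own cert, require one from the peer, and accept it only if its
// public key is pinned.
func (p pinnedPeers) config(ourCert tls.Certificate) *tls.Config {
	return &tls.Config{
		MinVersion:   tls.VersionTLS13,
		Certificates: []tls.Certificate{ourCert},
		// Skip the normal CA-based checks on both sides...
		InsecureSkipVerify: true,
		ClientAuth:         tls.RequireAnyClientCert,
		// ...and replace them with the public-key pin.
		VerifyPeerCertificate: p.verify,
	}
}
```

The appeal of the pinned-key approach is that key management stays identical to what we already do for Wireguard: one public key per server, with no certificate chains or internal CA to operate.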
We first enabled this between a few busy regions that were showing intermittent issues near the end of last year, and were able to raise the bandwidth limits well past the point where we previously started to drop or time out connections. Since the new year, we have gradually rolled this out to all regions, and the rollout was fully completed on Friday. Our bandwidth limits are still rather conservative at the moment, but we will start increasing them over the coming weeks, and best of all, there is nothing you need to do to benefit from any of this. Of course, there are still bottlenecks elsewhere in our network stack, and we will keep working on those too.