So, why don’t I talk a little more specifically about how UDP works here, and if you all can spot easy wins that we could score, I’m totally open to them. It’s been a BPF-y week for me, so I’m up for doing some stuff here.
The annoying thing about UDP protocols is that proxied UDP packets don’t carry source addresses, and all UDP protocols depend on those to send responses. Our UDP forwarding pushes and pops proxy headers onto UDP packets to forward them.
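To make the push/pop idea concrete, here's a rough userland sketch in Python. The header layout (4-byte IPv4 source address plus 2-byte source port) and the function names are made up for illustration; this is not our actual wire format, just the general shape of "stash the original source in front of the payload, strip it off at the other end".

```python
import socket
import struct

# Hypothetical header layout (an assumption, not the real wire format):
# 4-byte IPv4 source address followed by a 2-byte source port, network byte order.
HEADER = struct.Struct("!4sH")

def push_header(payload: bytes, src_ip: str, src_port: int) -> bytes:
    """Prepend the original client address so the next hop can recover it."""
    return HEADER.pack(socket.inet_aton(src_ip), src_port) + payload

def pop_header(packet: bytes):
    """Strip the header and return (src_ip, src_port, original payload)."""
    raw_ip, src_port = HEADER.unpack_from(packet)
    return socket.inet_ntoa(raw_ip), src_port, packet[HEADER.size:]

if __name__ == "__main__":
    wrapped = push_header(b"dns query", "203.0.113.7", 53124)
    print(pop_header(wrapped))
```

The point is only that the receiving side gets the client's address back without needing any per-packet state in between.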
We have two forwarding paths for our Anycast network. The most important of them is fly-proxy, which is at bottom a straightforward socket proxy. fly-proxy forwards almost all the TCP traffic in our network and does all the HTTP dances. The other forwarding path is BPF, which is what we use for UDP. UDP never hits fly-proxy; in fact, it never hits userland on any of our systems at all.
In-kernel UDP forwarding comes with some limitations. A mostly accurate way to think about how BPF packet processing works is that we can’t write loops in the kernel. The kernel communicates with userland using “maps”, which are essentially a kernel-resident Redis (roughly similar, if more primitive, data types). Any decision-making we do in the UDP forwarding path has to be represented in a map somehow.
Our current forwarding maps are pretty simple. We dispatch packets based on their target IP addresses. Userland code keeps a master forwarding map populated on each host on our network with a mapping of app IP address (if that app has UDP services) to “nearest” responsive worker for that app.
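As a mental model, the edge side looks roughly like the Python below. A plain dict stands in for the kernel map, and the names (forwarding_map, userland_update, edge_dispatch) are invented for this sketch; the real thing is BPF, not Python.

```python
# Stand-in for the kernel map: app IP -> nearest responsive worker.
forwarding_map = {}

def userland_update(app_ip, worker_ip):
    """Userland side: keep the map current as workers come and go."""
    forwarding_map[app_ip] = worker_ip

def edge_dispatch(dst_ip):
    """Forwarding side: one lookup, no loops.
    None means the target isn't an app with UDP services."""
    return forwarding_map.get(dst_ip)

if __name__ == "__main__":
    userland_update("198.51.100.10", "10.0.0.5")
    print(edge_dispatch("198.51.100.10"))
    print(edge_dispatch("198.51.100.99"))
```

All the "which worker is nearest and responsive" logic lives in userland; the forwarding path only ever reads the answer.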
That’s essentially it! On the worker side, after packets are processed at the edge, we keep some state — we want to route UDP responses back through the same path they took inbound, so when we process an incoming UDP packet on a worker we update a second map saying “responses to this source address should go to this edge host”.
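The worker-side state can be pictured the same way. In this sketch I'm assuming the key is the client's (IP, port) pair and that edges are identified by a name string; both are illustrative choices, not a description of the actual map layout.

```python
# Stand-in for the worker-side kernel map: client (ip, port) -> edge host.
return_path = {}

def on_inbound(client, edge_host):
    """Record which edge a client's packet arrived through."""
    return_path[client] = edge_host

def route_reply(client):
    """Look up the edge to send the reply back through.
    None means we have no record for this client."""
    return return_path.get(client)

if __name__ == "__main__":
    on_inbound(("203.0.113.7", 53124), "edge-ord")
    print(route_reply(("203.0.113.7", 53124)))
```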
We can add, I guess, arbitrary hash lookups, and we can have different hash lookups on our edge (where incoming packets arrive) and on our workers. We can chain hash tables together (there’s a map that is “map of maps”). But we can’t loop over a chain; whatever the chain is, it has to be expressed in essentially straight-line code.
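Here's what "chained but straight-line" means in practice, modeled in Python with nested dicts standing in for a map of maps. The particular chain (app IP to an inner map keyed by port) and the worker names are hypothetical; what matters is that the number of lookups is fixed when the code is written.

```python
# Outer map: app IP -> inner map. Inner map: destination port -> worker.
# Both levels are hypothetical examples of what a chain could hold.
outer = {
    "198.51.100.10": {53: "worker-a", 5353: "worker-b"},
}

def lookup(dst_ip, dst_port):
    """Exactly two lookups, written out one after the other.
    There is no 'while there is another table' loop."""
    inner = outer.get(dst_ip)      # lookup 1
    if inner is None:
        return None
    return inner.get(dst_port)     # lookup 2

if __name__ == "__main__":
    print(lookup("198.51.100.10", 53))
    print(lookup("198.51.100.10", 9999))
```

If we wanted a third level, we'd have to write a third lookup by hand; the depth can't depend on the data.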
Most of the work we’re talking about here will have to happen on edges (by the time a packet hits a worker we’ve already decided the path it’s taking).
We can do lookups on ports (or any other packet data that is at a fixed offset in the packet).
It would be difficult to do arbitrary port mappings; when figuring out what to do with an incoming packet based on port, I essentially have to be able to make a decision based on a fixed number of map lookups. It’s hard to have an array of port mappings and a foreach over them.
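One way to think about working within that constraint (purely a sketch of an idea, not something we've built) is to do the looping in userland and hand the forwarding path a flat table. The function names here are made up, and a dict again stands in for a kernel map.

```python
# Flat table keyed by (app IP, port). Hypothetical, for illustration only.
port_map = {}

def install_range(app_ip, first_port, last_port, worker):
    """Userland side: loops are fine here, so expand the range up front."""
    for port in range(first_port, last_port + 1):
        port_map[(app_ip, port)] = worker

def forward(dst_ip, dst_port):
    """Forwarding side: exactly one lookup, no iteration over mappings."""
    return port_map.get((dst_ip, dst_port))

if __name__ == "__main__":
    install_range("198.51.100.10", 5000, 5003, "worker-a")
    print(forward("198.51.100.10", 5002))
    print(forward("198.51.100.10", 5004))
```

The obvious cost is that the table grows with the width of every range, so this trades map memory for a fixed lookup count.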
It’s also painful to keep per-source state. We do that on the worker because the penalty for evicting state and then referencing it is low (in the worst case, if we process an incoming packet on a worker and then lose its state before the reply is generated, we just do direct server response from that worker instead of honoring the ingress path). But if we’re using it to pin sessions to workers, when we lose state, we break the underlying session.
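To show why eviction is cheap in one case and expensive in the other, here's a small Python model of a bounded table. I'm assuming least-recently-used eviction and a tiny capacity just to make the behavior visible; the class and function names are invented for this sketch.

```python
from collections import OrderedDict

class LruMap:
    """Bounded table that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the oldest entry

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]

def reply_route(return_path, client):
    """Return-path use: a miss just degrades to a direct reply."""
    edge = return_path.get(client)
    if edge is None:
        return ("direct", None)
    return ("via-edge", edge)

if __name__ == "__main__":
    table = LruMap(capacity=2)
    for client, edge in [("a", "edge-1"), ("b", "edge-2"), ("c", "edge-3")]:
        table.put(client, edge)
    print(reply_route(table, "a"))  # "a" was evicted when "c" arrived
    print(reply_route(table, "c"))
```

For the return path, the miss on "a" is harmless. If the same table were pinning sessions to workers, that miss would mean picking a worker again with no memory of the previous choice, which is exactly the broken-session case.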
I’m cool with talking about this, especially if y’all have ideas on how we might think about implementing stuff. I just want to make sure we’re open about what our limitations are.