Regional/Node UDP ranges to alleviate IPv4 limits

I had a potentially weird networking idea and wondered how possible it would be for Fly to implement. Would it be possible to make it so we could specify UDP port ranges that specifically route to individual regions (or better yet, specific nodes)? This might be a way to save on the need for multiple IPv4 addresses when using regional routing on things like LiveKit.

On LiveKit, I can already tell a node to limit the range of UDP ports it uses. However, the Fly UDP implementation is still pretty rudimentary (you list one UDP port and all UDP ports get routed, with no remapping). I’m not sure if it’s feasible, but I’m thinking something along the lines of:

[[services.ports]]
regions = ["sea"]
protocol = "udp"
port_start = 10000
port_end = 19999

[[services.ports]]
regions = ["ord"]
protocol = "udp"
port_start = 20000
port_end = 29999

[[services.ports]]
regions = ["fra"]
protocol = "udp"
port_start = 30000
port_end = 39999

I’m not sure this would work out of the box with LiveKit, but I think it might as long as the individual LiveKit nodes matched their port allocations based on region.

This is by no means an elegant or perfect solution, but it would certainly be a reasonable holdover until more IPv4 addresses can be allocated to all of the regions, or until a better IPv6-based solution can be built.

It’s… possible, but that looks pretty fussy, and I worry that it will tangle up our API and infra for a feature that exactly one framework (LiveKit) will actually use.

Right now, our UDP routing actually ignores ports altogether. It’s happening entirely in-kernel (in BPF code) and bypassing our proxies.

To implement this, we’d have to implement an additional layer of maps (to discriminate on ports), and come up with an allocation scheme for the ports themselves — we can’t do arbitrary loops in BPF code. It might be pretty gross to actually build.

Are there other applications where this would be helpful that we can come up with? I’m open to it, but also giving candid feedback. :slight_smile:

Yeah, that’s fair :slight_smile:

It is probably relevant to other services as well, but I get it if that isn’t your target market. Really, it’s anything that has to deal with UDP NAT traversal issues, with WebRTC being the prime example.

How does the code handle UDP today? If one UDP port is listed in the config, does it just accept all UDP ports and fire them off to the closest instance with no translation?

As an alternative to manually specifying ranges, is there any way it could act more like a stateful firewall? I.e., when an outbound connection occurs from a node, could it 1:1 map that port to the specific instance? I’m not sure how this would work for other types of UDP services, but since WebRTC will initiate an outbound connection for STUN, it would let you identify the specific node expecting to receive data on that port and route to it, instead of only to the closest node.
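To make the idea concrete, here’s a toy sketch in plain C (made-up names, nothing to do with Fly’s actual implementation): the edge learns which worker “owns” a UDP port when it sees that worker’s outbound packet, and falls back to nearest-instance routing for ports it hasn’t seen.

/* Toy userspace model of the "learn the owner from outbound traffic" idea.
 * Not Fly's code; names and sizes are made up for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NUM_PORTS 65536
#define NO_WORKER 0

/* port_owner[p] = worker id that last sent an outbound UDP packet from port p */
static uint32_t port_owner[NUM_PORTS];

/* Called when the edge sees an outbound UDP packet from a worker. */
static void note_outbound(uint16_t src_port, uint32_t worker_id) {
    port_owner[src_port] = worker_id;
}

/* Called for inbound UDP: prefer the worker that "opened" the port,
 * otherwise fall back to the nearest worker (today's behaviour). */
static uint32_t route_inbound(uint16_t dst_port, uint32_t nearest_worker) {
    uint32_t owner = port_owner[dst_port];
    return owner != NO_WORKER ? owner : nearest_worker;
}

int main(void) {
    note_outbound(10042, 7);                 /* worker 7 sent a STUN binding request */
    printf("%u\n", route_inbound(10042, 3)); /* -> 7, pinned to the learning worker */
    printf("%u\n", route_inbound(20000, 3)); /* -> 3, falls back to nearest */
    return 0;
}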

Again, I get it if this doesn’t fit priorities right now. I was just recently brushing up on the details of NAT traversal from How NAT traversal works · Tailscale, and it got me thinking about possible solutions to the limited IPv4 space and how to reach specific nodes when behind a regional load balancer like the ones you provide.

Are there other applications where this would be helpful that we can come up with? I’m open to it…

It is useful in cases like letting clients choose server regions.

For example, some clients may want to stick with servers only in the EU, only in California, or only in India. This UDP port-range trick to pin clients to regions works for those use cases in particular. Another alternative is to deploy a region-specific Fly app per region…

I’ve mentioned this elsewhere on the forums, but I’d also appreciate some form of UDP pinning, that is, connections from the same client ip-port ending up at the same server that earlier packets went to (think connection-oriented protocols over UDP, like QUIC, or ones tunneling TCP, like WireGuard).

For both of the solutions above, and without having to maintain a map, maybe some form of consistent hashing / rendezvous hashing on the client IP (ignoring the client port, for now) among the VMs in a single region does the trick? (This has other pitfalls, like poor load-balancing properties, and so could be exposed behind a feature flag.)
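Something like the following is what I have in mind: a minimal rendezvous-hashing (highest-random-weight) sketch in C, where the hash and worker list are placeholders, and the loop over workers is exactly the part that would need to be bounded or unrolled in BPF.

/* Minimal rendezvous (highest-random-weight) hashing sketch: pick a worker for
 * a client IP without keeping any per-flow state. Hash and worker ids are
 * placeholders, not anything Fly actually uses. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t fnv1a(const void *buf, size_t len, uint64_t seed) {
    const uint8_t *p = buf;
    uint64_t h = seed ^ 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Every edge that runs this with the same worker list picks the same worker for
 * a given client, and removing one worker only remaps that worker's clients. */
static size_t pick_worker(uint32_t client_ip, const uint64_t *worker_ids, size_t n) {
    size_t best = 0;
    uint64_t best_score = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t score = fnv1a(&client_ip, sizeof client_ip, worker_ids[i]);
        if (score >= best_score) { best_score = score; best = i; }
    }
    return best;
}

int main(void) {
    uint64_t workers[] = { 101, 102, 103 };
    uint32_t client = 0xc0a80101; /* 192.168.1.1 */
    printf("worker index: %zu\n", pick_worker(client, workers, 3));
    return 0;
}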

So, why don’t I talk a little more specifically about how UDP works here, and if you all can spot easy wins that we could score, I’m totally open to them. It’s been a BPF-y week for me, so I’m up for doing some stuff here.

The annoying thing about UDP protocols is that proxied UDP packets don’t carry source addresses, and all UDP protocols depend on those to send responses. Our UDP forwarding pushes and pops proxy headers onto UDP packets to forward them.

We have two forwarding paths for our Anycast network. The most important of them is fly-proxy, which is at bottom a straightforward socket proxy. fly-proxy forwards almost all the TCP traffic in our network and does all the HTTP dances. The other forwarding path is BPF, which is what we use for UDP. UDP never hits fly-proxy; in fact, it never hits userland on any of our systems at all.

In-kernel UDP forwarding comes with some limitations. A mostly accurate way to think about how BPF packet processing works is that we can’t write loops in the kernel. The kernel communicates with userland using “maps”, which are essentially a kernel-resident Redis (with roughly similar, if more primitive, data types). Any decision-making we do in the UDP forwarding path has to be represented in a map somehow.

Our current forwarding maps are pretty simple. We dispatch packets based on their target IP addresses. Userland code keeps a master forwarding map populated on each host on our network, mapping an app’s IP address (if that app has UDP services) to the “nearest” responsive worker for that app.

That’s essentially it! On the worker side, after packets are processed at the edge, we keep some state — we want to route UDP responses back through the same path they took inbound, so when we process an incoming UDP packet on a worker we update a second map saying “responses to this source address should go to this edge host”.
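If it helps to make those two maps concrete, a libbpf-style sketch with roughly that shape might look like the following. The names, key/value types, and sizes here are illustrative guesses, not our actual code.

/* Illustrative only; names, types, and sizes are placeholders, not our real maps.
 * Compiles as a map-only BPF object: clang -O2 -g -target bpf -c udp_maps.c */
#include <linux/bpf.h>
#include <linux/in6.h>
#include <bpf/bpf_helpers.h>

/* Edge side: app Anycast address -> "nearest" responsive worker for that app. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct in6_addr);
    __type(value, struct in6_addr);
} app_to_worker SEC(".maps");

/* Worker side: client source address -> edge host that forwarded the packet,
 * so UDP responses can retrace the inbound path. */
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 1 << 20);
    __type(key, struct in6_addr);
    __type(value, struct in6_addr);
} reply_to_edge SEC(".maps");

char _license[] SEC("license") = "GPL";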

We can add, I guess, arbitrary hash lookups, and we can have different hash lookups on our edge (where incoming packets arrive) and on our workers. We can chain hash tables together (there’s a map type that is a “map of maps”). But we can’t loop over a chain; whatever the chain is, it has to be expressed in essentially straight-line code.

Most of the work we’re talking about here will have to happen on edges (by the time a packet hits a worker we’ve already decided the path it’s taking).

We can do lookups on ports (or any other packet data that is at a fixed offset in the packet).

It would be difficult to do arbitrary port mappings; when figuring out what to do with an incoming packet based on port, I essentially have to be able to make a decision based on a fixed number of map lookups. It’s hard to have an array of port mappings and a foreach over them.

It’s also painful to keep per-source state. We do that on the worker because the penalty for evicting state and then referencing it is low (in the worst case, if we process an incoming packet on a worker and then lose its state before the reply is generated, we just do direct server response from that worker instead of honoring the ingress path). But if we’re using it to pin sessions to workers, when we lose state, we break the underlying session.

I’m cool with talking about this, especially if y’all have ideas on how we might think about implementing stuff. I just want to make sure we’re open about what our limitations are.


Thomas: I am not 10% the eng that you are, so there can only be ignorant (as opposed to informed) suggestions from me, and that’s pretty much about my limit, too. (: With that out of the way…

that proxied UDP packets don’t carry source addresses, and all UDP protocols depend on those to send responses.

The way fly-proxy handles TCP today doesn’t preserve the source (client) ip-port. We’ve got an admission-control layer (an api-gateway) fronting our service that depends on the source (client) ip-port, but, well, we couldn’t use it on Fly. Fly’s support for the PROXY protocol is a godsend, which I need to explore integrating at a later date. This is just to note that there are indeed legitimate use cases for TCP apps needing the source (client) ip-port.

Our UDP forwarding pushes and pops proxy headers onto UDP packets to forward them.

Does this mean UDP is tunneled within UDP, or do I misunderstand?

I essentially have to be able to make a decision based on a fixed number of map lookups. It’s hard to have an array of port mappings and a foreach over them.

I have never looked at any BPF code (in my life) to know if this is possible, but how about we reduce the port-range problem to something that looks like IP subnet classes? For instance, say Fly splits the ports into groups of 1024. I can then specify in my toml how the first 1024, the second 1024, the third 1024, and so on up to the 64th 1024-port subgroup should be routed (if that makes sense). We did something similar for a toy prototype to shape traffic in our cluster of hosts (on AWS, before Global Accelerator had support for pinning):

# unassigned port subgroups are treated to default routing behaviour
# if valid, overlapping port subgroups are treated to smallest range wins behaviour

[[services.ports]]
regions = ["sea"]
protocol = "udp"
port_mask = 0x0000_0000_0000_ffff # 16*1024 ports; range[0-16384)

[[services.ports]]
regions = ["fra"]
protocol = "udp"
port_mask = 0x0000_ffff_ffff_0000 # 32*1024 ports; range[16384-49152)

[[services.ports]]
regions = ["ord"]
protocol = "udp"
port_mask = 0x0001_0000_0000_0000 # 1*1024 ports; range[49152-50176)

[[services.ports]]
regions = ["atl"]
protocol = "udp"
port_mask = 0x1ffe_0000_0000_0000 # 12*1024 ports; range[50176-62464)

[[services.ports]]
regions = ["maa"]
protocol = "udp"
port_mask = 0xe000_0000_0000_0000 # 3*1024 ports; range[62464-65536)

A bitmask like this lends itself to overlaps (if desired) and cleaner splits (if desired) among the pre-defined 1024-port subgroups. Not sure about the dev-x, though.
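To be concrete about keeping the lookup loop-free: the masks could be expanded once, at config time, into a 64-entry group-to-region table, so the per-packet work is a single index on port / 1024. A plain-C sketch with made-up names (nothing here is Fly’s implementation):

/* Expand each region's 64-bit mask into a 64-entry lookup table once, then
 * resolve a port with a single shift and index; no loop over port ranges
 * in the per-packet path. */
#include <stdint.h>
#include <stdio.h>

#define GROUPS 64       /* 65536 ports / 1024 ports per subgroup */
#define NO_REGION (-1)

struct region_mask { const char *region; uint64_t mask; };

/* Config-time: later entries win here; "smallest range wins" could be applied
 * instead by sorting on popcount first. */
static void build_table(const struct region_mask *cfg, int n, int table[GROUPS]) {
    for (int g = 0; g < GROUPS; g++) table[g] = NO_REGION;
    for (int i = 0; i < n; i++)
        for (int g = 0; g < GROUPS; g++)
            if (cfg[i].mask & (1ULL << g)) table[g] = i;
}

/* Packet-time: one shift, one array index. */
static int region_for_port(const int table[GROUPS], uint16_t port) {
    return table[port >> 10]; /* port / 1024 */
}

int main(void) {
    struct region_mask cfg[] = {
        { "sea", 0x000000000000ffffULL }, /* groups  0-15: ports     0-16383 */
        { "fra", 0x0000ffffffff0000ULL }, /* groups 16-47: ports 16384-49151 */
        { "ord", 0x0001000000000000ULL }, /* group  48:    ports 49152-50175 */
    };
    int table[GROUPS];
    build_table(cfg, 3, table);
    int r = region_for_port(table, 50000);
    printf("%s\n", r == NO_REGION ? "default" : cfg[r].region); /* -> ord */
    return 0;
}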

But if we’re using it to pin sessions to workers, when we lose state, we break the underlying session.

Sure, but consistent hashing among workers (among a set of “nearest”/“known” workers) using the client ip-port would help here, pinning a UDP flow without having to maintain state? Though, I must admit, I haven’t thought through all the implications of doing this at the edge.

Thanks for the detailed explanation! Appreciate it.

So, TCP doesn’t usually need the source addresses, because it’s connection-oriented. You can run it through any arbitrary chain of proxies and the responses will all be routed back hop-by-proxy-hop. TCP keeps state blocks (TCBs) for each of those proxy connections, which is what makes that work.

The same is not true of UDP. You can simulate TCP with UDP, of course, but different protocols handle this differently (and none of them have a SYN or an RST to set up and tear down state).

So while there are a bunch of TCP applications where you can get away with raw plugboarded proxies (stripping the source addresses), there aren’t for UDP.

(None of this matters at all for HTTP protocols, because they have headers to carry this kind of metadata.)

For what it’s worth: some non-HTTP TCP applications do care about source addresses, and for those we support the HAProxy proxy-protocol. But it’s fussy: your TCP software has to know about proxy-proto to make it work.

To your latter question: we can do power-of-2 port ranges (or really power-of-anything ranges, so long as it’s the same power for everyone).


Makes sense to use the PROXY protocol for TCP, given the current fly-proxy implementation.

The same is not true of UDP. You can simulate TCP with UDP, of course, but different protocols handle this differently (and none of them have a SYN or an RST to set up and tear down state).

Well, gateway routers / NATs usually time out (“RST”/“FIN”) UDP flow state in 30s or 2m (or sometimes up to 5m). For instance, AWS Global Accelerator times out UDP “connections” after 30s: How AWS Global Accelerator works - AWS Global Accelerator

FWIW, QUIC (RFC 9000 §10.1.2) recommends these values (30s / 2m) as well for its ping ↔ ack probe ceremony, though with QUIC’s support for connection migration (RFC 9000 §5.1), it is unlikely a BPF program (which is limited to reading fields at fixed offsets in the packet) is suitable for pinning based on QUIC Connection IDs (RFC 9000 §17.2).
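The only way I can see a fixed-offset read working for QUIC is if the server fixes its own connection-ID length up front, since a short-header packet doesn’t encode the DCID length. A sketch, assuming an 8-byte CID (this is just my illustration, not anything Fly does):

/* Sketch: pull the Destination Connection ID out of a QUIC short-header packet.
 * Works only if the server has fixed its CID length ahead of time (8 bytes
 * assumed here), because the short header does not carry that length. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

#define FIXED_CID_LEN 8

/* Returns 0 on success; -1 for long-header packets or truncated input. */
static int short_header_dcid(const uint8_t *pkt, size_t len, uint8_t dcid[FIXED_CID_LEN]) {
    if (len < 1 + FIXED_CID_LEN) return -1;
    if (pkt[0] & 0x80) return -1;         /* long header: CID lengths are explicit there */
    memcpy(dcid, pkt + 1, FIXED_CID_LEN); /* DCID starts right after the flags byte */
    return 0;
}

int main(void) {
    uint8_t pkt[32] = { 0x40, 1, 2, 3, 4, 5, 6, 7, 8 }; /* flags byte + 8-byte DCID */
    uint8_t dcid[FIXED_CID_LEN];
    if (short_header_dcid(pkt, sizeof pkt, dcid) == 0)
        printf("first DCID byte: %u\n", dcid[0]);
    return 0;
}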

I guess Kurt was right. Apps had better handle UDP steering for connection-oriented protocols within their own Fly apps.

And yes: I wouldn’t mind seeing this implemented :smiley: