UDP connections time out after exactly 5 minutes despite active keepalives

I’m experiencing consistent UDP connection timeouts on Fly.io that occur after exactly 5 minutes, regardless of active traffic.

Setup:

  • Two relay servers in sin region, both using shared dedicated IPv4
  • UDP server bound to fly-global-services:port as documented (see the bind sketch after this list)
  • Client sending keepalive packets every 2 seconds
  • Both machines remain in “started” state (not auto-sleeping)
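
For reference, the bind logic boils down to something like this minimal C sketch (not our actual code; the port is a placeholder and error handling is trimmed). The key point, per the docs, is to resolve and bind fly-global-services rather than 0.0.0.0:

```c
/* Bind a UDP socket to the fly-global-services address.
 * Minimal sketch: port is a placeholder, error handling trimmed. */
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int bind_fly_udp(const char *port)
{
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* fly-global-services resolves to an IPv4 address */
    hints.ai_socktype = SOCK_DGRAM;

    if (getaddrinfo("fly-global-services", port, &hints, &res) != 0)
        return -1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && bind(fd, res->ai_addr, res->ai_addrlen) != 0) {
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}
```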

Observed behavior:

  • UDP works perfectly for exactly 5 minutes
  • After 5 minutes, server stops receiving ANY UDP packets (goes completely mute)
  • Keepalives every 2 seconds do not prevent this timeout
  • Inconsistent recovery: One machine can re-establish UDP after reconnect, the other cannot

Testing done:

  • Verified proper fly-global-services binding
  • Confirmed machines stay awake during timeout
  • Tested with aggressive keepalives (2-second intervals)
  • Reproduced consistently across multiple test sessions

Questions:

  1. Is there an undocumented 5-minute hard timeout for UDP flows?
  2. Why does recovery behavior vary between machines in the same region?
  3. Is this related to the recent UDP fixes deployed on January 15th?

This appears similar to the pattern of UDP reliability issues reported over the past few years. Any official guidance on UDP session management would be appreciated.

Fly’s UDP works with individual packets, not flows. What protocol are you using on top of UDP? Can you run tcpdump on the client/server to see whether any packets are coming through?
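
For example, something like this on the Machine (5000 being a placeholder for your service port):

```
tcpdump -ni any udp and port 5000
```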

I’m not sure if “shared IPv4” was a typo, since you do mention it working for some time, but UDP only works with dedicated IPs.

Thanks for answering.

Oh yeah, dedicated IP, sorry.
I’m aware of the UDP internals; we’re using our own stack based on KCP: shards/shards/modules/network/network_kcp.cpp at devel · fragcolor-xyz/shards · GitHub.

I did run tcpdump and, as I said, it was a flat line after a few minutes of echoing.

By the way, we’re considering migrating from Azure to Fly.io for https://formabble.com/; for more context, our game relays coordinate game-state CRDTs over UDP.

I think I found the root cause:

Key discovery: The 5-minute timeout only occurs when running multiple machines in the same app. With a single machine per app, UDP works fine indefinitely.

My setup that failed:

  • 1 app with 2 machines on different ports (e.g. 5000, 5001)
  • Both bound to fly-global-services:port
  • After 5 minutes, UDP routing breaks

Questions:

  1. How does Fly’s UDP proxy/anycast handle routing to multiple machines in the same app?
  2. Does UDP traffic get “sticky” to the first machine that responds, and does this stickiness expire after 5 minutes?
  3. Is the recommended architecture one app per UDP service rather than multiple machines per app?
  4. For game relay servers, should we deploy each relay as a separate Fly app rather than scaling machines within a single app?

This would explain the inconsistent recovery behavior I observed: the proxy was probably getting confused about which machine to route to after the timeout.

Can you confirm if the “one app per UDP service” model is the intended design pattern?

Cheers,

Hi… The UDP side of the Fly.io platform is limited, even more so than is apparent at first glance, and a little underspecified, but my guess is that the behavior that you’re seeing is primarily due to a misunderstanding of how fly.toml works.

(That configuration file trips up many people at first, even those with extensive prior networking experience—in part because flyctl itself usually doesn’t warn about logical, or even syntactic, mistakes.)

You didn’t mention process groups at all, for example, which are the only way to get the structure that you sketched out to work reliably. (A rough sketch of what I mean follows below.)
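
Roughly, something like this in fly.toml (a hypothetical, untested sketch; names and ports are placeholders): each process group gets its own service stanza, and each service is pinned to its group via `processes`:

```toml
# Hypothetical sketch: one process group per UDP port.
[processes]
  relay_a = "/app/relay --port 5000"
  relay_b = "/app/relay --port 5001"

[[services]]
  protocol = "udp"
  internal_port = 5000
  processes = ["relay_a"]   # route this service only to relay_a Machines

  [[services.ports]]
    port = 5000

[[services]]
  protocol = "udp"
  internal_port = 5001
  processes = ["relay_b"]

  [[services.ports]]
    port = 5001
```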

I’m not with Fly.io, but one app per external service is the sweet spot of their platform, in my opinion.

With anything else, you find yourself fighting the tools, TOML serialization, etc.

Their platform is mainly intended for a fleet of Machines that are interchangeable and more or less disposable.

An architecture that instead emphasizes a single Machine per port is generally a bad idea—but might be okay in special cases, if, say, the clients themselves are equipped to failover from one port to the other.

Anyway, just my own 2¢…


Thanks for your input!

We’re bypassing the TOML completely; the whole deployment is fully automated using bits of the Machines API and GraphQL.
We tried using processes, but they seemed to be ignored at the firewall/reverse-proxy layer (the docs even mention this); maybe I’ll give it another go.
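
In case it’s useful context, the services section of our machine-creation payload looks roughly like this (a sketch following the Machines API shape; image and port are placeholders):

```json
{
  "config": {
    "image": "registry.fly.io/our-relay:latest",
    "services": [
      {
        "protocol": "udp",
        "internal_port": 5000,
        "ports": [{ "port": 5000 }]
      }
    ]
  }
}
```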

Anyway, we managed to deploy successfully with one service per app.

The only issue I noticed was some junk UDP packets received by our low-level systems. Nothing worrying, though; maybe it’s just UDP being UDP.

It would be good for us, though, to understand how things work under the hood, if any of the devs can pitch in, mainly to confirm our observations.

As for using multiple machines: it would have improved the whole architecture (as it’s not trivial), but it’s no big deal.

Multiple machines with different ports in the same app should work fine, assuming the right config. I haven’t looked at UDP recently, so I don’t recall off the top of my head what the right config is, or how our UDP catalog service handles multiple machines; I’ll take a bit of time next week to work that out and get back to you.

One underlying implementation detail that might be useful, or at least interesting:
For TCP services, we run a Rust proxy service that terminates connections and creates new ones to machines.
For UDP services, there is no such service; instead, packets are forwarded directly to machines by the kernel. We use some XDP (eBPF) code to tell the kernel where to send packets, and a “catalog” service to configure this code.
I believe the reconfiguration happens when a service is updated, rather than on an interval.
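
To give a rough picture of what “XDP code plus a catalog map” can look like in general, here’s an illustrative sketch. To be clear, this is not our actual code; the map layout and all the names are invented for the example:

```c
// Illustrative XDP sketch: look up a UDP packet's destination port in a
// BPF map (the "catalog") to decide whether the kernel should deliver it.
// NOT Fly's actual code; map layout and names are invented.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __u16);    /* UDP destination port */
    __type(value, __u32);  /* opaque backend id, maintained by a userspace catalog service */
    __uint(max_entries, 1024);
} udp_catalog SEC(".maps");

SEC("xdp")
int udp_router(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)(ip + 1);  /* assumes no IP options */
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    __u16 port = bpf_ntohs(udp->dest);
    __u32 *backend = bpf_map_lookup_elem(&udp_catalog, &port);
    if (!backend)
        return XDP_DROP;  /* no catalog entry: nobody to deliver to */

    /* A real forwarder would rewrite headers and XDP_REDIRECT to the
     * chosen backend; for the sketch, just let the packet through. */
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```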

Thanks for the detailed explanation; the XDP/eBPF architecture makes a lot of sense for UDP’s stateless nature.

I will run a few more tests based on this new knowledge.

Really appreciate you taking the time to investigate this.
