Websocket network hiccups

Hi everyone!

I’m hosting my game project on fly.io, and everything uses WebSockets for now. It is a very time-critical, server-authoritative game, running on Bun instead of Node. I’ve managed to run it on a shared 4x machine, and it works pretty well on that.

I’ve had a “problem” for quite a while now, where I get REALLY random hiccups in the data transfer. Most of the play testers don’t notice anything, but I do, since I’ve been staring at the game for well over a year now :laughing:

Here is an example:

The client gets a random hiccup of almost 400ms from the server. The payload is pretty large in this case, as there were 100 enemies following the player, but this happens with smaller payloads as well. This experiment was done on a performance 1x machine to make sure there are enough resources, and I’ve also checked that the server is not exceeding its frame budget. I use Bun’s WebSocket implementation (based on uWebSockets), if that matters. I have another server acting as a gateway, proxying requests to the actual game server. The actual game servers are dynamically created machines.
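To put numbers on hiccups like this one, a small client-side gap detector can help. The following is only a minimal sketch: `findGaps`, the `Gap` type and the 150ms threshold are hypothetical names and values for illustration, not part of Bun or Fly. It assumes the client records an arrival time for every server message and flags any inter-arrival gap above the threshold.

```typescript
// Hypothetical helper: none of these names come from Bun, uWebSockets or Fly.
interface Gap {
  index: number; // index of the message that arrived late
  gapMs: number; // time elapsed since the previous message
}

// Returns every inter-arrival gap larger than `thresholdMs`.
function findGaps(arrivalTimesMs: number[], thresholdMs: number): Gap[] {
  const gaps: Gap[] = [];
  for (let i = 1; i < arrivalTimesMs.length; i++) {
    const gapMs = arrivalTimesMs[i] - arrivalTimesMs[i - 1];
    if (gapMs > thresholdMs) {
      gaps.push({ index: i, gapMs });
    }
  }
  return gaps;
}

// Example: messages every 50ms with one 400ms stall (made-up numbers).
const arrivals = [0, 50, 100, 500, 550];
console.log(findGaps(arrivals, 150)); // one gap: index 3, gapMs 400
```

In a real client the arrival times could be collected by pushing `performance.now()` into an array inside the WebSocket `onmessage` handler. Logging the gaps together with the payload size makes it easier to see whether the stalls correlate with large updates.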

So what I would like to make sure is that this is NOT related to fly in some way! I’ve spent a lot of hours already trying to understand what could be the problem on my end. If you guys have any ideas how to make sure this is NOT a fly problem, I would greatly appreciate it! Thank you :smiling_face_with_three_hearts:


Bump to prevent thread from closing (I hope this could be turned off)

Shared machines can be throttled. If your CPU load reaches 6.25% per vCPU (4 x 6.25% = 25% for a 4x instance), Fly will start throttling the app after a burst of temporary forgiveness. This may explain the occasional delays you see. Throttling is not linear - it’s not like you suddenly have a slower CPU, it’s more like your app starts to crawl. One test is to temporarily use a 1x-2x performance CPU - if the delays go away, there is your answer.

UPDATE: I saw that you already tested on a performance CPU. I suggest trying 2 performance CPU cores to be 100% sure it’s not related to CPU scheduling.


One thing to note here is that TCP was never designed for real-time communication, and this kind of hiccup could just be TCP deciding to buffer for longer than it should, or a random packet loss that causes a retransmission. Unless this happens all the time, I’m not sure I have a great suggestion here… Other than maybe checking whether you have any packet loss to us by running mtr around the time these hiccups happen. The “proper” solution is HTTP/3 and WebTransport, but that is a big someday :trade_mark:

Also, since there are multiple hops in your app, one thing you could try is to log these latencies at each hop and compare them with what you see on the client side. If this is internal to Fly, then it should show up there before data is piped at the last hop.
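A rough sketch of what per-hop logging could look like is below. Everything in it is an assumption made for illustration: the `Envelope` shape, the hop names and the helper functions are hypothetical, and it assumes the gateway can parse and re-serialize messages. Because the clocks on different machines are not synchronized, the sketch never compares timestamps across hops. It only compares the spacing between two consecutive messages as seen by each hop's own clock; the first hop where that spacing blows up is where the stall was introduced.

```typescript
// Hypothetical message envelope; not an API of Bun or Fly.
interface Stamp {
  hop: string; // e.g. "server", "gateway", "client"
  t: number;   // local clock reading at that hop, in ms
}

interface Envelope<T> {
  seq: number;
  stamps: Stamp[];
  payload: T;
}

// Each hop calls this with its own local clock before forwarding.
function stamp<T>(env: Envelope<T>, hop: string, nowMs: number): Envelope<T> {
  return { ...env, stamps: [...env.stamps, { hop, t: nowMs }] };
}

// Walks the hops in order and returns the first hop at which two consecutive
// messages were further apart than `thresholdMs`, or null if none was.
function firstSlowHop(
  prev: Envelope<unknown>,
  next: Envelope<unknown>,
  thresholdMs: number,
): string | null {
  for (const s of next.stamps) {
    const p = prev.stamps.find((x) => x.hop === s.hop);
    if (p && s.t - p.t > thresholdMs) {
      return s.hop;
    }
  }
  return null;
}

// Example with made-up clock values: the server sent 50ms apart,
// but the gateway saw the two messages 400ms apart.
let a: Envelope<string> = { seq: 1, stamps: [], payload: "tick" };
a = stamp(stamp(stamp(a, "server", 1000), "gateway", 5000), "client", 90000);
let b: Envelope<string> = { seq: 2, stamps: [], payload: "tick" };
b = stamp(stamp(stamp(b, "server", 1050), "gateway", 5400), "client", 90400);

console.log(firstSlowHop(a, b, 150)); // "gateway"
```

In this made-up example the result points at the leg between the game server and the gateway, since the server's own spacing was normal. If only the client stamp showed the large gap, the delay would have happened after the gateway forwarded the data.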


I think I will try this later on! But Grafana says that I should be OK CPU-wise even with 4x shared CPUs.

Yep, that is what I’ve been thinking it might be. It is still a bit annoying to live with “it might be this” without any proof that it actually is.

This too. I could do WebRTC (I did this already in another project on Fly), but adding WebRTC or trying to hack WebTransport to work somehow adds another level of complexity to the code… I’m not willing to spend time on that if I’m not 10000% sure the fault is with WebSockets.

What I’ve been experimenting with is to have my movement navigation thingy simulated on both the client and the server and to keep them semi-synced, since full sync is not really that important in my game. So far the results are pretty promising, and this way I could keep the WebSockets intact, since small hiccups wouldn’t affect the game negatively.
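For readers wondering what such a semi-sync could look like, here is a minimal sketch of one common approach. It is not the actual implementation from my game: `reconcile`, `Vec2`, the blend factor and the snap distance are all hypothetical. The client keeps simulating locally, and whenever a server position arrives it nudges the local position a fraction of the way toward it, snapping only when the two have drifted too far apart.

```typescript
interface Vec2 {
  x: number;
  y: number;
}

// Moves `local` toward `server` by `blend` (0..1) of the remaining error.
// If the error is larger than `snapDistance`, jump straight to the server
// position instead of blending.
function reconcile(local: Vec2, server: Vec2, blend: number, snapDistance: number): Vec2 {
  const dx = server.x - local.x;
  const dy = server.y - local.y;
  if (Math.hypot(dx, dy) > snapDistance) {
    return { x: server.x, y: server.y };
  }
  return { x: local.x + dx * blend, y: local.y + dy * blend };
}

// Example with made-up values: a small error is blended, a large one snaps.
console.log(reconcile({ x: 0, y: 0 }, { x: 8, y: 0 }, 0.25, 50));   // { x: 2, y: 0 }
console.log(reconcile({ x: 0, y: 0 }, { x: 100, y: 0 }, 0.25, 50)); // { x: 100, y: 0 }
```

With something like this, a late server update only means the correction arrives a bit later, because the local simulation keeps entities moving in the meantime.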

I also made some visualizations for server locations and entity targets, so I can visually see any hiccups and also see whether the server load has an effect on those hiccups or not.
