How many long-lived websocket connections can I have?

I’m getting ready to beta-test a service that relies on many long-lived websocket connections. “Long-lived” might mean hours or even days. It’s okay if they disconnect, since they’ll reconnect; I’d just like to avoid an expensive reconnection every minute.
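Since clients will reconnect on their own, one common precaution at the 10k-100k scale is to spread reconnects out with exponential backoff plus random jitter. That way a proxy restart or deploy does not make every client reconnect in the same second. The following is a minimal Python sketch under that assumption. `reconnect_delay`, its parameters, and the 1 s base and 60 s cap are hypothetical choices, not anything Fly prescribes.

```python
import random
from typing import Optional

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    Exponential backoff with "full jitter": the delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], so clients dropped at the same
    moment spread their reconnects out instead of arriving together.
    Hypothetical helper; base/cap defaults are illustrative assumptions.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    if rng is None:
        rng = random.Random()
    # Clamp the exponent so 2**attempt cannot overflow a float.
    ceiling = min(cap, base * 2 ** min(attempt, 32))
    return rng.uniform(0.0, ceiling)
```

A client would sleep for `reconnect_delay(attempt)` before each retry. It would reset `attempt` to 0 once a connection has stayed up for a while.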

During testing we’ll likely have fewer than a hundred at a time, but with our user base it could reach 10k-100k simultaneous connections.

I want to be a good citizen and not blow things up, so:

  1. How many simultaneous connections can a single Firecracker VM support?
  2. Is it okay to add a sub-minute heartbeat to keep the websocket from closing? Or is there a way to configure Fly not to close connections after a minute? I’d rather avoid a heartbeat: it seems like a waste of data, and I don’t want mobile devices or Fly to be constantly sending and receiving heartbeats.

I’ve read a few other topics, such as “Long-lived TCP connections are dropped” and “Is it possible to increase the timeout to 120 sec” (post #4 by ignoramous), which make me think a heartbeat is the best way to keep connections alive. But I haven’t seen any indication of how many open connections are too many.

After reading “Metrics on Fly.io” in the Fly Docs and then looking at the File Descriptors chart in Grafana, I believe I can use up to about 20K connections (well, file descriptors, which may not map 1:1 to connections).
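To see how closely the file-descriptor count tracks connections, a process on Linux can inspect `/proc/self/fd`. Each socket appears there as a `socket:[inode]` link, and every other open file, pipe, or event-loop handle also takes a descriptor. The following is a minimal sketch assuming a Linux VM. `fd_usage` is a hypothetical helper name.

```python
import os
from typing import Optional, Tuple

def fd_usage() -> Optional[Tuple[int, int]]:
    """Return (open_fds, open_sockets) for this process, or None off Linux.

    Hypothetical helper. Each entry in /proc/self/fd is a symlink; sockets
    read as 'socket:[<inode>]'. Files, pipes, epoll handles, etc. also use
    descriptors, which is why fd count and connection count can differ.
    """
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return None
    total = sockets = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor listdir used for the directory is already closed.
            continue
        total += 1
        if target.startswith("socket:"):
            sockets += 1
    return total, sockets
```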

If this isn’t right, I’d rather be told now than exceed a limit after going live :slight_smile: Yes, I’m asking for permission rather than forgiveness.

20k is a reasonable number of connections per VM. Sending a heartbeat every ~50 seconds is fine; that’s what I’d recommend to keep connections open.
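If the websocket library in use has a built-in ping/keepalive option, that is usually the simplest route. Otherwise, a heartbeat can be a small background task like the following asyncio sketch. `send_ping` is a placeholder (an assumption) for whatever call your library uses to send a ping frame. The 50-second default follows the ~50 s recommendation above. A WebSocket ping control frame is tiny: its payload is capped at 125 bytes and is often empty, so each heartbeat costs little data.

```python
import asyncio
from typing import Awaitable, Callable

async def heartbeat(send_ping: Callable[[], Awaitable[None]],
                    stop: asyncio.Event,
                    interval: float = 50.0) -> int:
    """Call send_ping() every `interval` seconds until `stop` is set.

    `send_ping` is a hypothetical stand-in for your library's ping call.
    Returns the number of pings sent.
    """
    sent = 0
    while not stop.is_set():
        try:
            # Sleep up to `interval`, but wake immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            # No shutdown requested during the interval: send a ping.
            await send_ping()
            sent += 1
    return sent
```

A server would typically start one such task per connection with `asyncio.create_task(...)` and set the event when the socket closes.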


We recommend scaling horizontally instead of trying to shove too many connections on a single VM. Keeping it under ~30K is probably a good idea.

The number each VM can support depends largely on your application and the size of the VM (CPU and RAM).
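The following rough sizing sketch uses the per-VM figures from this thread: about 20k as a target, and staying under ~30k. The VM count is the peak connection count divided by a per-VM budget, rounded up. The 20% headroom (spare capacity for a failed VM or a reconnect burst) is an assumption, not a Fly recommendation. Real per-VM capacity depends on CPU, RAM, and the app itself. `vms_needed` is a hypothetical helper.

```python
def vms_needed(peak_connections: int, per_vm_target: int = 20_000,
               headroom_pct: int = 20) -> int:
    """Rough number of VMs for `peak_connections` simultaneous sockets.

    Hypothetical helper. Each VM is budgeted at per_vm_target minus
    headroom_pct percent spare capacity. Integer math keeps rounding exact.
    """
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    usable = per_vm_target * (100 - headroom_pct) // 100
    if usable <= 0:
        raise ValueError("per-VM budget must be positive")
    # Ceiling division: -(-a // b) rounds up for positive b.
    return max(0, -(-peak_connections // usable))
```

With these defaults, 100k connections would need 7 VMs (100,000 / 16,000 = 6.25, rounded up). With no headroom, it would need 5.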

That metric comes “from within” the VM. You have root access to your VM and can raise the maximum number of open file descriptors (ulimit and so on).
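The limit can also be raised from inside the process. The following sketch uses Python's standard `resource` module, which exposes the same RLIMIT_NOFILE setting that `ulimit -n` controls in a shell. An unprivileged process can raise its soft limit only up to the hard limit; raising the hard limit requires root. `raise_nofile_limit` is a hypothetical helper, and the 65536 target in the usage note is an arbitrary example.

```python
import resource
from typing import Tuple

def raise_nofile_limit(target: int) -> Tuple[int, int]:
    """Raise this process's soft RLIMIT_NOFILE toward `target`.

    Hypothetical helper. The soft limit is clamped to the hard limit, since
    only a privileged process may exceed it. Returns the (soft, hard)
    limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

For example, `soft, hard = raise_nofile_limit(65536)` at startup, before accepting connections.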


Alright, thank you! I’ll aim for 20k for now.

The app isn’t yet built to scale horizontally (it uses a single SQLite database). Our current plan is exactly to “shove too many connections on a single VM” and see how far one VM can go :slight_smile: I’m trying to delay horizontal scaling until LiteFS (distributed SQLite, described in the Fly Docs) is production-ready, because that will simplify the design.

I think what we really need to know from Fly is how many simultaneous connections the proxy can handle and will allow, for the organization as a whole, for the application, and per machine. It would be good to explicitly document such limits. That’s one thing I appreciate about AWS documentation. Thanks.