Websocket sharding?

Hello. I’m developing a Discord bot and, though I don’t necessarily need to scale horizontally yet, I’m thinking about how to do so. Discord bots work by opening a websocket connection to the gateway, and to shard you provide a [shard_id, num_shards] tuple along with the identify request. Currently my best idea for doing this easily on Fly without some sort of supervisor process is process groups: keep a static number of shards and assign shard IDs manually through each group’s execution command.
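Concretely, I’m picturing something like this in fly.toml (just a sketch; the app name, start command, and flag names are placeholders):

```toml
app = "my-discord-bot"  # placeholder app name

# One process group per shard, with the assignment baked into each command.
# Scaling up would mean editing this file and redeploying.
[processes]
  shard0 = "python bot.py --shard-id 0 --shard-count 2"
  shard1 = "python bot.py --shard-id 1 --shard-count 2"
```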

Is there some easier way? Unfortunately, shard IDs apparently have to be sequential integers from 0 to num_shards - 1 rather than arbitrary IDs like UUIDs, otherwise the runtime env vars that Fly provides would be sufficient.

A more complicated architecture seems like it would involve Redis and gracefully shutting down and removing your shard ID from the cache, or maybe setting a TTL and having a background process continually re-insert the current shard ID, but these architectures seem complicated for where I’m at.
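Roughly what I have in mind for that version (a sketch using redis-py; key names and timings are arbitrary): each process tries to claim the first free shard ID with a set-if-not-exists plus TTL, and a background loop keeps the claim alive.

```python
import time

import redis

r = redis.Redis()
SHARD_COUNT = 2
TTL = 15  # seconds; arbitrary

def claim_shard():
    """Grab the first unclaimed shard ID by setting its key only if it doesn't exist."""
    while True:
        for shard_id in range(SHARD_COUNT):
            if r.set(f"shard:{shard_id}", "claimed", nx=True, ex=TTL):
                return shard_id
        time.sleep(1)  # everything is claimed; wait for a TTL to lapse

def heartbeat(shard_id):
    """Background loop that refreshes the TTL so the claim doesn't expire mid-run."""
    while True:
        r.expire(f"shard:{shard_id}", TTL)
        time.sleep(TTL // 3)
```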


:wave: I used to run a large Discord bot, and my solution for horizontal sharding across processes was to set environment variables on each individual process with a manually assigned (cluster_id, cluster_count, shard_count) combination. I then used redis to handle identify ratelimiting (the “you can connect one shard per 5 seconds” limit), with a “set if not exists” command and a TTL.
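The identify gate is small enough that a sketch shows the whole idea (redis-py; the key name is made up): whoever wins the set-if-not-exists gets to identify, and the 5-second TTL releases the lock on its own.

```python
import time

import redis

r = redis.Redis()

def wait_for_identify_slot():
    """Block until we win the one-identify-per-5-seconds lock.

    SET with nx=True only succeeds if the key is absent; the ex=5 TTL
    releases the lock automatically, so no explicit unlock is needed.
    """
    while not r.set("identify_lock", "1", nx=True, ex=5):
        time.sleep(0.5)
```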

On Fly, you can use the Machines API or flyctl machine commands to create individually managed machines in an app, although unfortunately it’s not yet possible to update machines created that way with fly deploy.
Process groups would work better if you want to use fly deploy or manage your clusters in a configuration file rather than creating machines ad-hoc.
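On the bot side, each process group’s command just feeds its assignment into the client. A minimal sketch, assuming discord.py and flag names matching your fly.toml idea:

```python
import argparse
import os

import discord  # discord.py

parser = argparse.ArgumentParser()
parser.add_argument("--shard-id", type=int, required=True)
parser.add_argument("--shard-count", type=int, required=True)
args = parser.parse_args()

# Each process group runs exactly one shard out of the fixed total.
client = discord.Client(
    intents=discord.Intents.default(),
    shard_id=args.shard_id,
    shard_count=args.shard_count,
)

@client.event
async def on_ready():
    print(f"shard {args.shard_id}/{args.shard_count} ready as {client.user}")

client.run(os.environ["DISCORD_TOKEN"])
```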

You can use a TTL / SETNX for this, similarly to my identify ratelimiting point above.
At one point, I was considering running multiple clusters with the same set of shards, where only one cluster would actually run the shards at any given time and would periodically (every ~5 seconds) update redis with the shards it’s currently managing. If the other cluster instance detects that those shards are no longer being updated, it would connect to them itself, so the first cluster instance could be rebooted for zero-downtime updates.
(Conveniently, Discord allows you to reconnect a shard even if the previous process never disconnected from it. It magically just starts giving you events on the new shard instance and not giving you events on the old one.)
I never did manage to find time to actually build this, unfortunately, but it should work pretty well.
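If anyone does build it, the core loop is pretty small. A sketch with redis-py, where the key name and timings are invented:

```python
import time

import redis

r = redis.Redis()
HEARTBEAT_KEY = "cluster:0:heartbeat"  # one key per shard group; name is made up
INTERVAL = 5      # how often the active cluster refreshes the heartbeat
STALE_AFTER = 15  # standby takes over once the key has expired

def run_active(start_shards):
    """Cluster that currently owns the shards: keep the heartbeat key alive."""
    start_shards()
    while True:
        r.set(HEARTBEAT_KEY, str(time.time()), ex=STALE_AFTER)
        time.sleep(INTERVAL)

def run_standby(start_shards):
    """Standby cluster: wait for the heartbeat to disappear, then take over.

    Discord lets the new process identify the same shards even though the old
    one never cleanly disconnected, so takeover is just "start connecting".
    """
    while r.exists(HEARTBEAT_KEY):
        time.sleep(INTERVAL)
    run_active(start_shards)
```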


Awesome to hear that I’m at least on the right path. I’ll probably set up something simple with process groups at first and then move to a Redis-based solution if I ever need to scale more dynamically.

Thanks for the great info!
