Websocket sharding?

Hello. I’m developing a Discord bot and, though I don’t necessarily need to scale horizontally yet, I’m thinking about how to do so. Discord bots work by opening a websocket connection to the gateway, and to shard you provide a [shard_id, num_shards] tuple along with the identify request. Currently my best idea for doing this easily on Fly without some sort of supervisor process is process groups: keep a static number of shards and assign shard IDs manually through each group’s execution command.
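Concretely, I’m picturing something like this in fly.toml (just a sketch; the app name, start command, and flag names are placeholders):

```toml
app = "my-discord-bot"  # placeholder app name

# One process group per shard, with the assignment baked into each command.
# Scaling up would mean editing this file and redeploying.
[processes]
  shard0 = "python bot.py --shard-id 0 --shard-count 2"
  shard1 = "python bot.py --shard-id 1 --shard-count 2"
```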

Is there some easier way? Unfortunately, shard IDs apparently have to be sequential integers from 0 to num_shards - 1 rather than arbitrary IDs like UUIDs, otherwise the runtime env vars that Fly provides would be sufficient.

A more complicated architecture seems like it would involve Redis and gracefully shutting down and removing your shard ID from the cache, or maybe setting a TTL and having a background process continually re-insert the current shard ID, but these architectures seem complicated for where I’m at.
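Roughly what I have in mind for that version (a sketch using redis-py; key names and timings are arbitrary): each process tries to claim the first free shard ID with a set-if-not-exists plus TTL, and a background loop keeps the claim alive.

```python
import time

import redis

r = redis.Redis()
SHARD_COUNT = 2
TTL = 15  # seconds; arbitrary

def claim_shard():
    """Grab the first unclaimed shard ID by setting its key only if it doesn't exist."""
    while True:
        for shard_id in range(SHARD_COUNT):
            if r.set(f"shard:{shard_id}", "claimed", nx=True, ex=TTL):
                return shard_id
        time.sleep(1)  # everything is claimed; wait for a TTL to lapse

def heartbeat(shard_id):
    """Background loop that refreshes the TTL so the claim doesn't expire mid-run."""
    while True:
        r.expire(f"shard:{shard_id}", TTL)
        time.sleep(TTL // 3)
```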


:wave: I used to run a large Discord bot, and my solution for horizontal sharding across processes was to set environment variables on each individual process with a manually assigned (cluster_id, cluster_count, shard_count) combination. I then used redis to handle identify ratelimiting (the “you can connect one shard per 5 seconds” limit), with a “set if not exists” command and a TTL.
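The identify gate is small enough that a sketch shows the whole idea (redis-py; the key name is made up): whoever wins the set-if-not-exists gets to identify, and the 5-second TTL releases the lock on its own.

```python
import time

import redis

r = redis.Redis()

def wait_for_identify_slot():
    """Block until we win the one-identify-per-5-seconds lock.

    SET with nx=True only succeeds if the key is absent; the ex=5 TTL
    releases the lock automatically, so no explicit unlock is needed.
    """
    while not r.set("identify_lock", "1", nx=True, ex=5):
        time.sleep(0.5)
```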

On Fly, you can use the Machines API or flyctl machine commands to create individually managed machines in an app, although unfortunately it’s not yet possible to update machines created that way with fly deploy.
Process groups would work better if you want to use fly deploy or manage your clusters in a configuration file rather than creating machines ad-hoc.
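On the bot side, each process group’s command just feeds its assignment into the client. A minimal sketch, assuming discord.py and flag names matching your fly.toml idea:

```python
import argparse
import os

import discord  # discord.py

parser = argparse.ArgumentParser()
parser.add_argument("--shard-id", type=int, required=True)
parser.add_argument("--shard-count", type=int, required=True)
args = parser.parse_args()

# Each process group runs exactly one shard out of the fixed total.
client = discord.Client(
    intents=discord.Intents.default(),
    shard_id=args.shard_id,
    shard_count=args.shard_count,
)

@client.event
async def on_ready():
    print(f"shard {args.shard_id}/{args.shard_count} ready as {client.user}")

client.run(os.environ["DISCORD_TOKEN"])
```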

You can use a TTL / SETNX for this, similarly to my identify ratelimiting point above.
At one point, I was considering running multiple clusters with the same set of shards, where only one cluster would actually run the shards at any given time and would periodically (every ~5 seconds) update redis with the shards it’s currently managing. If the other cluster instance detects that those shards are no longer being updated, it would connect to them itself, so the first cluster instance could be rebooted for zero-downtime updates.
(Conveniently, Discord allows you to reconnect a shard even if the previous process never disconnected from it. It magically just starts giving you events on the new shard instance and not giving you events on the old one.)
I never did manage to find time to actually build this, unfortunately, but it should work pretty well.
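If anyone does build it, the core loop is pretty small. A sketch with redis-py, where the key name and timings are invented:

```python
import time

import redis

r = redis.Redis()
HEARTBEAT_KEY = "cluster:0:heartbeat"  # one key per shard group; name is made up
INTERVAL = 5      # how often the active cluster refreshes the heartbeat
STALE_AFTER = 15  # standby takes over once the key has expired

def run_active(start_shards):
    """Cluster that currently owns the shards: keep the heartbeat key alive."""
    start_shards()
    while True:
        r.set(HEARTBEAT_KEY, str(time.time()), ex=STALE_AFTER)
        time.sleep(INTERVAL)

def run_standby(start_shards):
    """Standby cluster: wait for the heartbeat to disappear, then take over.

    Discord lets the new process identify the same shards even though the old
    one never cleanly disconnected, so takeover is just "start connecting".
    """
    while r.exists(HEARTBEAT_KEY):
        time.sleep(INTERVAL)
    run_active(start_shards)
```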


Awesome to hear that I’m at least on the right path. I’ll probably set up something simple with process groups at first and then move to a Redis-based solution if I ever need to scale more dynamically.

Thanks for the great info!
