How to set up an HA Redis cluster on Fly

Hey everyone,

We are looking for a better setup for running Redis on Fly. I have heard there may be a managed Redis product, similar to Fly Postgres, coming soon, but I figured this topic could either help expedite that product or at least help inform its development.

Right now we have a single Redis app deployed with a single volume, a scale count of 1, and only one region in the pool.

As you can see, this could lead to bad situations if there were ever an issue with the volume, VM runtime, networking, etc. in that one region.

I should add that we are using Redis for a job queue in this particular case (BullMQ), so we would need to ensure there is no drift in the job queues, but we would like to know that if a region or VM goes down, it does not take down our apps.

At the moment, if there are any issues with our single Redis app, it not only stops accepting jobs onto the queue but also causes problems for the other VMs, which try to connect and fail, leading apps to reboot and potentially fail outright.

Looking forward to hearing or learning together how we can achieve HA Redis on Fly.

There’s a sample app for setting up Redis with a replica here: GitHub - fly-apps/redis-geo-cache: A global Redis cache

This does not automatically promote the replica to primary if the primary goes down, but it does do streaming replication, so it's possible to switch over manually if there's a problem.
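A manual switchover with plain Redis replication would look roughly like this (hostnames are placeholders; `REPLICAOF` is the Redis 5+ command, older versions use `SLAVEOF`):

```shell
# 1. Promote the replica to primary:
redis-cli -h replica.internal REPLICAOF NO ONE

# 2. When the old primary comes back, point it at the new primary
#    so it rejoins as a replica instead of accepting writes:
redis-cli -h old-primary.internal REPLICAOF replica.internal 6379
```

You'd also need to repoint your apps at the new primary, which is the part a manager component would automate.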

A full HA setup would add a manager-like component to switch instances to/from being primary/replica and route requests to them, similar to what Stolon and HAProxy do in GitHub - fly-apps/postgres-ha: Postgres + Stolon for HA clusters as Fly apps.

I think Redis Sentinel does everything we'd want, but I haven't looked into it yet: Redis Sentinel Documentation – Redis
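For reference, a minimal Sentinel configuration sketch (name, address, and thresholds are illustrative): you'd run three or more Sentinel processes, each with a config like this, and they'd agree on failover once `quorum` of them see the primary as down.

```
# sentinel.conf (illustrative values)
# monitor <name> <primary-ip> <port> <quorum>
sentinel monitor jobqueue 10.0.0.1 6379 2
sentinel down-after-milliseconds jobqueue 5000
sentinel failover-timeout jobqueue 60000
sentinel parallel-syncs jobqueue 1
```

Clients then ask Sentinel for the current primary's address rather than hardcoding it.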


Redis HA is pretty complicated. They're working on a better model, but an actual HA Redis setup right now requires a minimum of 5 VMs. Probably extreme for a job queue.

KeyDB is simpler since it can do master-master replication, but it’s also not ideal for a job queue (for the same reason).

The best thing to do here might be to build some resilience in at the app level. If the app can keep running when the job queue server goes away, you can survive outages. Single server outages are rare. They happen, and you should count on them happening, but if your app can tolerate Redis going away for a few hours it might be the simplest solution.
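One way to sketch that app-level resilience (function and variable names here are hypothetical, not BullMQ APIs): wrap the enqueue call in a bounded retry so a Redis outage degrades gracefully instead of crashing the VM.

```typescript
// Retry a queue operation a few times, then give up gracefully instead of
// letting the exception take down the app server or worker.
type Task<T> = () => T;

function enqueueWithRetry<T>(task: Task<T>, maxAttempts = 3): T | undefined {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return task(); // in a real app: queue.add(name, data) via BullMQ
    } catch (err) {
      if (attempt === maxAttempts) {
        // Redis is down: log and degrade instead of crashing.
        console.error(`enqueue failed after ${maxAttempts} attempts`, err);
        return undefined;
      }
      // A production version would also sleep with exponential backoff here.
    }
  }
  return undefined;
}
```

A caller can then treat `undefined` as "queue unavailable" and fall back to whatever degraded behavior makes sense for that endpoint.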


That makes a lot of sense. I sort of figured KeyDB would cause issues with duplicate jobs being run, or the reverse (jobs being missed).

I will look into handling the connection better so that it doesn’t take down the app servers and workers when Redis goes down.

The other interesting part would be having a secondary method for keeping track of jobs that need to be scheduled once Redis comes back up. I wonder if it makes sense to store them in a more resilient DB like Postgres or Mongo while Redis is down, then have a cron check that DB for jobs that failed because of the outage and reschedule them once Redis is back up.
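That fallback idea can be sketched like this (all names hypothetical; an in-memory array stands in for the Postgres/Mongo table): failed enqueues are parked in the durable store, and a cron-driven drain pushes them back into the queue once Redis is reachable.

```typescript
interface Job { id: string; payload: unknown; }

class JobFallback {
  private backlog: Job[] = []; // stands in for a durable Postgres/Mongo table

  // `enqueue` is the normal Redis/BullMQ enqueue; it throws when Redis is down.
  constructor(private enqueue: (job: Job) => void) {}

  submit(job: Job): void {
    try {
      this.enqueue(job); // normal path: straight onto the Redis queue
    } catch {
      this.backlog.push(job); // Redis down: park the job durably instead
    }
  }

  // Run from a cron; moves parked jobs back onto the queue, stopping
  // early if Redis is still down. Returns how many jobs were moved.
  drain(): number {
    let moved = 0;
    while (this.backlog.length > 0) {
      try {
        this.enqueue(this.backlog[0]);
        this.backlog.shift();
        moved++;
      } catch {
        break; // still down; retry on the next cron tick
      }
    }
    return moved;
  }

  pending(): number { return this.backlog.length; }
}
```

The nice property is that job submission never fails outright: at worst a job waits in the fallback table until the next successful drain.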