I’m running a single-instance Redis as a job queue (one app, no [[services]], accessed over .internal 6PN). It must have exactly one instance running at a time — two simultaneous instances would split-brain the queue. I’ve set up a standby machine for host-failure failover (volume-less, service-less process group, fly scale count 2, standby shows the † in fly status).
The docs and fly status note describe when the standby starts (“only in case of host hardware failure”), but I can’t find anything about what happens afterward, when the failed host recovers. Could you confirm the exact behavior?
When the original machine’s host comes back online, does the original machine automatically restart (since its desired state was started)?
If so, do I then have two machines running simultaneously (original + the standby that took over)?
Does the standby automatically stop / hand back when the original recovers, or does it stay running until I intervene?
If two can run at once, what’s the recommended way to enforce single-instance — is there any built-in mechanism, or do I need to handle it myself (e.g. monitor for >1 started machine and stop the extra)?
Context: app is dev-redis-queue, region arn, two machines (one active, one † standby), no volumes, no services. I want to be sure a host recovery can’t leave me with two live Redis instances.
Hi… I’m curious about this, too, but, realistically, I think I’d go ahead and implement the “handle it [your]self” countermeasure regardless of what the answer is.
The standby feature is basically impossible for us ordinary users to test, which to me means that its behavior is never fully nailed down. At least, not well enough to rely on when the penalty for guessing or interpreting incorrectly is data corruption.
(My understanding is that standby status was intended more for Rails-style heavyweight worker Machines, where the queue was enforced elsewhere and the only penalty for having two running simultaneously was extra cost.)
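For what it's worth, here is a minimal sketch of the kind of "monitor for >1 started machine and stop the extra" countermeasure mentioned in the question. It shells out to `fly machine list --json` and `fly machine stop`. The JSON field names (`id`, `state`) and the rule for choosing the survivor are assumptions of mine, not documented platform behavior, so check them against real output first.

```python
import json
import subprocess

APP = "dev-redis-queue"  # app name taken from the question


def extras_to_stop(machines, keep_id=None):
    """Return the IDs of started machines that should be stopped.

    Assumption: each entry is a dict with "id" and "state" keys.
    keep_id names the machine to keep; if it is missing or not started,
    the lowest ID is kept, which is only a placeholder policy.
    """
    started = sorted(m["id"] for m in machines if m.get("state") == "started")
    if len(started) < 2:
        return []
    keep = keep_id if keep_id in started else started[0]
    return [machine_id for machine_id in started if machine_id != keep]


def stop_extras(keep_id=None):
    """List the app's machines with flyctl and stop all but one started machine."""
    listing = subprocess.run(
        ["fly", "machine", "list", "--app", APP, "--json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for machine_id in extras_to_stop(json.loads(listing), keep_id):
        subprocess.run(["fly", "machine", "stop", machine_id, "--app", APP], check=True)
```

Calling `stop_extras()` on a schedule only shortens the window in which two instances run; it cannot close it. Which machine to keep is a real decision too: with no volumes, the instance that has been serving since the failover is presumably the one holding the current queue.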
LiteFS uses Consul leases, which are quasi-convenient for this kind of thing, since the Fly.io platform includes a multi-tenant(?) freebie cluster. (It still requires some scripting on your part, as you were already prepared for.)
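To give a rough idea of what that scripting involves, here is a sketch of taking a lease through Consul's HTTP API: create a session with a TTL, then try to acquire a key with it. The `base_url`, `token`, and `key` values are placeholders. How the Fly.io-provided cluster exposes its address, token, and permitted key prefix is not covered here, and this is not how LiteFS itself is implemented.

```python
import json
import urllib.request


def consul_put(url, body, token):
    """PUT an optional JSON body to Consul and return the decoded JSON reply."""
    data = None if body is None else json.dumps(body).encode()
    req = urllib.request.Request(
        url, data=data, method="PUT", headers={"X-Consul-Token": token}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())


def try_acquire(base_url, key, holder, token="", ttl="30s", put=consul_put):
    """Return a session ID if this holder now owns the lock key, else None.

    `put` is injectable so the logic can be exercised without a Consul server.
    """
    # A session with a TTL expires unless it is renewed periodically.
    session = put(
        f"{base_url}/v1/session/create",
        {"Name": holder, "TTL": ttl, "Behavior": "release"},
        token,
    )["ID"]
    # The KV acquire call replies with JSON true or false.
    won = put(f"{base_url}/v1/kv/{key}?acquire={session}", {"holder": holder}, token)
    if won is True:
        return session
    # Lost the race: clean up the session that is no longer needed.
    put(f"{base_url}/v1/session/destroy/{session}", None, token)
    return None
```

The holder would start Redis only after `try_acquire` returns a session ID, keep renewing that session (`PUT /v1/session/renew/<id>`) while Redis runs, and stop Redis if a renewal fails. That renewal loop is left out of the sketch.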
Aside: As a general tip, it’s best to opt in to the Questions / Help category when you’re hoping that something will get a response from a Fly.io employee. That section of the forum has special status, as noted in the new sticky thread, whereas posts elsewhere are much more prone to falling through the cracks…
Aside2: I just added that category to this thread.
Given your “exactly one” requirements, your best bet would be to not use the standby machine (here’s how to remove it: https://fly.io/docs/apps/app-availability/#standby-machines-for-process-groups-without-services). Instead, for resiliency (which I bet is also important), start two Fly machines, both live and running Redis, and set Redis up in a replica configuration. I don’t know if replica support is still behind Enterprise on Redis proper, but if it is, you can use Valkey where replicas are available out of the box.
The standby Fly machine will start if the primary’s host becomes unreachable due to hardware or network failure. Crucially, it will not start if your machine itself becomes unavailable, e.g. because it got stopped or destroyed.
If the host then returns from the dead, the standby machine will stop by itself after a relatively long period (I think it’s one hour).
“If two can run at once, what’s the recommended way to enforce single-instance?”
You can always, for example, wrap Redis startup in some kind of shell script that ensures there’s no other instance running, but in general, this is handled at the application layer, not as a Fly platform feature.
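As an illustration of such a startup wrapper, here is a minimal sketch (in Python rather than shell). It looks up the app's `.internal` addresses, skips this machine's own address from `FLY_PRIVATE_IP`, and refuses to start Redis if any peer answers on the Redis port. The port number, the fail-open handling of DNS errors, and the bare `redis-server` command are assumptions to adapt.

```python
import ipaddress
import os
import socket

APP_HOST = "dev-redis-queue.internal"  # <app>.internal, app name from the question
REDIS_PORT = 6379                       # assumption: Redis on its default port


def peers(addresses, own_ip):
    """Return the distinct addresses that are not this machine's own."""
    own = ipaddress.ip_address(own_ip) if own_ip else None
    return sorted({a for a in addresses if ipaddress.ip_address(a) != own})


def redis_answers(host, port=REDIS_PORT, timeout=2.0):
    """True if anything at host:port sends back data after an inline PING."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return len(conn.recv(64)) > 0
    except OSError:
        return False


def guarded_start():
    """Exit if a peer Redis is reachable; otherwise replace this process with Redis."""
    try:
        infos = socket.getaddrinfo(
            APP_HOST, REDIS_PORT, socket.AF_INET6, socket.SOCK_STREAM
        )
    except socket.gaierror:
        infos = []  # assumption: a failed lookup means no peers (this fails open)
    own_ip = os.environ.get("FLY_PRIVATE_IP", "")
    for addr in peers([info[4][0] for info in infos], own_ip):
        if redis_answers(addr):
            raise SystemExit(f"another Redis answered at {addr}; refusing to start")
    os.execvp("redis-server", ["redis-server"])
```

A check like this is best-effort: two machines starting at the same moment can both see no peer, and a machine cut off by a network partition sees none either. It reduces the risk of two live instances but does not guarantee exactly one.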