I am running a lightly customized NATS super cluster. Plain NATS pub/sub works great, but JetStream persistence has a problem when instances go down and come back up. The current naming scheme is based on FLY_ALLOC_ID, so a restarted instance comes back with a different name, while JetStream keeps looking for the prior server names to resync.
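For context, the relevant knob is the server's `server_name`, which JetStream uses to identify peers. A minimal illustrative fragment (the env var name and store path are placeholders, not my actual config):

```
# nats-server.conf (illustrative fragment)
# server_name must stay stable across restarts, or JetStream will
# go looking for the old peer name when the node rejoins.
server_name: $NATS_SERVER_NAME
jetstream {
  store_dir: /data/jetstream
}
```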
I haven’t used Terraform, and it seems like overkill when what I really want, at this stage, is a registry to pull names from.
For now I think my simplest manual option is to run one node per region, group nearby regions into clusters, and set the cluster configuration by hand.
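A hand-set cluster configuration along those lines might look like this sketch (the app name, regions, and cluster name are placeholders; `<region>.<appname>.internal` is Fly's region-scoped internal DNS name):

```
# Illustrative fragment: one node in ord, clustered with nearby regions.
cluster {
  name: "nats-east"
  port: 6222
  routes = [
    nats-route://ord.myapp.internal:6222
    nats-route://ewr.myapp.internal:6222
    nats-route://yul.myapp.internal:6222
  ]
}
```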
Or perhaps something simple like Ben Johnson’s scale-to-zero Machines example: Ben Johnson: Scale to Zero with Fly.io Machines
A TXT nslookup on `vms.<appname>.internal` should get you a CSV of alloc-ids assigned to `<appname>` (docs). It may not always be current, but it should eventually converge to the actual state of the world.
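As a hedged sketch, assuming you're inside the Fly private network: the TXT record comes back as one comma-separated string, so you mostly just need to split it. The app name `myapp` and the sample payload below are placeholders.

```shell
# Split the CSV TXT payload into one alloc-id per line.
split_alloc_ids() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# On a Fly VM you could fetch the record with dig:
#   txt="$(dig +short txt vms.myapp.internal | tr -d '"')"
# For illustration, parse a sample payload instead:
split_alloc_ids "683d2b9f,77fa21c0,0e5f3a1d"
```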
Btw, one can assign volumes to VMs to make the `alloc-ids` stick across restarts / deploys, but it is a rather expensive way to do so. Also: Machine VMs tend to be fixed on a single IPv6 (not sure about `alloc-id`, though). More here: Fly-Instance-Id header alternative for websockets - #2 by ignoramous
Thanks much! I’ll take a look there.
@ignoramous is there any documentation about the sticky `alloc-ids`? That would work for me, as these nodes already have volumes assigned. It would make things much easier, and might make it easier for their system as well.
Using volumes to make `alloc-ids` stick was called anchor scaling (search for it in the forums, as the docs are gone; I believe it was dropped in favour of Machines, which are assigned sticky `alloc-ids` of a sort, though I can't say whether that is incidental or an actual feature).
Thanks for that @ignoramous. I see Is it possible to make scaling more deterministic? but will ping support on current and future options.
@ignoramous Anchor scaling is still the way to go.
@kurt It turns out NATS JetStream requires a consistent `server_name`. So at boot I check for (or create) a `server_name` file on the volume and use its contents for that property.
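A minimal sketch of that check-or-set step, assuming the volume is mounted at a path like `/data` (hypothetical) and a fresh name is derived from `FLY_REGION` plus a random suffix:

```shell
get_server_name() {
  file="$1/server_name"
  if [ ! -f "$file" ]; then
    # First boot on this volume: generate and persist a stable name.
    suffix="$(head -c4 /dev/urandom | od -An -tx1 | tr -d ' \n')"
    echo "nats-${FLY_REGION:-dev}-${suffix}" > "$file"
  fi
  cat "$file"
}

# Then start the server with it, e.g.:
#   nats-server --name "$(get_server_name /data)" -c nats-server.conf
```

Because the file lives on the volume, the same name comes back after every restart or redeploy, which is all JetStream needs to resync.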
What I don’t see yet is a true rolling deploy with the persisted volumes. With 9 nodes across 3 regions, the deploy began by stopping 5 of them, including all 3 in one region. I’ll start a new issue for that one.