Explicitly restarting my machine gets it stuck in an endless loop of this error:
cannot find primary, retrying: no primary
cannot become primary, local node has no cluster ID and "consul" lease already initialized with cluster ID ...
This happens with
$ fly machine restart ...
and
$ fly deploy
It doesn’t happen when Fly suspends and restarts the machine, so how do I restart my machines in the same way as the suspension mechanism?
You can update the fly console key (the easier option), but that didn’t work for me. This option did work the other day, though. There are a few other scenarios on that page (both above and below), so the one above may also apply.
Are you saving your LiteFS data directory to a volume?
We are in the process of reworking the cluster ID because folks have hit some pain points with it. We’re going to change it to be user-defined instead of generated to avoid situations like this. You can find the relevant issue here: User-Generated Cluster ID · Issue #393 · superfly/litefs · GitHub
The goal of the cluster ID is to prevent two distinct clusters from accidentally connecting to one another, which would cause data conflicts because each cluster has its own distinct set of databases. With a user-defined ID, you would set the value to something like $FLY_APP_NAME, so it is no longer autogenerated by LiteFS and stored in /var/lib/litefs/clusterid (which can be wiped if the volume is lost).
Typically when we see the cluster ID issue, a cluster has connected to Consul and saved its ID there, so the Consul lease key can only be used by that first-connected LiteFS cluster. If someone then clears the volume, LiteFS regenerates the cluster ID, so the node appears to belong to a new, distinct cluster.
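To make that failure mode concrete, here is a toy model of the check described above. It is only a sketch under stated assumptions, not LiteFS's actual code: `ConsulLease`, `Node`, `try_become_primary`, and the ID format are all hypothetical names invented for illustration.

```python
import uuid


class ConsulLease:
    """Toy stand-in for the Consul lease key (hypothetical, not the Consul API)."""

    def __init__(self):
        self.cluster_id = None  # set by the first cluster that connects


class Node:
    """Toy LiteFS node; stored_cluster_id models the clusterid file on the volume."""

    def __init__(self, stored_cluster_id=None):
        self.cluster_id = stored_cluster_id


def try_become_primary(node, lease):
    """Return True if the node may use the lease, otherwise raise RuntimeError."""
    if lease.cluster_id is None:
        # First cluster to connect: generate an ID if needed and record it in the lease.
        if node.cluster_id is None:
            node.cluster_id = uuid.uuid4().hex  # made-up ID format
        lease.cluster_id = node.cluster_id
        return True
    if node.cluster_id is None:
        # The situation in this thread: the local ID was lost with the data directory.
        raise RuntimeError(
            "cannot become primary, local node has no cluster ID "
            "and lease already initialized"
        )
    if node.cluster_id != lease.cluster_id:
        raise RuntimeError("cluster ID mismatch")
    return True
```

In this model, a node that keeps its stored ID across restarts can always reacquire the lease, while a node whose ID was wiped is rejected until the lease key is changed or the original ID is restored.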
Can you post the litefs.yml you’re using and the [mounts] section of your fly.toml? Just want to narrow down the exact issue happening.
Also, are you using autostart/autostop with your machines?
The data directory is being wiped out when you restart your machine, which also clears the cluster ID. That’s why LiteFS autogenerates a new one and can’t connect to Consul.
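One rough way to check whether the data directory survives a restart is to see which mount point it lives on. The sketch below is a heuristic, not an official diagnostic: the helper names are hypothetical, and it assumes that a directory whose nearest mount point is the filesystem root is on the machine's ephemeral root filesystem rather than on a volume.

```python
import os


def find_mount_point(path):
    """Walk up from path until a mount point is reached and return it."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def data_dir_is_persistent(data_dir):
    """Heuristic: True if data_dir sits on a mount other than the root filesystem."""
    return find_mount_point(data_dir) != os.path.abspath(os.sep)
```

Running `data_dir_is_persistent("/var/lib/litefs")` inside the machine (the path mentioned earlier in this thread; substitute whatever your litefs.yml uses) and getting `False` would suggest the directory is not backed by a volume, so the cluster ID file would be lost on restart.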
We’re working on improving the setup for LiteFS so it’s not quite as complicated.