[LiteFS] Cannot Become Primary After Restart

Restarting my machine explicitly gets it stuck in an endless loop of this error:

cannot find primary, retrying: no primary
cannot become primary, local node has no cluster ID and "consul" lease already initialized with cluster ID ...

This happens with

$ fly machine restart ...

and

$ fly deploy

It doesn’t happen when Fly suspends and restarts the machine, so how do I restart my machines in the same way as the suspension mechanism?

I think the issue you have is found here -

You can update the Consul key (the easier option), but that didn’t work for me; the other option did the other day. There are a few other scenarios on that page (both above and below), so it may be the one above as well.
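For reference, the “update the key” approach just means pointing lease.consul.key at a fresh value in litefs.yml (the -v2 suffix below is made up), which makes LiteFS initialize a brand-new lease instead of reusing the old one:

lease:
  consul:
    key: 'litefs/${FLY_APP_NAME}-v2'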


I saw this, but both solutions seem suboptimal.

The first isn’t clean: it creates a new ID every time, and I assume I’d have to tell my replicas about the new ID.

The second is a one-off solution, where I have to SSH into my primary, tell Consul what’s up, and then restart the machine again.

I just want my machines to work automatically when they restart :smiley:

Are you saving your LiteFS data directory to a volume?

We are in the process of reworking the cluster ID because folks have hit some pain points with it. We’re going to change it to be user-defined instead of generated to avoid situations like this. You can find the relevant issue here: User-Generated Cluster ID · Issue #393 · superfly/litefs · GitHub


I don’t quite understand the issue here.

When my machine restarts after suspension, things are good.

When my machine restarts abruptly, it can’t become primary anymore.

When I change my cluster-ID and redeploy, it works again.

How is me manually changing the ID different from the user-defined ID solution?

Isn’t the problem related to Consul not allowing a new primary when one died?

–edit–

And yes, I have SQLite .db files on that volume that I wanted to connect with LiteFS.

The goal of the cluster ID is to prevent two distinct clusters from accidentally connecting to one another, which would cause data conflicts since each has its own distinct set of databases. The user-defined ID would let you set the value to something like $FLY_APP_NAME so it’s not autogenerated by LiteFS and stored in /var/lib/litefs/clusterid (which could be wiped if the volume is lost).

Typically when we see the cluster ID issue, it’s that a cluster connected to Consul and saved its ID, so the Consul lease key can only be used with the first-connected LiteFS cluster. Then if someone clears the volume, LiteFS regenerates the cluster ID, so it appears as a new, distinct cluster.
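If you want to see what your node thinks its cluster ID is, something like this should work (a quick sketch, assuming the default data directory):

$ fly ssh console -C "cat /var/lib/litefs/clusterid"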

Can you post the litefs.yml you’re using and the [mounts] section of your fly.toml? Just want to narrow down the exact issue happening.

Also, are you using autostart/autostop with your machines?

The litefs.yml:

fuse:
  dir: '/app/data/sqlite'

data:
  dir: '/var/lib/litefs'

exit-on-error: false

exec:
  - cmd: '/bin/sh /app/docker-entrypoint.sh'

lease:
  type: 'consul'
  advertise-url: 'http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202'
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: true

  consul:
    url: '${FLY_CONSUL_URL}'
    key: 'litefs/${FLY_APP_NAME}'

The mount:

[mounts]
  source="app_data"
  destination="/app/data"

The data.dir field is what should be on the persistent volume. Try changing your fly.toml mount to:

[mounts]
  source="app_data"
  destination="/var/lib/litefs"

The data directory is being wiped out when you restart your machine, which also clears out the cluster ID. That’s why LiteFS autogenerates a new one and can’t connect to Consul.
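If you want to confirm that, a quick check (a sketch, assuming df is available in your image) is to compare which filesystem each directory is on:

$ fly ssh console -C "df -h /var/lib/litefs /app/data"

If /var/lib/litefs is on the root filesystem while /app/data is on the volume, the LiteFS data directory (and with it the cluster ID) is ephemeral and goes away on restart.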

We’re working on improving the setup for LiteFS so it’s not quite as complicated.


I see!

So, what about my /app/data directory? I want to persist it too.

Should I simply mount the volume at / or create a symlink in the Dockerfile to my data directory?

You can change the LiteFS data directory to be inside the /app/data directory. e.g.

fuse:
  dir: '/app/data/sqlite'

data:
  dir: '/app/data/litefs'

I believe that the FUSE mount should still work fine even if it’s mounted inside of the volume mount.
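After redeploying with that change, a restart should be enough to verify it (the machine ID placeholder below is whatever fly machine list shows for your app):

$ fly deploy
$ fly machine restart <machine-id>
$ fly logs

If the data directory is persisting, the logs should show the node acquiring the primary lease again rather than looping on the “cannot become primary” error.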


That fixed it.

Now my restarts work as expected, and the LiteFS data directory lives on my persistent volume instead of the root filesystem.

