Trying to bring up a machine after deleting its volume

So I’ve gotten into a weird state (don’t ask why :sweat_smile:). Here’s basically what I did:

# stop the machine
fly scale count 0
# delete the machine's volume
fly vol delete vol_id-of-the-machine
# create a new volume
fly vol create ...
# start up the machine again
fly scale count 1

This doesn’t work. I’m now getting:

fly scale count 1 --app epic-stack-template 
App 'epic-stack-template' is going to be scaled according to this plan:
  +1 machines for group 'app' on region 'sjc' with size ''
  Using 1 existing volumes for group 'app' in region 'sjc'
? Scale app epic-stack-template? Yes
Executing scale plan
  Using unattached volume vol_8l524yjpxm147zmp
Error: failed to launch VM: invalid_argument: could not claim volume for machine: volume already deleted, unable to use

Here’s my fly vol list output:

$ fly vol list --app epic-stack-template
ID                      STATE           NAME    SIZE    REGION  ZONE    ENCRYPTED       ATTACHED VM    CREATED AT    
vol_8l524yjpxm147zmp    pending_destroy data    1GB     sjc     514a    true                           2 months ago 
vol_nv8e1mxx7dzj5nlv    created         data    1GB     sjc     e57e    true                           8 minutes ago

I’m not sure where to go from here.

It looks like the app is still trying to attach the old, deleted volume. I’m not sure whether the machine was destroyed before the volume was, but generally, to destroy a volume you have to destroy the machine it’s attached to first.

To fix this, you might need to redeploy without any volumes attached before running the volume creation step. Can you try removing the mounts section of your fly.toml and running a redeploy? And once that works, try running the volumes create step again.
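
Roughly, the sequence would look like this (a sketch; it assumes your volume is named data and uses the app/region from your output above, so adjust as needed):

# 1. Comment out the [mounts] section in fly.toml, then redeploy without a volume
fly deploy --app epic-stack-template
# 2. Create a fresh volume (the name must match what [mounts] references)
fly volumes create data --region sjc --size 1 --app epic-stack-template
# 3. Restore the [mounts] section in fly.toml and deploy once more
fly deploy --app epic-stack-template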

There’s been some investigation internally, and we found that this happened as a result of flyctl switching to the Machines API for volume operations sometime last week. For now, the fix is either to wait 24 hours for the pending_destroy volume to be completely removed, or to downgrade flyctl to v0.1.71 or earlier and run the commands above again.
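
If you go the downgrade route, the install script accepts a version argument (a sketch; adjust the version string as needed):

# Install a specific flyctl version (v0.1.71 or earlier)
curl -L https://fly.io/install.sh | sh -s 0.1.71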

As for the status of this bug, a fix is in progress. Once it’s merged and CI passes, it will go out in the next flyctl release.

flyctl v0.1.78 is now available and contains a fix for this. If you continue to have any issues with this after upgrading, please let us know!


Thanks for the quick update!

I’ve tried scaling up again and now I’m getting a new error in my logs that may interest @benbjohnson:

2023-08-15T19:12:36.036 app[e2865106ce1148] sjc [info] level=INFO msg="cannot find primary, retrying: no primary"

2023-08-15T19:12:37.107 app[e2865106ce1148] sjc [info] level=INFO msg="cannot become primary, local node has no cluster ID and \"consul\" lease already initialized"

2023-08-15T19:12:37.176 app[e2865106ce1148] sjc [info] level=INFO msg="cannot find primary, retrying: no primary"

2023-08-15T19:12:38.248 app[e2865106ce1148] sjc [info] level=INFO msg="cannot become primary, local node has no cluster ID and \"consul\" lease already initialized"

Over and over again. I’ve verified my litefs config has candidate: ${FLY_REGION == PRIMARY_REGION} and the PRIMARY_REGION is set to sjc so I’m not sure what’s up :man_shrugging:
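
(For anyone debugging the same thing, here’s one way to double-check what the machine actually sees; this is a sketch and assumes the app name from this thread:)

# Print the region variables as the running machine sees them
fly ssh console --app epic-stack-template -C "printenv FLY_REGION PRIMARY_REGION"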

Hey Kent, this error occurs because you had a cluster before, and when you deleted its volume the node lost its cluster ID. The cluster ID is meant to uniquely identify your cluster so you don’t accidentally point two separate clusters at the same Consul key and wipe one out.

We have some instructions on how to clear the cluster ID on Consul here: Disaster Recovery from LiteFS Cloud · Fly Docs

The easiest approach, though, is to just change the Consul key you’re using for your cluster. You can change that in your litefs.yml file via the lease.consul.key field. The trickier way is to install the Consul CLI and remove the key yourself.

We will provide an easier way to reset this in the future.

Got it. Can I change it to something else, and then change it back?

I’m getting these logs now:

2023-08-17T06:02:04.216 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s

2023-08-17T06:02:04.290 app[e2865106ce1148] sjc [info] level=INFO msg="exiting streaming backup"

2023-08-17T06:02:04.290 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""

If you ssh into the node, you can update the cluster ID by writing to the clusterid file in the data directory and then restarting the node:

$ echo LFSC42DFE36BEF76EA1E > /var/lib/litefs/clusterid
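
You can also bounce the machine from the flyctl side (a sketch; the machine ID is taken from your log lines above):

# Restart the machine so LiteFS re-reads the clusterid file on startup
fly machine restart e2865106ce1148 --app epic-stack-template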

Also, I PR’d a change to the Consul error you saw above so that it actually lists the original cluster ID: Report existing consul cluster ID by benbjohnson · Pull Request #389 · superfly/litefs · GitHub

Sadly, doing this did not seem to work. It’s still not happy.

What error messages are you seeing now? Also, is /var/lib/litefs your data directory? Can you share your litefs.yml?

I tried running this in both the litefs virtual directory and the data directory:

root@e2865106ce1148:/myapp# echo LFSC42DFE36BEF76EA1E > /litefs/clusterid
root@e2865106ce1148:/myapp# echo LFSC42DFE36BEF76EA1E > /data/litefs/clusterid

Didn’t make any difference. My output is still just:

2023-08-18T06:12:17.574 app[e2865106ce1148] sjc [info] GET /resources/healthcheck 200 - - 1.174 ms
2023-08-18T06:12:17.835 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2023-08-18T06:12:17.908 app[e2865106ce1148] sjc [info] level=INFO msg="exiting streaming backup"
2023-08-18T06:12:17.908 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""

Not sure what to do at this point.

@kentcdodds Did you restart the machine after you updated the /data/litefs/clusterid?

Also, I just pushed up a change to log out the cluster ID on startup to help debug the issue. Can you try deploying with the flyio/litefs:pr-390 image in your Dockerfile and let me know what the startup logs look like?

I tried restarting the machine and now I’m getting:

2023-08-18T13:59:45.874 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"

I tried updating it to that ID, and that didn’t have any impact. I’m deploying the version that logs the cluster ID right now, but I’m not sure what to do once that’s done.

Here’s the startup logs with that PR image:

2023-08-18T14:01:46.760 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:47.826 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:48.891 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:49.956 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:50.330 runner[4d891262b69138] sjc [info] Successfully prepared image registry.fly.io/epic-stack-template:deployment-01H84F2VZ5YRD7MQZ2BTBKQV6T (16.545660209s)
2023-08-18T14:01:51.022 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:52.093 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:52.403 runner[4d891262b69138] sjc [info] Configuring firecracker
2023-08-18T14:01:52.548 app[4d891262b69138] sjc [info] [ 0.040926] PCI: Fatal: No config space access function found
2023-08-18T14:01:52.771 app[4d891262b69138] sjc [info] INFO Starting init (commit: b437b5b)...
2023-08-18T14:01:52.792 app[4d891262b69138] sjc [info] INFO Preparing to run: `docker-entrypoint.sh node ./other/sentry-create-release` as root
2023-08-18T14:01:52.798 app[4d891262b69138] sjc [info] INFO [fly api proxy] listening at /.fly/api
2023-08-18T14:01:52.804 app[4d891262b69138] sjc [info] 2023/08/18 14:01:52 listening on [fdaa:0:23df:a7b:16f:11c0:e3f0:2]:22 (DNS: [fdaa::3]:53)
2023-08-18T14:01:52.974 app[4d891262b69138] sjc [info] Missing Sentry environment variables, skipping sourcemap upload.
2023-08-18T14:01:53.158 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:53.799 app[4d891262b69138] sjc [info] INFO Main child exited normally with code: 0
2023-08-18T14:01:53.799 app[4d891262b69138] sjc [info] INFO Starting clean up.
2023-08-18T14:01:53.800 app[4d891262b69138] sjc [info] WARN hallpass exited, pid: 256, status: signal: 15 (SIGTERM)
2023-08-18T14:01:53.805 app[4d891262b69138] sjc [info] 2023/08/18 14:01:53 listening on [fdaa:0:23df:a7b:16f:11c0:e3f0:2]:22 (DNS: [fdaa::3]:53)
2023-08-18T14:01:54.224 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:54.801 app[4d891262b69138] sjc [info] [ 2.293838] reboot: Restarting system
2023-08-18T14:01:54.930 runner[4d891262b69138] sjc [info] machine restart policy set to 'no', not restarting
2023-08-18T14:01:55.292 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:56.357 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:57.427 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:58.493 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:01:58.749 runner[e2865106ce1148] sjc [info] Pulling container image registry.fly.io/epic-stack-template:deployment-01H84F2VZ5YRD7MQZ2BTBKQV6T
2023-08-18T14:01:59.559 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:00.624 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:01.780 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:02.851 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:03.917 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:04.983 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:06.054 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:07.120 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:08.191 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:09.263 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:09.477 runner[e2865106ce1148] sjc [info] Successfully prepared image registry.fly.io/epic-stack-template:deployment-01H84F2VZ5YRD7MQZ2BTBKQV6T (10.727923533s)
2023-08-18T14:02:09.496 runner[e2865106ce1148] sjc [info] Setting up volume 'data'
2023-08-18T14:02:09.496 runner[e2865106ce1148] sjc [info] Opening encrypted volume
2023-08-18T14:02:09.791 runner[e2865106ce1148] sjc [info] Configuring firecracker
2023-08-18T14:02:09.875 app[e2865106ce1148] sjc [info] INFO Sending signal SIGINT to main child process w/ PID 263
2023-08-18T14:02:10.329 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:11.457 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:12.522 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:13.588 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:14.659 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:14.895 app[e2865106ce1148] sjc [info] INFO Sending signal SIGTERM to main child process w/ PID 263
2023-08-18T14:02:15.724 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:16.789 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:17.855 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:18.928 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-18T14:02:20.025 app[e2865106ce1148] sjc [warn] Virtual machine exited abruptly
2023-08-18T14:02:20.279 app[e2865106ce1148] sjc [info] [ 0.040777] PCI: Fatal: No config space access function found
2023-08-18T14:02:20.503 app[e2865106ce1148] sjc [info] INFO Starting init (commit: b437b5b)...
2023-08-18T14:02:20.524 app[e2865106ce1148] sjc [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2023-08-18T14:02:20.528 app[e2865106ce1148] sjc [info] INFO Resized /data to 1056964608 bytes
2023-08-18T14:02:20.528 app[e2865106ce1148] sjc [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount` as root
2023-08-18T14:02:20.536 app[e2865106ce1148] sjc [info] INFO [fly api proxy] listening at /.fly/api
2023-08-18T14:02:20.543 app[e2865106ce1148] sjc [info] 2023/08/18 14:02:20 listening on [fdaa:0:23df:a7b:182:22a2:372:2]:22 (DNS: [fdaa::3]:53)
2023-08-18T14:02:20.576 app[e2865106ce1148] sjc [info] config file read from /etc/litefs.yml
2023-08-18T14:02:20.576 app[e2865106ce1148] sjc [info] LiteFS 390/merge, commit=997b1ed1a234e910bcdda5c1cabeaea2aaf52721
2023-08-18T14:02:20.576 app[e2865106ce1148] sjc [info] level=INFO msg="host environment detected" type=fly.io
2023-08-18T14:02:20.576 app[e2865106ce1148] sjc [info] level=INFO msg="litefs cloud backup client configured: https://litefs.fly.io"
2023-08-18T14:02:20.577 app[e2865106ce1148] sjc [info] level=INFO msg="Using Consul to determine primary"
2023-08-18T14:02:20.694 app[e2865106ce1148] sjc [info] level=INFO msg="initializing consul: key=epic-litefs/epic-stack-template url=https://:7c3bf8a4-0b8a-aacc-5554-b55bf42b32e2@consul-iad-6.fly-shared.net/epic-stack-template-36dyqmnd44j1gwme/ hostname=e2865106ce1148 advertise-url=http://e2865106ce1148.vm.epic-stack-template.internal:20202"
2023-08-18T14:02:20.696 app[e2865106ce1148] sjc [info] level=INFO msg="wal-sync: no wal file exists on \"cache.db\", skipping sync with ltx"
2023-08-18T14:02:20.698 app[e2865106ce1148] sjc [info] level=INFO msg="wal-sync: no wal file exists on \"sqlite.db\", skipping sync with ltx"
2023-08-18T14:02:20.700 app[e2865106ce1148] sjc [info] level=INFO msg="using existing cluster id: \"LFSC9CCF5315F24FB8D6\""
2023-08-18T14:02:20.705 app[e2865106ce1148] sjc [info] level=INFO msg="LiteFS mounted to: /litefs/data"
2023-08-18T14:02:20.705 app[e2865106ce1148] sjc [info] level=INFO msg="http server listening on: http://localhost:20202"
2023-08-18T14:02:20.705 app[e2865106ce1148] sjc [info] level=INFO msg="waiting to connect to cluster"
2023-08-18T14:02:20.980 app[e2865106ce1148] sjc [info] level=INFO msg="0705F5F9CC594D7A: primary lease acquired, advertising as http://e2865106ce1148.vm.epic-stack-template.internal:20202"
2023-08-18T14:02:21.046 app[e2865106ce1148] sjc [info] level=INFO msg="connected to cluster, ready"
2023-08-18T14:02:21.046 app[e2865106ce1148] sjc [info] level=INFO msg="node is a candidate, automatically promoting to primary"
2023-08-18T14:02:21.046 app[e2865106ce1148] sjc [info] level=INFO msg="node is already primary, skipping promotion"
2023-08-18T14:02:21.046 app[e2865106ce1148] sjc [info] level=INFO msg="proxy server listening on: http://localhost:8080"
2023-08-18T14:02:21.046 app[e2865106ce1148] sjc [info] level=INFO msg="executing command: node [./other/setup-swap.js]"
2023-08-18T14:02:21.047 app[e2865106ce1148] sjc [info] level=INFO msg="begin primary backup stream: url=https://litefs.fly.io"
2023-08-18T14:02:21.272 app[e2865106ce1148] sjc [info] setting up swapfile...
2023-08-18T14:02:21.312 app[e2865106ce1148] sjc [info] swapfile setup complete
2023-08-18T14:02:21.316 app[e2865106ce1148] sjc [info] level=INFO msg="executing command: npx [prisma migrate deploy]"
2023-08-18T14:02:22.048 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2023-08-18T14:02:22.212 app[e2865106ce1148] sjc [info] level=INFO msg="exiting streaming backup"
2023-08-18T14:02:22.212 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""
2023-08-18T14:02:22.213 health[e2865106ce1148] sjc [info] Health check on port 8080 is now passing.
2023-08-18T14:02:22.607 app[e2865106ce1148] sjc [info] Prisma schema loaded from prisma/schema.prisma
2023-08-18T14:02:22.623 app[e2865106ce1148] sjc [info] Datasource "db": SQLite database "sqlite.db" at "file:/litefs/data/sqlite.db"
2023-08-18T14:02:22.693 app[e2865106ce1148] sjc [info] 1 migration found in prisma/migrations
2023-08-18T14:02:22.701 app[e2865106ce1148] sjc [info] No pending migrations to apply.
2023-08-18T14:02:22.734 app[e2865106ce1148] sjc [info] npm notice
2023-08-18T14:02:22.734 app[e2865106ce1148] sjc [info] npm notice New minor version of npm available! 9.6.7 -> 9.8.1
2023-08-18T14:02:22.734 app[e2865106ce1148] sjc [info] npm notice Changelog: <https://github.com/npm/cli/releases/tag/v9.8.1>
2023-08-18T14:02:22.734 app[e2865106ce1148] sjc [info] npm notice Run `npm install -g npm@9.8.1` to update!
2023-08-18T14:02:22.735 app[e2865106ce1148] sjc [info] npm notice
2023-08-18T14:02:22.745 app[e2865106ce1148] sjc [info] level=INFO msg="starting background subprocess: npm [start]"
2023-08-18T14:02:22.748 app[e2865106ce1148] sjc [info] waiting for signal or subprocess to exit
2023-08-18T14:02:23.047 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2023-08-18T14:02:23.121 app[e2865106ce1148] sjc [info] level=INFO msg="exiting streaming backup"
2023-08-18T14:02:23.121 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""
2023-08-18T14:02:23.157 app[e2865106ce1148] sjc [info] > start
2023-08-18T14:02:23.157 app[e2865106ce1148] sjc [info] > cross-env NODE_ENV=production node .
2023-08-18T14:02:24.047 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2023-08-18T14:02:24.120 app[e2865106ce1148] sjc [info] level=INFO msg="exiting streaming backup"
2023-08-18T14:02:24.120 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""
2023-08-18T14:02:24.474 app[e2865106ce1148] sjc [info] 🚀 We have liftoff!
2023-08-18T14:02:24.474 app[e2865106ce1148] sjc [info] Local: http://localhost:8081
2023-08-18T14:02:24.474 app[e2865106ce1148] sjc [info] On Your Network: http://172.19.0.2:8081
2023-08-18T14:02:24.474 app[e2865106ce1148] sjc [info] Press Ctrl+C to stop
2023-08-18T14:02:25.048 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2023-08-18T14:02:25.120 app[e2865106ce1148] sjc [info] level=INFO msg="exiting streaming backup"
2023-08-18T14:02:25.120 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""
2023-08-18T14:02:26.047 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2023-08-18T14:02:26.121 app[e2865106ce1148] sjc [info] level=INFO msg="exiting streaming backup"
2023-08-18T14:02:26.121 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""
2023-08-18T14:02:27.047 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2023-08-18T14:02:27.119 app[e2865106ce1148] sjc [info] level=INFO msg="exiting streaming backup"
2023-08-18T14:02:27.119 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""
2023-08-18T14:02:28.047 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2023-08-18T14:02:28.120 app[e2865106ce1148] sjc [info] level=INFO msg="exiting streaming backup"
2023-08-18T14:02:28.120 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""

I tried setting the clusterid:

root@e2865106ce1148:/myapp# cat /data/litefs/clusterid 
LFSC42DFE36BEF76EA1E

And then I rebooted, and the startup logs still said there was a mismatch:

2023-08-18T14:03:58.121 app[e2865106ce1148] sjc [info] level=INFO msg="backup stream failed, retrying: fetch position map: backup client error (422): cluster id mismatch, expecting \"LFSC42DFE36BEF76EA1E\""

:man_shrugging:

I added a "litefs set-cluster-id" command to help make this easier. Can you try deploying with PR build flyio/litefs:pr-391 and running this command on your node:

$ litefs set-cluster-id LFSC42DFE36BEF76EA1E 
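
If you’d rather not open an interactive shell, something along these lines should work too (a sketch, assuming the pr-391 build is what’s running on the machine):

# Run the new command over SSH without an interactive session
fly ssh console --app epic-stack-template -C "litefs set-cluster-id LFSC42DFE36BEF76EA1E"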

Sorry for the trouble with this. The cluster ID is meant to prevent a user from accidentally connecting two separate clusters and wiping out the data on one of the clusters, however, it’s kind of a pain in the butt right now. :grimacing:

That seemed to do the right thing, but now I’m running into a new problem: it looks like LiteFS is trying to restore from a backup, and the database in that backup doesn’t match the new database I’m trying to set up:

2023-08-19T05:17:29.349 app[e2865106ce1148] sjc [info] level=INFO msg="cannot connect, \"consul\" lease already initialized with different ID: LFSC9CCF5315F24FB8D6"
2023-08-19T05:17:30.198 app[e2865106ce1148] sjc [info] level=INFO msg="cluster id manually updated to \"LFSC42DFE36BEF76EA1E\""
2023-08-19T05:17:30.686 app[e2865106ce1148] sjc [info] level=INFO msg="0705F5F9CC594D7A: primary lease acquired, advertising as http://e2865106ce1148.vm.epic-stack-template.internal:20202"
2023-08-19T05:17:30.758 app[e2865106ce1148] sjc [info] level=INFO msg="connected to cluster, ready"
2023-08-19T05:17:30.758 app[e2865106ce1148] sjc [info] level=INFO msg="node is a candidate, automatically promoting to primary"
2023-08-19T05:17:30.758 app[e2865106ce1148] sjc [info] level=INFO msg="node is already primary, skipping promotion"
2023-08-19T05:17:30.758 app[e2865106ce1148] sjc [info] level=INFO msg="proxy server listening on: http://localhost:8080"
2023-08-19T05:17:30.758 app[e2865106ce1148] sjc [info] level=INFO msg="executing command: node [./other/setup-swap.js]"
2023-08-19T05:17:30.758 app[e2865106ce1148] sjc [info] level=INFO msg="begin primary backup stream: url=https://litefs.fly.io"
2023-08-19T05:17:30.977 app[e2865106ce1148] sjc [info] setting up swapfile...
2023-08-19T05:17:31.015 app[e2865106ce1148] sjc [info] swapfile setup complete
2023-08-19T05:17:31.019 app[e2865106ce1148] sjc [info] level=INFO msg="executing command: npx [prisma migrate deploy]"
2023-08-19T05:17:31.759 app[e2865106ce1148] sjc [info] level=INFO msg="begin streaming backup" full-sync-interval=10s
2023-08-19T05:17:31.920 app[e2865106ce1148] sjc [info] level=WARN msg="restoring from backup" name=sqlite.db pos.local=0000000000000024/a85d8ea707c61f7a pos.remote=00000000000000e6/8698e410d420d2fc reason=remote-ahead
2023-08-19T05:17:31.920 app[e2865106ce1148] sjc [info] level=INFO msg="backup of database \"sqlite.db\" is ahead of local copy, restoring from backup"
2023-08-19T05:17:32.351 app[e2865106ce1148] sjc [info] Prisma schema loaded from prisma/schema.prisma
2023-08-19T05:17:32.372 app[e2865106ce1148] sjc [info] Datasource "db": SQLite database "sqlite.db" at "file:/litefs/data/sqlite.db"
2023-08-19T05:17:32.480 app[e2865106ce1148] sjc [info] 1 migration found in prisma/migrations
2023-08-19T05:17:32.636 app[e2865106ce1148] sjc [info] level=INFO msg="snapshot received for \"sqlite.db\", removing other ltx files except: 0000000000000001-00000000000000e6.ltx.tmp"
2023-08-19T05:17:32.655 app[e2865106ce1148] sjc [info] level=WARN msg="database restore complete" name=sqlite.db pos=00000000000000e6/8698e410d420d2fc elapsed=660.29322ms
2023-08-19T05:17:32.674 app[e2865106ce1148] sjc [info] Applying migration `20230811194017_init`
2023-08-19T05:17:32.678 app[e2865106ce1148] sjc [info] Error: P3018
2023-08-19T05:17:32.678 app[e2865106ce1148] sjc [info] A migration failed to apply. New migrations cannot be applied before the error is recovered from. Read more about how to resolve migration issues in a production database: https://pris.ly/d/migrate-resolve
2023-08-19T05:17:32.678 app[e2865106ce1148] sjc [info] Migration name: 20230811194017_init
2023-08-19T05:17:32.678 app[e2865106ce1148] sjc [info] Database error code: 1
2023-08-19T05:17:32.678 app[e2865106ce1148] sjc [info] Database error:
2023-08-19T05:17:32.678 app[e2865106ce1148] sjc [info] table "User" already exists in -- CreateTable
... more sql errors

It may be useful to mention that I was deleting the volume in the first place to completely erase the database’s data and tables and start fresh. That was the original goal here. I know that’s not something people normally do, but for my template I need things to almost always look like they were just deployed for the first time.

LiteFS Cloud aims to be the authority on the state of the database, so when you spin up a brand-new instance it will automatically restore to the most recent version. You could delete and recreate the cluster in LiteFS Cloud, and then it’ll be brand new.

Yeah, so clearly this is not the way that things should be done if I ever want to blow away the database and start fresh :sweat_smile: I’ll have to think about it a little more next time…

I’ve deleted the LiteFS Cloud cluster and env var on this app now (it’s not necessary anyway and I only had it to test the feature out).

I also deleted the volume and tried to deploy without a mounted volume, but that didn’t end up working and I didn’t snag the logs for why, sorry. So I created a new volume, deployed with the mount config again, and now I’m getting:

2023-08-22T22:45:54.181 app[6e82d2eeb63d08] sjc [info] level=INFO msg="cannot find primary, retrying: no primary"
2023-08-22T22:45:55.247 app[6e82d2eeb63d08] sjc [info] level=INFO msg="cannot become primary, local node has no cluster ID and \"consul\" lease already initialized"

I tried running fly consul attach just in case but that didn’t do anything. I’m stuck again.

I would just give up on this app and create a new one, but I really don’t want to change the app ID :upside_down_face:

We’ve been discussing the cluster ID on the team over the last couple of days, since it’s caused a lot of confusion and wonky issues. I think we’re going to revamp it and go with an optional, user-specified cluster ID instead of generating one in LiteFS. That way you’ll have the ID (since you generated it), so you won’t lose it if you delete your volume. The cluster ID could even be set to "$FLY_APP_NAME" just to keep things simple.

I should be able to get that working pretty soon. I’ll let you know when it’s ready.