I’m trying to migrate the app kcd. It’s a Node.js app running with LiteFS.
Current status: Fly appears to be sending all traffic to a non-migrated version of my app, but I’m also seeing logs from a migrated instance that’s hitting errors.
I tried a straightforward migration and got an error that I should have saved to share with you, but I didn’t. It was something about not having enough room on the volume for the migration, and it gave me three options to try. I allowed it to auto-rollback, then decided to scale down to a single instance (from the 8 I was running) and delete all volumes except the one still attached. One of the volumes was stubborn and I kept getting:
~/code/kentcdodds.com (main) 🏎
$ fly vol delete vol_2en7r1pyg0zrk6yx -y
Error: failed destroying volume: You hit a Fly API error with request ID: 01H45W7YT6HYCRZ7DW87H9EDT1-sjc
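For reference, the clean-up steps looked roughly like this (reconstructed from memory, with a placeholder volume ID, so the exact invocations may differ a bit):
$ fly scale count 1 -a kcd        # scale down from 8 VMs to 1 before migrating
$ fly volumes list -a kcd         # check which volume is still attached to the remaining VM
$ fly vol delete <volume-id> -y   # delete each detached volume (the step that kept failing for one of them)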
So I just decided to proceed anyway.
That’s when things got kinda weird. Here are all the logs from the migration attempt:
/app/bin $ fly vol delete vol_2en7r1pyg0zrk6yx -y
Error: failed destroying volume: You hit a Fly API error with request ID: 01H45WACAK6G0Y2A5MNWE44X0N-lax
/app/bin $ fly config save -a kcd
Wrote config file fly.toml
/app/bin $ fly migrate-to-v2
This migration process will do the following, in order:
* Lock your application, preventing changes during the migration
* Remove legacy VMs
* Remove 1 alloc
* NOTE: Because your app uses volumes, there will be a short downtime during migration while your machines start up.
* Create clones of each volume in use, for the new machines
* These cloned volumes will have the suffix '_machines' appended to their names
* Please note that your old volumes will not be removed.
(you can do this manually, after making sure the migration was a success)
* Create machines, copying the configuration of each existing VM
* Create 1 "app" machine
* Set the application platform version to "machines"
* Unlock your application
* Overwrite the config file at '/app/bin/fly.toml'
? Would you like to continue? Yes
==> Migrating kcd to the V2 platform
> Locking app to prevent changes during the migration
> Making snapshots of volumes for the new machines
> Scaling down to zero VMs. This will cause temporary downtime until new VMs come up.
> Enabling machine creation on app
> Creating an app release to register this migration
> Starting machines
INFO Using wait timeout: 5m0s lease timeout: 13s delay between lease refreshes: 4s
Updating existing machines in 'kcd' with rolling strategy
[1/1] Waiting for 5683777f75098e [app] to become healthy: 1/2
failed while migrating: timeout reached waiting for healthchecks to pass for machine 5683777f75098e failed to get VM 5683777f75098e: Get "https://api.machines.dev/v1/apps/kcd/machines/5683777f75098e": net/http: request canceled
note: you can change this timeout with the --wait-timeout flag
? Would you like to enter interactive troubleshooting mode? If not, the migration will be rolled back.
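Based on that note, I assume a retry could give the health checks more breathing room with something like:
$ fly migrate-to-v2 --wait-timeout 10m
though I doubt a longer timeout alone will help if the machine genuinely can’t pass its checks.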
Here are the relevant logs I had during the migration:
2023-06-30T10:08:20Z app[9449075a] den [info]GET kentcdodds.com/blog/the-state-reducer-pattern 200 - 210.467 ms
2023-06-30T10:08:20Z app[9449075a] den [info]POST kentcdodds.com/__metronome 204 - 33.540 ms
2023-06-30T10:08:20Z app[9449075a] den [info]POST kentcdodds.com/__metronome 204 - 27.760 ms
2023-06-30T10:08:21Z runner[9449075a] den [info]Shutting down virtual machine
2023-06-30T10:08:21Z app[9449075a] den [info] INFO Sending signal SIGINT to main child process w/ PID 246
2023-06-30T10:08:21Z app[9449075a] den [info]sending signal to exec process
2023-06-30T10:08:21Z app[9449075a] den [info]waiting for exec process to close
2023-06-30T10:08:21Z app[9449075a] den [info]signal received, litefs shutting down
2023-06-30T10:08:21Z app[9449075a] den [info]6515368117173208084CEBF0: exiting primary, destroying lease
2023-06-30T10:08:21Z app[9449075a] den [info]ERROR: exit status 1: fusermount3: failed to unmount /litefs: Device or resource busy
2023-06-30T10:08:21Z app[9449075a] den [info] INFO Main child exited normally with code: 1
2023-06-30T10:08:21Z app[9449075a] den [info] INFO Starting clean up.
2023-06-30T10:08:21Z app[9449075a] den [info] INFO Umounting /dev/vdc from /data
2023-06-30T10:08:21Z app[9449075a] den [info] WARN hallpass exited, pid: 247, status: signal: 15 (SIGTERM)
2023-06-30T10:08:24Z proxy[9449075a] den [error]timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)
2023-06-30T10:08:25Z proxy[9449075a] den [error]timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)
2023-06-30T10:08:31Z runner[5683777f75098e] den [info]Pulling container image registry.fly.io/kcd@sha256:23a2cefb72eb42bb9d822b74b1479402daf439cbe906593b398be04301f74882
2023-06-30T10:09:05Z runner[5683777f75098e] den [info]Successfully prepared image registry.fly.io/kcd@sha256:23a2cefb72eb42bb9d822b74b1479402daf439cbe906593b398be04301f74882 (33.979861304s)
2023-06-30T10:09:05Z runner[5683777f75098e] den [info]Setting up volume 'data_machines'
2023-06-30T10:09:05Z runner[5683777f75098e] den [info]Opening encrypted volume
2023-06-30T10:09:14Z runner[5683777f75098e] den [info]Configuring firecracker
2023-06-30T10:09:14Z app[5683777f75098e] den [info] INFO Starting init (commit: db101a53)...
2023-06-30T10:09:14Z app[5683777f75098e] den [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2023-06-30T10:09:15Z app[5683777f75098e] den [info] INFO Resized /data to 3217031168 bytes
2023-06-30T10:09:15Z app[5683777f75098e] den [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount -- node ./other/start.js` as root
2023-06-30T10:09:15Z app[5683777f75098e] den [info] INFO [fly api proxy] listening at /.fly/api
2023-06-30T10:09:15Z app[5683777f75098e] den [info]2023/06/30 10:09:15 listening on [fdaa:0:23df:a7b:d828:18fd:6e90:2]:22 (DNS: [fdaa::3]:53)
2023-06-30T10:09:15Z health[5683777f75098e] den [warn]Health check on port 8080 is in a 'warning' state. Your app may not be responding properly. Services exposed on ports [80, 443] may have intermittent failures until the health check passes.
2023-06-30T10:09:15Z health[5683777f75098e] den [warn]Health check on port 8080 is in a 'warning' state. Your app may not be responding properly. Services exposed on ports [80, 443] may have intermittent failures until the health check passes.
2023-06-30T10:09:15Z app[5683777f75098e] den [info]Using Consul to determine primary
2023-06-30T10:09:15Z app[5683777f75098e] den [info]initializing consul: key=litefs/kcd url= hostname=5683777f75098e advertise-url=http://5683777f75098e.vm.kcd.internal:20202
2023-06-30T10:09:15Z app[5683777f75098e] den [info]config file read from /etc/litefs.yml
2023-06-30T10:09:15Z app[5683777f75098e] den [info]LiteFS main, commit=9ff02a303b5fc2e5c8bef5a173ab96dc4ab1c393
2023-06-30T10:09:15Z app[5683777f75098e] den [info]wal-sync: no wal file exists on "cache.db", skipping sync with ltx
2023-06-30T10:09:15Z app[5683777f75098e] den [info]wal-sync: no wal file exists on "sqlite.db", skipping sync with ltx
2023-06-30T10:09:16Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:16Z app[5683777f75098e] den [info]LiteFS mounted to: /litefs
2023-06-30T10:09:16Z app[5683777f75098e] den [info]http server listening on: http://localhost:20202
2023-06-30T10:09:16Z app[5683777f75098e] den [info]waiting to connect to cluster
2023-06-30T10:09:17Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:18Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:19Z runner[5683777f75098e] den [info]Pulling container image registry.fly.io/kcd@sha256:23a2cefb72eb42bb9d822b74b1479402daf439cbe906593b398be04301f74882
2023-06-30T10:09:19Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:19Z runner[5683777f75098e] den [info]Successfully prepared image registry.fly.io/kcd@sha256:23a2cefb72eb42bb9d822b74b1479402daf439cbe906593b398be04301f74882 (637.195691ms)
2023-06-30T10:09:19Z runner[5683777f75098e] den [info]Setting up volume 'data_machines'
2023-06-30T10:09:19Z runner[5683777f75098e] den [info]Opening encrypted volume
2023-06-30T10:09:20Z proxy[5683777f75098e] den [error]machine is in a non-startable state: created
2023-06-30T10:09:20Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:20Z proxy[5683777f75098e] den [error]machine is in a non-startable state: created
2023-06-30T10:09:20Z proxy[5683777f75098e] den [error]machine is in a non-startable state: created
2023-06-30T10:09:20Z proxy[5683777f75098e] den [error]machine is in a non-startable state: created
2023-06-30T10:09:21Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:21Z proxy[5683777f75098e] den [error]machine is in a non-startable state: created
2023-06-30T10:09:22Z proxy[5683777f75098e] den [error]machine is in a non-startable state: created
2023-06-30T10:09:22Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:22Z app[5683777f75098e] den [info] INFO Sending signal SIGINT to main child process w/ PID 240
2023-06-30T10:09:23Z health[5683777f75098e] den [error]Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2023-06-30T10:09:23Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:23Z proxy[5683777f75098e] den [error]machine is in a non-startable state: replacing
2023-06-30T10:09:24Z proxy[5683777f75098e] den [error]machine is in a non-startable state: replacing
2023-06-30T10:09:24Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:25Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:25Z proxy[5683777f75098e] den [error]machine is in a non-startable state: replacing
2023-06-30T10:09:26Z proxy[5683777f75098e] den [error]machine is in a non-startable state: replacing
2023-06-30T10:09:26Z health[5683777f75098e] den [info]Health check on port 8080 is now passing.
2023-06-30T10:09:26Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:27Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:27Z app[5683777f75098e] den [info] INFO Sending signal SIGTERM to main child process w/ PID 240
2023-06-30T10:09:27Z proxy[5683777f75098e] den [error]machine is in a non-startable state: replacing
2023-06-30T10:09:28Z proxy[5683777f75098e] den [error]machine is in a non-startable state: replacing
2023-06-30T10:09:28Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:29Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:30Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:31Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:32Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:32Z app[5683777f75098e] den [warn]Virtual machine exited abruptly
2023-06-30T10:09:33Z app[5683777f75098e] den [info] INFO Starting init (commit: db101a53)...
2023-06-30T10:09:33Z app[5683777f75098e] den [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2023-06-30T10:09:33Z app[5683777f75098e] den [info] INFO Resized /data to 3217031168 bytes
2023-06-30T10:09:33Z app[5683777f75098e] den [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount -- node ./other/start.js` as root
2023-06-30T10:09:33Z app[5683777f75098e] den [info] INFO [fly api proxy] listening at /.fly/api
2023-06-30T10:09:33Z app[5683777f75098e] den [info]2023/06/30 10:09:33 listening on [fdaa:0:23df:a7b:d828:18fd:6e90:2]:22 (DNS: [fdaa::3]:53)
2023-06-30T10:09:33Z app[5683777f75098e] den [info]Using Consul to determine primary
2023-06-30T10:09:33Z app[5683777f75098e] den [info]config file read from /etc/litefs.yml
2023-06-30T10:09:33Z app[5683777f75098e] den [info]LiteFS main, commit=9ff02a303b5fc2e5c8bef5a173ab96dc4ab1c393
2023-06-30T10:09:33Z app[5683777f75098e] den [info]initializing consul: key=litefs/kcd url= hostname=5683777f75098e advertise-url=http://5683777f75098e.vm.kcd.internal:20202
2023-06-30T10:09:33Z app[5683777f75098e] den [info]wal-sync: no wal file exists on "cache.db", skipping sync with ltx
2023-06-30T10:09:34Z app[5683777f75098e] den [info]wal-sync: no wal file exists on "sqlite.db", skipping sync with ltx
2023-06-30T10:09:35Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:35Z app[5683777f75098e] den [info]LiteFS mounted to: /litefs
2023-06-30T10:09:35Z app[5683777f75098e] den [info]http server listening on: http://localhost:20202
2023-06-30T10:09:35Z app[5683777f75098e] den [info]waiting to connect to cluster
2023-06-30T10:09:36Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:37Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:37Z proxy[5683777f75098e] fra [error]could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)
2023-06-30T10:09:38Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:39Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:40Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:41Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:42Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:43Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:44Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:45Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:46Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:47Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:48Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:49Z app[5683777f75098e] den [info]6515368117173208084CEBF0: cannot acquire lease or find primary, retrying: fetch primary url: Get "http://127.0.0.1:8500/v1/kv/litefs/kcd": dial tcp 127.0.0.1:8500: connect: connection refused
2023-06-30T10:09:49Z proxy[5683777f75098e] lhr [error]could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)
And it continued like that through the entire timeout period for the migration. So then I decided to try interactive troubleshooting mode (maybe I should’ve rolled back?). Here’s what I did then:
Oops! We ran into issues migrating your app.
We're constantly working to improve the migration and squash bugs, but for
now please let this troubleshooting wizard guide you down a yellow brick road
of potential solutions...
(ASCII art from the troubleshooting wizard trimmed)
The app's platform version is 'detached'
This means that the app is stuck in a half-migrated state, and wasn't able to
be fully recovered during the migration error rollback process.
Fixing this depends on how far the app got in the migration process.
Please use these tools to troubleshoot and attempt to repair the app.
No legacy Nomad VMs found. Setting platform version to machines/Apps V2.
/app/bin $ fly status
App
Name = kcd
Owner = personal
Hostname = kcd.fly.dev
Image = kcd:
Platform = machines
Machines
PROCESS ID VERSION REGION STATE CHECKS LAST UPDATED
app 5683777f75098e 379 den started 2 total, 1 passing, 1 critical 2023-06-30T10:09:33Z
/app/bin $ fly restart
Error: this command has been removed. please use `fly apps restart` instead
/app/bin $ fly apps restart kcd
Restarting machine 5683777f75098e
Waiting for 5683777f75098e to become healthy (started, 1/2)
Machine 5683777f75098e restarted successfully!
/app/bin $
Interestingly, seconds after I ran fly apps restart kcd, I started getting a mix of application logs (from what I believe is an old, unmigrated version of the app) and error logs from the new app. Discourse won’t let me post many more characters, so I’ll add the logs in a follow-up comment.
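In case it’s useful, this is how I’ve been trying to separate the two sets of logs (assuming I’m using the machine list and the -i/--instance filter correctly):
$ fly machine list -a kcd             # I’d expect this to show only the new machine, 5683777f75098e
$ fly logs -a kcd -i 9449075a         # logs from what I believe is the old, unmigrated instance
$ fly logs -a kcd -i 5683777f75098e   # logs from the new (erroring) machine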
Here’s what my monitoring page looks like now:
[monitoring dashboard screenshot]
So yeah, my site is running OK for users and whatnot, but it’d be pretty cool to finish this migration. I think it makes the most sense to figure out what’s wrong with the migration process, fix that, and then start over, so I don’t lose any data written since I kicked off this migration.
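One thing I noticed in the logs above: LiteFS prints url= with nothing after it when initializing Consul, so I suspect the new machine isn’t getting a Consul URL. The lease section of my litefs.yml is roughly the following (paraphrased; the url is supposed to be filled in from the FLY_CONSUL_URL environment variable):
lease:
  type: "consul"
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"
  consul:
    url: "${FLY_CONSUL_URL}"
    key: "litefs/${FLY_APP_NAME}"
If FLY_CONSUL_URL didn’t carry over to the migrated machine, that would at least explain the endless connection-refused retries against 127.0.0.1:8500.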
Also, as an aside, I’d love to be able to get my volumes renamed from data_machines back to simply data when this is finished, if at all possible.
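(My understanding is that volumes can’t be renamed in place, so I’m guessing the answer involves creating fresh volumes named data, copying the data over, and pointing the mounts section of fly.toml back at them, roughly:
[mounts]
  source = "data"
  destination = "/data"
but if there’s a less manual way I’m all ears.)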
Any help is appreciated!