fly migrate-to-v2: Apps with volumes support 🎉

For two weeks, we’ve had a migration tool out, fly migrate-to-v2, which automagically converts your apps from our legacy V1 orchestration to our shiny new Apps V2 platform!

With flyctl v0.0.523 (already released!), migrate-to-v2 finally supports apps with volumes!
(even the per-process-group mounts that landed just yesterday)

Usage

Update flyctl (flyctl version update), then run fly migrate-to-v2 in the root of your project (next to fly.toml). That’s it!
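
Spelled out, that’s just:

# update flyctl to the latest release
flyctl version update
# then, from the directory containing fly.toml, start the migration
fly migrate-to-v2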

During the migration, each volume in your app will be cloned, and the newly created machines will use these new volumes (they’ll have the suffix -machines appended to their names).
This is for data integrity reasons - if the migration doesn’t go perfectly smoothly, you still have pristine, untouched copies of your data. After you’ve migrated and verified that your app works, you can safely delete the old volumes (the ones without the -machines suffix).
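
For example, once you’ve verified the migrated app, cleanup might look something like this (the volume ID is a placeholder):

# list volumes and note the old ones without the -machines suffix
fly volumes list
# destroy an old volume once you're sure the cloned one is healthy
fly volumes destroy <old-volume-id>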

Also of note: unlike the migration process for apps without volumes (and for Postgres apps), migration for standard apps with volumes does incur slight downtime. The old VMs have to be powered down before snapshots can be taken of their contents, again for data integrity reasons. The downtime is usually only ten to twenty seconds in my experience (the time between the last Nomad VM shutting down and the first machine being deployed), but that might not be acceptable for some use cases. (If this is you, please talk to us about it in this thread!)


For apps with volumes, the move to V2 should basically be a strict upgrade. You’ll get the flexibility and stability of machines, and the low-level nature of volumes (being bound to hardware) tends to just make sense when paired with machines.

If you run into any issues migrating your app, please let us know! We love feedback, and want to make sure these tools work for you.
Thanks!

15 Likes

Just updated a multi-region Flask app with no problems in around 90 secs, and can confirm that subsequently deploying via GH Actions using the updated fly.toml works as expected. It’s a very un-exotic app, but still.

4 Likes

Is the migration rolling? Or do all the VMs need to be shut down before the new machines can be spun up?

We’re using CockroachDB, and a rolling migration would offer us the potential for a live, no-downtime migration if we could migrate each node one by one and have each node finish connecting to the cluster before the next node is migrated.

It currently shuts everything down, forks the volumes, then brings up the new machines.

We’d love to ship zero-downtime migrations for apps with volumes; it’s just very difficult (if possible at all) to do this in a way that doesn’t jeopardize data integrity.

For specialized use cases, where proceeding through various steps of the migration process requires application-specific behavior (such as checking to see if your nodes are all connected 🙂), it’s theoretically possible to hack that into flyctl. It’s open source, and our migration process is entirely client-side.

Is it possible to trigger the migration steps manually through the CLI?

flyctl v0.0.533 (should be releasing in an hour or so, maybe a day for it to land in homebrew) will expose the hidden command fly apps set-platform-version <nomad|detached|machines>.

In this case, the automated process for migrating an app with volumes could be expressed as something like this:

fly apps set-platform-version detached
fly scale count 0
# For each volume
fly vol fork --name "oldname-machines" <vol-id>
# For each nomad VM that originally existed
fly m run <the-app-image> -m fly_platform_version=v2 -p <port-mapping> --size <vm-size> -v <volume-id>:/mount/point
fly apps set-platform-version machines
# Edit mount points in fly.toml to point to "<oldvolname>-machines"
fly deploy

If your database can sync a node from an existing node, it might be possible to do something like this:

fly apps set-platform-version detached

# Create new volumes
fly vol create "oldname-machines" -s <size> -r <region>
# ...or, if you're feeling lucky, fork the existing volume instead (not recommended while the old VM is still running, for data integrity reasons)
fly vol fork --name "oldname-machines" <vol-id>

# Create new machines
fly m run <the-app-image> -m fly_platform_version=v2 -p <port-mapping> --size <vm-size> -v <volume-id>:/mount/point
fly scale count 0
fly apps set-platform-version machines
# Edit mount points in fly.toml to point to "<oldvolname>-machines"
fly deploy

In live service surgery like this, it’s really easy to break something. Good luck!

1 Like

Hi! I just tried this and got the following error

fly migrate-to-v2 -a ***** -c ./db/fly.toml --primary-region fra
This migration process will do the following, in order:
 * Lock your application, preventing changes during the migration
 * Remove legacy VMs 
   * Remove 1 alloc
   * NOTE: Because your app uses volumes, there will be a short downtime during migration while your machines start up.
 * Create clones of each volume in use, for the new machines
   * These cloned volumes will have the suffix '_machines' appended to their names
   * Please note that your old volumes will not be removed.
     (you can do this manually, after making sure the migration was a success)
 * Create machines, copying the configuration of each existing VM
   * Create 1 "app" machine
 * Set the application platform version to "machines"
 * Unlock your application
 * Overwrite the config file at '/home/einar/Projects/****/db/fly.toml'
? Would you like to continue? (y/N) y
? Would you like to continue? Yes
==> Migrating klimsek to the V2 platform
>  Locking app to prevent changes during the migration
>  Making snapshots of volumes for the new machines
failed while migrating: volume pb_data[vol_g2yxp4mo****] is mounted on alloc 84669****, but has no mountpoint
==> (!) An error has occurred. Attempting to rollback changes...
>  Successfully recovered
>  Unlocking application
Error: volume pb_data[vol_g2yxp4mo****] is mounted on alloc 84669****, but has no mountpoint

I note the error message “…has no mountpoint” but I don’t know what to do about it.

Please help! 😕

Edit: This is my fly.toml

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[build]
dockerfile = "./Dockerfile"

[env]

[experimental]
auto_rollback = true

[[services]]
http_checks = []
internal_port = 8080
processes = ["app"]
protocol = "tcp"
script_checks = []
[services.concurrency]
hard_limit = 30
soft_limit = 20
type = "connections"

[[services.ports]]
force_https = true
handlers = ["http"]
port = 80

[[services.ports]]
handlers = ["tls", "http"]
port = 443

[[services.tcp_checks]]
grace_period = "1s"
interval = "15s"
restart_limit = 0
timeout = "2s"

[mounts]
destination = "/pb/pb_data"
source = "pb_data"
1 Like

Got the same issue here on similar configs after running fly migrate-to-v2:

Error: volume eqai_data[vol_pkl7vzkjd6qvqg60] is mounted on alloc a68fc3bf-76b9-9e5e-0ee8-33047bebdddd, but has no mountpoint

fly.toml

[mounts]
  source="eqai_data"
  destination="/data"

Any thoughts? Thanks.

Hey @eipe and @indiependente, that’s a nasty regression, and I’m sorry that it somehow slipped past our release testing. Thanks for letting us know!

Just wanted to let you know that a fix for that is being deployed right now, and it’ll make it into flyctl v0.0.542 (which should release in about twenty minutes; it might take up to 24 hours to land for homebrew users).

2 Likes

Great. I just ran it and it seems to have worked, overall.

I now have a couple of duplicate volumes though, which I suspect are due to the previous unsuccessful migration. I guess I can just remove the ones not attached to a VM.

ID                      STATE   NAME                            SIZE    REGION  ZONE    ENCRYPTED       ATTACHED VM     CREATED AT    
vol_pkl7v*****    created pb_data_machines                1GB     fra     8ee7    true            32874*****  3 minutes ago
vol_8l524*****    created pb_data_machines_machines       1GB     fra     8ee7    true                            3 minutes ago
vol_d7xkr*****    created pb_data_machines                1GB     fra     8ee7    true                            1 day ago    
vol_g2yxp*****    created pb_data                         1GB     fra     8ee7    true                            2 months ago 

I also get the following errors in the fly.toml now:

(Maybe this last thing is just related to this. Oh well.)

I ran fly migrate-to-v2 on an app running in 24 regions with a 30GB volume in each region, and got the following error:

==> Migrating <app> to the V2 platform
>  Locking app to prevent changes during the migration
>  Making snapshots of volumes for the new machines
failed while migrating: You hit a Fly API error with request ID: 01GYZWW556KGYP7017RM00VZ9C-sjc
==> (!) An error has occurred. Attempting to rollback changes...
>  Successfully recovered
>  Unlocking application
Error: You hit a Fly API error with request ID: 01GYZWW556KGYP7017RM00VZ9C-sjc

Any idea what caused this?

Note: There’s not much information for me to debug this issue further. It’d be great if there were some follow-up / manual instructions to dig into, or if the error showed which Fly API call failed.

CLI version: flyctl v0.0.544 darwin/arm64 Commit: 803df3fc

You can run the command with LOG_LEVEL=debug and you’ll see the GraphQL queries. I’m also hitting similar issues when upgrading - I’ve been working with support to get them resolved, but so far have nothing to show.
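
For example, something like this (the app name is a placeholder) will print the GraphQL requests and responses as they happen:

LOG_LEVEL=debug fly migrate-to-v2 -a <app-name>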

We ran flyctl migrate-to-v2 on two apps, and on both we found that secrets that had previously been “unset” in the old application somehow reappeared. On the second app, the reappearance of unset secrets broke our deployment.

flyctl version: 0.0.544

We’ve had a few reports of secrets-related issues with the migrate-to-v2 command. We are looking into it!

1 Like

Looks like with apps that have many regions and volumes, we get the following GraphQL error:

DEBUG <-- 200 https://api.fly.io/graphql (118.93ms)

{
  "data": {
    "resumeApp": null
  },
  "errors": [
    {
      "message": "App is not suspended",
      "locations": [
        {
          "line": 1,
          "column": 38
        }
      ],
      "path": [
        "resumeApp"
      ],
      "extensions": {
        "code": "UNPROCESSABLE"
      }
    }
  ]
}

This is probably due to the instance being suspended in that region; then the volume gets copied over, and then it tries to resume the instance but fails with the above error.

After the initial fly migrate-to-v2, we see one single orphaned new volume created without any instance attached, and the above error shows up. Any ideas?

Getting errors after running fly version update and then fly migrate-to-v2 on both my applications.

This is a more complicated deployment that uses a volume and relies on a Postgres DB.

==> Migrating fly-pleroma-alpine-alpha to the V2 platform
>  Locking app to prevent changes during the migration
>  Making snapshots of volumes for the new machines
failed while migrating: Disk id 590 is not a valid candidate
==> (!) An error has occurred. Attempting to rollback changes...
>  Successfully recovered
>  Unlocking application
Error: Disk id 590 is not a valid candidate

My simple Telegram bot also fails to migrate.

? Would you like to continue? Yes
==> Migrating ssb-tg to the V2 platform
>  Locking app to prevent changes during the migration
>  Making snapshots of volumes for the new machines
failed while migrating: Disk id 679 is not a valid candidate
==> (!) An error has occurred. Attempting to rollback changes...
>  Successfully recovered
>  Unlocking application
Error: Disk id 679 is not a valid candidate

For a production app, you should have at least two VMs provisioned for each process. Unlike Nomad, flyd does not attempt to move VMs if their host hardware fails.

This isn’t ideal. Are there any plans to support single-instance use cases? For a simple app that doesn’t get much traffic, maintaining two VMs seems unnecessarily expensive.

Also curious: in the two-VM case, what happens if one of them has a host hardware failure? Does it just stay in a failed state?


Everything was fine before migrating my Telegram bot to Apps V2 using 'fly migrate-to-v2'; now I am receiving bad escape and unbalanced parenthesis errors.

To avoid spamming everyone’s inboxes, I’m going to knock out a bunch of responses in one go.

@pu94x this is really strange! My immediate guess is that when you migrated, it might’ve rebuilt the Docker image, and in the time between your first deploy and now, something changed in the base image.

The output looks like it’s failing to load a config file somewhere. If so, is that file stored in the Docker image or on the mounted volume? And if you fly ssh console into the VM, is the file intact?
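
For example, roughly (the config path here is just a placeholder):

# open a shell inside the running machine
fly ssh console -a <app-name>
# then, from inside the VM, check whether the file looks intact
cat /path/to/the/config/file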

@vimota If you set up multiple machines and configure autostart/autostop, you should be able to get high availability without having to pay for VMs you’re not using. We’ve been trying to scale back on “magic” in favor of configurable primitives, so that our platform is more predictable and developers have more control.

For apps without volumes, hardware failure should be temporary, with the machine coming back once that incident is resolved. (This is how it’s supposed to work, at least, but we’re not 100% there yet. You might have to contact support to get machines moved off unhealthy hosts if something blows up in the next month or so.)
We don’t have a good automated recovery story yet for apps with volumes. For now, recovery means manually creating a volume based on a backup snapshot (fly vol snapshots list → fly vol create --snapshot-id <id> <name>), then creating a new machine with that volume attached. The data might be a little stale (I believe up to a day), but it’s better than being offline.
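
A rough sketch of that recovery path (names, IDs, and the mount point are placeholders):

# list the automatic snapshots for the affected volume
fly volumes snapshots list <volume-id>
# create a fresh volume from one of those snapshots
fly volumes create <new-volume-name> --snapshot-id <snapshot-id> -r <region> -s <size-gb>
# start a new machine with the restored volume attached
fly machine run <image> -v <new-volume-id>:/mount/point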

@ethanrjones97 I believe the hosts those volumes are on may not have enough space to provision the cloned volumes. For now, we don’t have a way to transfer that data to a new host.

@lucille “App is not suspended” is a red herring. Way back when, machines apps would use the suspended flag to convey state, so when a migration fails, the Nomad app sometimes ends up marked as suspended (even though it’s running fine). Because of that, we unconditionally resume when rolling back a migration.

The one orphaned volume is showing up because the process failed after it created that single volume. This is a bug - rollback after an error should clean up everything - but it’s also not the real issue at hand here.

The error you’re encountering is strange and difficult to reproduce. We’re aware of it and trying to diagnose it.

@eipe Happy to hear you were finally able to migrate! You hit the same thing mentioned above, where failed migrations don’t clean up their mess afterwards - sorry! That said, a successful migration doesn’t remove the old volumes, leaving that up to the user - an automated process deleting user data, no matter how much faith you put in the process, seems like a bad idea.

5 Likes

Trying to upgrade an app, I got this error:

Error: cannot migrate app farmish with autoscaling config, yet; watch https://community.fly.io for announcements about autoscale support with migrations

I thought I’d just disable autoscaling to do the upgrade, but I discovered that’s not working (See: Autoscale disable not working?)

Is there a way around this yet?