Sunsetting Nomad

I tried that. fly migrate-to-v2 displayed this at some point:

INFO Using wait timeout: 5m0s lease timeout: 13s delay between lease refreshes: 4s

failed while migrating: Process group 'app' needs volumes with name 'pg_data_machines' to fullfill mounts defined in fly.toml; Run `fly volume create pg_data_machines -r REGION` for the following regions and counts: nrt=1

So I ran the mentioned command and then ran fly migrate-to-v2 again, which ended with:

==> Migrating ...-db to the V2 platform
>  Upgrading postgres image
>  Setting postgres primary to readonly
>  Creating new postgres volumes
>  Locking app to prevent changes during the migration
>  Enabling machine creation on app
>  Creating an app release to register this migration
>  Starting machines
INFO Using wait timeout: 5m0s lease timeout: 13s delay between lease refreshes: 4s

Updating existing machines in '...-db' with rolling strategy

-------
 ⠋ Waiting for 1234567890 [app] to become healthy: 1/3
-------
failed while migrating: timeout reached waiting for healthchecks to pass for machine 1234567890 failed to get VM 1234567890: Get "https://api.machines.dev/v1/apps/...-db/machines/1234567890": net/http: request canceled
? Would you like to enter interactive troubleshooting mode? If not, the migration will be rolled back. (Y/n)

Hitting Y returns:

Oops! We ran into issues migrating your app.
We're constantly working to improve the migration and squash bugs, but for
now please let this troubleshooting wizard guide you down a yellow brick road
of potential solutions...
               ,,,,,
       ,,.,,,,,,,,, .
   .,,,,,,,
  ,,,,,,,,,.,,
     ,,,,,,,,,,,,,,,,,,,
         ,,,,,,,,,,,,,,,,,,,,
            ,,,,,,,,,,,,,,,,,,,,,
           ,,,,,,,,,,,,,,,,,,,,,,,
        ,,,,,,,,,,,,,,,,,,,,,,,,,,,,.
   , ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

The app's platform version is 'detached'
This means that the app is stuck in a half-migrated state, and wasn't able to
be fully recovered during the migration error rollback process.

Fixing this depends on how far the app got in the migration process.
Please use these tools to troubleshoot and attempt to repair the app.

When I run fly migrate-to-v2 I get a timeout error, and when I choose the interactive troubleshooting mode I get the options shown in the screenshot. I am not sure which option to choose so that my data is not lost. Any help will be appreciated. Thanks

Hi! I looked up your app in the backend. It seems like the issue was just network trouble - your health checks for the migrated machines are passing and it looks like everything is running.

You can run fly migrate-to-v2 troubleshoot to return to that screen if you’ve since closed flyctl, and you’ll want to select “Destroy remaining Nomad VMs and use Apps V2.” (for what it’s worth, even if something were to go wrong, you’d still be able to recover the old VMs, so don’t worry about losing anything)

I’m so sorry for the confusion here - that message definitely should not be printed during a migration, and its suggestion is incorrect. (Internally, migrate-to-v2 calls fly deploy, which is where that error comes from, but it’s not applicable to a migration. Definitely a bug.)

To repair things, you should run fly migrate-to-v2 troubleshoot to get back to that troubleshooting wizard, and choose the option that says something along the lines of “destroy existing Machines and use Nomad.” Additionally, you should delete the volume you created - it’ll be the one with the suffix _machines.

I’m going to check and see what went wrong here, and what we can do to make that migration work. In the meantime, those instructions should get you back on Nomad and running stable.

Alright, done that.

Hi @allison
Did you have time to look at the debug logs?

I tried a few other solutions from this thread but always end up with the same Error: 404: 404 page not found.

Ok, that bug should be fixed now! Sorry for the trouble.

Before migrating, run fly version show to double-check that you’re on flyctl v0.1.108, then migrate-to-v2 should work on your database. (it might take a day or so for that to hit Homebrew, if that’s your package manager of choice)
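
If you’d rather script that check, here is a minimal sketch. It assumes the fix ships in v0.1.108 and any later release, that your `sort` supports `-V` (GNU coreutils and recent macOS versions do), and that you paste in the version string flyctl reports yourself. `version_at_least` is a hypothetical helper, not part of flyctl.

```shell
# version_at_least HAVE WANT: succeeds if HAVE >= WANT.
# Hypothetical helper, not part of flyctl. A leading "v" is optional.
version_at_least() {
  have=${1#v}
  want=${2#v}
  # sort -V orders version strings numerically per component;
  # HAVE is new enough when WANT sorts first (or both are equal).
  [ "$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)" = "$want" ]
}

# Example: replace the first argument with the version flyctl reports.
if version_at_least "v0.1.108" "v0.1.108"; then
  echo "flyctl is new enough"
else
  echo "please upgrade flyctl"
fi
```

A plain string comparison would get this wrong, since "0.1.99" sorts after "0.1.108" alphabetically, which is why the sketch leans on `sort -V`.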

Hi! I haven’t been able to determine the root cause of this.

In the meantime, if a little bit of downtime is acceptable, you can try:

fly migrate-to-v2 --force-standard-migration

This flag sidesteps all the smart Postgres-specific migration code that keeps your db online during the migration, including the specific request that’s failing in those debug logs. In exchange for the downtime, there are a lot fewer moving parts.

It worked! Thanks for the support @allison

@allison I’m getting the following error running fly migrate-to-v2

DEBUG gqlErr: <nil> agentErr: <nil>

DEBUG flypg will connect to: http://fdda:...:3:5500

DEBUG --> GET http://fdda:...:3:5500/commands/admin/role

DEBUG <-- 500 http://fdda:...:3:5500/commands/admin/role (5.15s)

DEBUG {
  "error": "context deadline exceeded"
}


DEBUG Task manager done
Error: can't get role for fdda:...:3: 500: context deadline exceeded

Any ideas for things to try?

I am getting an error when trying to migrate my postgres app to v2. It says it can’t create the volume and returns a status of 503. Not sure what to do.

Thanks, it worked now!

So, that IPv6 address is the IP of the leader node (via VPN into your org’s network) in that Postgres cluster. The endpoints on the node should definitely not be returning 500 errors.

At the same time, there’s a time for us to sit down and figure out what’s causing that, and that time is not 18 days before Nomad gets removed haha. I think, right now, you should double-check that your database is OK. fly pg connect and look around, just make sure things are still working.

If it all looks good, I’d run fly migrate-to-v2 --force-standard-migration. That will cause a couple minutes of downtime, but it’ll get you moved over so you don’t have to worry about any deadlines.

If the downtime won’t work for you, we can look at other options, but I’m inclined to say the simplest option is safest if it won’t cause any significant issues.

Can you try running that again? We’ve had some momentary flickers in the past day or so with volume creation - it might have resolved itself in time :)

The same error is returned when I try to run fly pg connect. I re-ran flyctl auth login to make sure I was recently authenticated.

Error: can't get role for fdda:...:3: 500: context deadline exceeded

I tried again and am still getting the same error. I am in region ewr, and I saw a thread about trouble creating volumes in that region. Could that be the problem?

Hi @allison . Is there any update on this by any chance? I am still getting emails saying that time is running out.

I have since retried the command and now I get

Error: 404: 404 page not found

I tried the --force-standard-migration and that errored with the following:

Making snapshots of volumes for the new machines
failed while migrating: failed to create volume: request returned non-2xx status, 503
? Would you like to enter interactive troubleshooting mode? If not, the migration will be rolled back. Yes
failed while troubleshooting: failed to create volume: request returned non-2xx status, 503
Error: failed to create volume: request returned non-2xx status, 503 (Request ID: 01HCR7ZRJHM0C17Z8TM8XFRJN3-lga)