PSA: Machine migration has started again

TLDR: We’re going to start migrating machines with volumes soon. If your application records 6PN addresses and assumes they will never change, you need to update your application to prepare for machine migration.

Since the launch of the Apps V2 platform, machines have been strongly tied to their underlying hosts. This model is simpler and more predictable than our previous orchestrator. Once your machine is started, it stays on the same host until the machine is explicitly destroyed.

However, this strong coupling between machines and our physical hosts makes it challenging to scale our large fleet of servers without disrupting your workloads. As we scale our fleet, we are adding new hosts with more redundancy and better performance than older servers that have been running for years. Moving machines off older servers also makes it possible for us to upgrade our overall platform and deploy new software fixes and features more quickly.

So, we are building a new capability into our platform - migrating machines from one host to another. This is not “live” migration. We explicitly stop your machine first and create another one on a different host.

The new machine has the same machine ID as the original. Any attached volumes have the same content as the original, although the volume ID will change. The 6PN address will be different from the original machine’s.

We’ve already started migrating machines without volumes and will migrate machines with volumes soon. Because of the previous Fly Postgres breakage, we are holding off on migrating HA Postgres applications, but we plan to support that case in the coming weeks.

What do you need to do?

Please make sure that your application does not directly depend on 6PN addresses staying the same.

  1. Most web applications don’t need to know 6PN addresses at all. They should be fine.
  2. If your application uses our internal DNS to look up 6PN addresses on the fly, it should be fine.
  3. If your application records 6PN addresses and assumes they will never change, you need to update your application to prepare for machine migration.
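
For case 3, one option is to stop storing addresses and look them up at connection time instead. The following is a minimal sketch, not an official recommendation: the hostname `my-app.internal` and port `8080` are placeholders I made up, and `resolve_6pn` and `connect` are hypothetical helper names. It only relies on Python's standard `socket` module.

```python
import socket
from typing import Callable, List

# Placeholder values (assumptions): replace with your own app's internal
# hostname and the port your service listens on.
INTERNAL_HOST = "my-app.internal"
INTERNAL_PORT = 8080


def resolve_6pn(host: str = INTERNAL_HOST, port: int = INTERNAL_PORT,
                getaddrinfo: Callable = socket.getaddrinfo) -> List[str]:
    """Return the IPv6 addresses currently published for host.

    The lookup happens on every call, so an address that changed after a
    migration is picked up without redeploying or editing stored config.
    """
    infos = getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IPv6 address.
    return sorted({info[4][0] for info in infos})


def connect(host: str = INTERNAL_HOST, port: int = INTERNAL_PORT,
            timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection by name instead of by a recorded address."""
    # create_connection resolves the name itself on each call.
    return socket.create_connection((host, port), timeout=timeout)
```

The `getaddrinfo` parameter exists only so the lookup can be swapped out in tests. If you do cache results, keeping the cache short-lived and refreshing it on connection errors should give similar protection.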

FAQ

How does Fly Postgres handle machine migration?

We are going to issue a few repmgr commands during migration to let the primary know that the standby machines’ addresses have changed. While Fly Postgres is not “managed” Postgres (Supabase is!), we will do our best to keep your Postgres running without issues.

How do I know if my machine has been migrated?

We are going to expose machine events from flyctl and/or fly.io/dashboard.

Would machine migration corrupt volumes?

No. Machine migration will stop your machine first to make sure there are no in-flight writes. Please make sure your application handles Unix signals accordingly.

Would machine migration potentially kill in-flight requests?

Like above, machine migration will stop your machine first by sending signals. Please make sure your application handles Unix signals accordingly.
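
As a rough illustration of "handling Unix signals accordingly", here is a minimal sketch of a worker that finishes its current job and then exits when it receives a stop signal. Which signal your machine actually receives depends on your configuration, so this example registers both SIGINT and SIGTERM as an assumption. The names `request_shutdown`, `install_handlers`, and `run_worker` are hypothetical.

```python
import signal
import threading

# Set once a stop signal arrives; checked between units of work.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only record the request, do no heavy work here."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler. Must be called from the main thread."""
    # Assumption: the stop signal is SIGINT or SIGTERM.
    signal.signal(signal.SIGINT, request_shutdown)
    signal.signal(signal.SIGTERM, request_shutdown)


def run_worker(jobs, handle):
    """Process jobs in order, stopping before the next one once shutdown
    has been requested. Returns the number of jobs completed."""
    completed = 0
    for job in jobs:
        if shutdown_requested.is_set():
            break  # leave the remaining jobs for the next start
        handle(job)  # the in-flight job is allowed to finish
        completed += 1
    return completed
```

A web server would do the equivalent by no longer accepting new connections and letting in-flight requests finish before exiting. Most frameworks expose a graceful shutdown hook for this.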

This could have a massive impact on anyone who tracks volume IDs for whatever reason.

Given how impactful this change will be, it should be announced by email as well, not just by forum post.

EDIT: We received an email from customer success, though I’m not sure if it was sent to everyone. The email doesn’t mention that volume IDs will change, but it links to this forum thread for more details.

Due to the impact, we have paused machine migration for now. We will update this thread before resuming it.

Yes. We will send email notifications to affected customers.

Yes. Let us know if you track volume IDs. I can’t promise that we can keep volume IDs consistent, but I’d like to figure out a migration path.

What happens to Fly Postgres instances running this image instead of the repmgr one? GitHub - fly-apps/postgres-ha: Postgres + Stolon for HA clusters as Fly apps.

We finally sent the email! So the migration is starting soon.

Starting from 06:00 UTC June 20th, we will begin to migrate Fly Machines on these host servers to a different host server.

The Stolon-based clusters won’t be affected.

I know you’ll be starting this in 2 days, but an emergency host maintenance a couple of hours ago broke an app of mine.

This app has a single machine which is in charge of processing queues, triggering jobs, etc.

Some things started failing, and it was because this app now had 3 machines running and triggering jobs three times.

Please take this into account for the migration. If an app has a single machine, you should not be creating new machines.

While Fly Postgres is not “managed” Postgres (Supabase is!),

I would love to use Supabase; however, the docs state:

This service is in public alpha. Do not run production workloads of any kind!

So… I’ll get there when you get there?

OK, I just migrated my old Stolon Postgres database manually to try to avoid any issues when the automatic migration came through. I had a read replica in my cluster that was already migrated, but my leader wasn’t, so I needed to force the leader to change.

Here’s what I did:

  1. Scale the cluster up to 3 machines (I had two previously)
  2. Take a database backup offsite
  3. SSH into your master and follow this guide: What is the correct process to change the postgres leader region? - #2 by shaun. I followed the stolonctl commands.
  4. The failover began and allocated one of the recently migrated machines to be the leader.
  5. I stopped the old server (the non-migrated one)
  6. Waited for the new replica to become healthy

Everything worked out. There was ~20 minutes of downtime; this depends on your volume size. At first I didn’t turn off the previous leader, and the replica sync failed.

Hi Pier, a migration will move a Machine, but it will never create a new Machine. The number of live Machines will briefly decrease by one and then return to the original number, but at no point in the migration sequence will a Machine be alive in two places at once.

Actually, nothing at all in the Fly Platform should be creating Machines except by your command. Care to share more about your situation where you unexpectedly had three Machines where you thought you had one?

Yes, that’s what anyone would expect, and yet two extra machines were created without our command.

Care to share more about your situation where you unexpectedly had three Machines where you thought you had one?

You probably have logs about what happened? I can send you the app id to an email if you want to check it out.

We received an email about host maintenance that listed the affected app and machine ID. I checked that the app was running, so I assumed the maintenance had come and gone without us noticing.

Then the issues on our end started. It took over an hour until we figured out what was going on and we killed the 2 extra machines.

I can confirm 100% that we didn’t create any new machines, neither via the CLI nor the API. We don’t create machines automatically in any of our Fly apps. At some point in the midst of the panic we deployed that app to add some logs, but nothing else.

Why would we create new machines in an app that was specifically designed to not work concurrently?

Regarding the current migration happening in a couple of hours… is there a way to check if a machine or app has been migrated? Asking because we will put our public facing apps in maintenance mode until we’re certain everything has been migrated correctly.

Are the events in the dashboard working?

Still no new events in any of our apps after 2+ hours.

Hi @pier ! Yes, you will be able to see when a machine has been migrated, either in the dashboard or flyctl machine status.

In the events table for the machine, you’ll see an event of type launch with state created, source flyd, and migrated=true in the Info column.
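
If you want to check for that event programmatically rather than by eye, the matching logic could look something like the sketch below. This assumes you have already loaded a machine's events into a list of dictionaries. The key names (`type`, `state`, `source`, `info`) are my guess at a convenient shape, not a documented format, so adjust them to whatever your tooling actually returns.

```python
from typing import Iterable, Mapping


def is_migration_event(event: Mapping) -> bool:
    """True if the event matches the description above: type launch,
    state created, source flyd, and migrated=true in the info field.

    The key names are hypothetical, chosen for this example only.
    """
    return (
        event.get("type") == "launch"
        and event.get("state") == "created"
        and event.get("source") == "flyd"
        and "migrated=true" in str(event.get("info", ""))
    )


def was_migrated(events: Iterable[Mapping]) -> bool:
    """True if any event in the machine's history is a migration event."""
    return any(is_migration_event(e) for e in events)
```

For example, `was_migrated([{"type": "launch", "state": "created", "source": "flyd", "info": "migrated=true"}])` evaluates to `True`, while a launch event from any other source does not match.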

Thanks @aschiavo !

Is there any ETA for the migration to be completed?

Hi @pier , I’m afraid I don’t have an ETA for when your machines will be migrated. We have a long list of hosts we want to decommission as soon as possible, but we want to do it safely, so we are not going to rush the process and migrate everything at once.

Yes, please post here or send to support@fly.io whatever details you have: app name, machine IDs, the time this happened, etc. This sounds extremely unusual and I absolutely will investigate.

Gracias Andrés.

Next time, it would be great if each customer were notified before their machines started migrating, or at least given some kind of timeframe so we can monitor our apps and check that everything is working as expected.

To be honest we had no idea how long this migration would take or what to expect. Is this going to take hours? Days? Weeks?

I will. Thanks for looking into this.

Hi @pier, did you send the info about this unexpected Machine creation event to any fly.io email? I’m happy to investigate what happened, but I’ve searched and can’t find any email from you in any fly.io inbox. If you have a personal support email, you can send it there, or to support@fly.io, or if you’re okay with posting information publicly, you can add it here to this thread.

Sorry John, I still haven’t sent the info. I’ll ping you here once I’ve done that.

@john-fly I just sent an email to support with some info.

Thanks for checking this out.

Edit:

Just received an automated response saying support@fly.io is unmonitored. Can you confirm you received the email?

Hi Pier, no, thank you for sending this; this is definitely something we’d like to investigate. Yes, support@fly.io is unmonitored in the sense that it’s no one’s job to watch it (users with paid plans have their own custom support addresses), but the mail still lands in an inbox, and I fished it out. Taking an immediate look at the backend, I do see records of two Machines in that App that were created and destroyed on the date you mention. I’ll try to dig more into why soon.
