PSA: Machine migration has started again

TLDR: We’re going to start migrating machines with volumes soon. If your application records 6PN addresses and assumes they will never change, you need to update it to prepare for machine migration.

Since the launch of the Apps V2 platform, machines have been strongly tied to their underlying hosts. This model is simpler and more predictable than our previous orchestrator: once your machine is started, it stays on the same host until the machine is explicitly destroyed.

However, this strong coupling between machines and our physical hosts makes it challenging to scale our large fleet of servers without disrupting your workloads. As we scale our fleet, we are adding new hosts with more redundancy and better performance than older servers that have been running for years. Moving machines off older servers also makes it possible for us to upgrade our overall platform and deploy new software fixes and features more quickly.

So, we are building a new capability into our platform - migrating machines from one host to another. This is not “live” migration. We explicitly stop your machine first and create another one on a different host.

The new machine has the same machine ID. Any attached volumes have the same content as the originals, although the volume ID will change. The 6PN address will be different from the original machine’s.

We’ve already started migrating machines without volumes and will migrate machines with volumes soon. Because of the previous Fly Postgres breakage, we are holding off on migrating HA Postgres applications, but we are going to support that case in the coming weeks.

What do you need to do?

Please make sure that your application does not depend directly on 6PN addresses staying the same.

  1. Most web applications don’t need to know 6PN addresses at all. They should be fine.
  2. If your application uses our internal DNS to look up 6PN addresses on the fly, it should be fine.
  3. If your application records 6PN addresses and assumes they will never change, you need to update your application to prepare for machine migration.
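
For case 3, one possible fix is to store a DNS name instead of an address and resolve it each time you connect. The sketch below is only an illustration: the hostname `my-app.internal` and the helper names `resolve_6pn` and `connect_to_peer` are hypothetical, and the `resolver` parameter exists only so the lookup can be swapped out in tests.

```python
import socket


def resolve_6pn(hostname, port, resolver=socket.getaddrinfo):
    """Return the current IPv6 addresses for hostname, looked up fresh on every call."""
    infos = resolver(hostname, port, socket.AF_INET6, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def connect_to_peer(hostname, port, timeout=5.0):
    """Open a TCP connection, resolving the name at connect time rather than reusing a stored address."""
    last_error = None
    for address in resolve_6pn(hostname, port):
        try:
            return socket.create_connection((address, port), timeout=timeout)
        except OSError as exc:
            # This address may belong to a machine that has moved; try the next one.
            last_error = exc
    raise ConnectionError(f"no reachable address for {hostname}") from last_error
```

Calling something like `connect_to_peer("my-app.internal", 5432)` on every reconnect, instead of reading an address saved at startup, means a migrated machine is picked up as soon as DNS reflects its new address.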

FAQ

How does Fly Postgres handle machine migration?

We are going to issue a few repmgr commands during migration to let the primary know that the standby machines’ addresses have changed. While Fly Postgres is not “managed” Postgres (Supabase is!), we will do our best to keep your Postgres running without issues.

How do I know if my machine has been migrated?

We are going to expose machine events through flyctl and/or fly.io/dashboard.

Would machine migration corrupt volumes?

No. Machine migration will stop your machine first to make sure there are no in-flight writes. Please make sure your application handles Unix signals accordingly.

Would machine migration potentially kill in-flight requests?

As above, machine migration will stop your machine first by sending signals. Please make sure your application handles Unix signals accordingly.
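
As a rough illustration of what handling signals can look like, here is a minimal Python sketch of a worker that finishes its in-flight job and then stops picking up new ones once a stop signal arrives. It assumes the stop signal is SIGINT or SIGTERM (check which one your app is configured to receive), and the names `install_signal_handlers` and `run_worker` are hypothetical.

```python
import signal
import threading

# Set once a stop signal has been received.
shutdown_requested = threading.Event()


def _request_shutdown(signum, frame):
    # Only record the request here; the worker loop decides when it is safe to stop.
    shutdown_requested.set()


def install_signal_handlers(signals=(signal.SIGINT, signal.SIGTERM)):
    """Register the handler; this must be called from the main thread."""
    for sig in signals:
        signal.signal(sig, _request_shutdown)


def run_worker(jobs, handle):
    """Process jobs one at a time and return the jobs left unprocessed at shutdown."""
    pending = list(jobs)
    while pending and not shutdown_requested.is_set():
        job = pending.pop(0)
        handle(job)  # the job already in flight always runs to completion
    return pending
```

A web server would do the equivalent by refusing new connections and draining open ones. Either way, anything that writes to a volume should finish and flush before the process exits.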

This could have a massive impact on anyone who tracks volume IDs for whatever reason.

Given how impactful this change will be, it should be announced by email as well, not just in a forum post.

EDIT: We received an email from customer success, though I’m not sure whether it was sent to everyone. The email doesn’t mention the changing volume IDs, but it links to this forum thread for more details.

Due to the impact, we have paused machine migration for now. We will update this thread before resuming machine migration.

Yes. We will send an email notification to affected customers.

Yes. Let us know if you track volume IDs. I can’t promise that we could keep volume IDs consistent, but I’d like to figure out the migration path.

What happens to Fly Postgres instances running this image instead of the repmgr one? GitHub - fly-apps/postgres-ha: Postgres + Stolon for HA clusters as Fly apps.

We finally sent the email! So the migration is starting soon.

Starting from 06:00 UTC June 20th, we will begin to migrate Fly Machines on these host servers to a different host server.

The Stolon-based clusters won’t be affected.

I know you’ll be starting this in 2 days, but an emergency host maintenance a couple of hours ago broke an app of mine.

This app has a single machine which is in charge of processing queues, triggering jobs, etc.

Some things started failing and it was because this app now had 3 machines running and triggering jobs thrice.

Please take this into account for the migration. If an app has a single machine you should not be creating new machines.