How we migrate your Machines

andie · June 20, 2024, 8:37pm

We posted here a few weeks ago about migrating Machines from one host to another. If you have Machines on affected hosts, we’ve also sent you one or more emails to give you a heads-up about what to expect.

The main thing is, you don’t need to take any action to migrate your Machines to another host and (if your app has multiple Machines) migration shouldn’t result in any downtime for your app. The other main thing is that we’re doing this to keep your apps and Machines healthy and ready to run when you need them.

We have a new reference doc that outlines the process of migrating Machines and shows how you can check if your Machine was migrated:

pier · June 21, 2024, 4:31pm

Thanks for the docs.

It would have been great if we could have triggered the migration to a new host ourselves. This would allow us to warn our users of any downtime, put apps in maintenance mode, pick the best time for our use case, put everyone in an “all hands on deck” situation, etc.

If after some time there are still apps that haven’t been migrated, then do it automatically.

As it is, we have no idea when it will happen. If anything goes wrong in the middle of the night we might not be able to react until hours later.

kaz · June 21, 2024, 5:49pm

Thanks for the feedback!

How do you handle machine failures in general? Ideally, losing one machine shouldn’t cause any customer impact. Hardware sometimes fails, and we generally recommend having multiple machines to handle such a case.

“If after some time there are still apps that haven’t been migrated, then do it automatically.”

What would be a reasonable timeout in your case? We can send some sort of event, but we probably can’t wait, let’s say, 30 days.

pier · June 21, 2024, 6:56pm

All of our Fly apps in production except one have multiple machines. Our downtime has been caused by other issues, not machine failure.

That app with a single machine does not respond to HTTP and the [[services]] section is empty. It basically responds to triggers in the DB and performs long-running and scheduled jobs. If it goes offline for less than a minute it’s not critical, but it’s not designed to work concurrently. See this incident from a couple of days ago.

Yeah 30 days is too much. A week maybe?

If it was up to us, we would have migrated to the new host 1-2 days after we received the email.

When the migration officially started, we stayed up until 3am thinking it would be a faster process. Even 2 days later, it seems none of our apps have been migrated.