Memory scaling on a single VM causes downtime. Intentional or a bug?

Scenario

  • Your app is running on only a single Fly VM. There are many reasons you might not be using multiple VMs; a common one is that you just moved over from another provider to Fly and your app is not (yet) ready to run on multiple VMs side by side.
  • max per region is not set. No specific release strategy is set.
  • Your app does not use any volumes.
  • You run fly scale memory <some amount> (a minimal sketch of these steps follows below).
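
For concreteness, a minimal sketch of the steps involved (the app has already been launched with a stock fly.toml; 512 MB is just an example value):

    # Starting point: one running VM, no volumes, stock fly.toml.
    $ fly status               # shows a single running instance

    # Resize the VM's memory:
    $ fly scale memory 512

    # Checking again right away shows the old instance already stopping,
    # while its replacement is still booting:
    $ fly status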

Expectation

A new instance of the app is started in the background.
Once its health checks pass, traffic is no longer forwarded to the old instance but to the new one instead.

Only now is the old instance stopped.

In other words: I expect a ‘bluegreen’/‘canary’-style deploy (with only a single VM, I believe these strategies are effectively the same).

What actually happens

The current (old) instance is immediately instructed to stop.

In other words, Fly opts for a ‘rolling’/‘immediate’-style deploy instead (with only a single VM, I believe these strategies are effectively the same).

This results in downtime. The downtime is more noticeable if the new VM takes a while to start up.


The weird thing here is that this is very different from what happens when you have two or more VMs running. In that case, Fly opts for a nice ‘canary’-style release where the existing VMs are only stopped once their replacements are ready.

But when you only have a single machine, Fly will immediately stop the single currently running VM.

Is this intentional behaviour, or is this a bug?
It seems to me that especially people new to Fly can easily get burned by this and cause unintended downtime for their freshly-migrated-to-Fly apps.

@qqwy Just to clarify, are you specifying --strategy=canary when you issue your deploy?

@shaun No, this is after an app is deployed using a plain fly deploy without any extra parameters.
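
For reference, an explicit strategy would look something like the sketch below; we did not pass any such flag, nor is a strategy pinned in our fly.toml:

    # What an explicit strategy would look like on a regular deploy
    # (we did not use this; shown only for comparison):
    $ fly deploy --strategy canary

    # Or pinned in fly.toml (again, not set in our case):
    #   [deploy]
    #     strategy = "canary"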

This behaviour is reproducible by taking any of the example apps from the documentation, with their default fly.toml, after the first fly launch / fly deploy. (You then have a cluster with a single node and all other settings at their stock values.)
Calling fly scale memory 512 at that point shuts down the running app before the new instance has finished starting up.
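
One way to observe the gap, assuming the example app answers plain HTTP GETs on its public hostname (my-example-app is a placeholder for your app name):

    # Terminal 1: poll the app once a second; failed connections print 000.
    $ while true; do
        curl -s -o /dev/null -w '%{http_code}\n' https://my-example-app.fly.dev/
        sleep 1
      done

    # Terminal 2: trigger the memory scale.
    $ fly scale memory 512

    # Terminal 1 shows non-200 responses (or 000) until the new VM is up.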

Got it. Are there volumes tied to this app?

Canary-based deploys are only supported for apps without volumes and should result in zero downtime. Apps with volumes deploy in a “rolling” fashion and will see downtime when there’s only a single VM.
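
You can double-check with something like:

    # An empty volume list (and no [mounts] section in fly.toml) means the
    # app is not volume-based and should get the canary-style deploy:
    $ fly volumes list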

I hope this clears things up!

No, there are no volumes.

After seeing this with our production app, we were able to reproduce it using the unchanged example Rails, Laravel, Crystal, and Static Website apps. (Those are the ones we tried, but we assume all of them show this behaviour.)

None of them have volumes.

I was able to reproduce the issue: VM scale operations do seem to force a rolling deploy rather than a canary one. While the rolling deploy should still be very fast in this case, it’s not zero-downtime. This is certainly something that could be improved on our end.
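
In the meantime, an untested possible workaround is to temporarily run a second VM around the scale operation, assuming your app can tolerate two instances for a short time:

    # Untested sketch of an interim workaround; assumes the app can briefly
    # run two instances side by side.
    $ fly scale count 2        # add a second VM first
    $ fly scale memory 512     # resize; the roll should leave one VM serving
    $ fly scale count 1        # return to a single VM afterwards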


Thank you for confirming that this is not intentional behaviour.

I have created this issue for it.