Memory scaling on a single VM causes downtime. Intentional or a bug?

Scenario

  • Your app is running on only a single Fly VM. There are many reasons you might not be using multiple VMs; a common one is that you just moved over from another provider to Fly and your app is not (yet) ready to run on multiple VMs side by side.
  • max per region is not set. No specific release strategy is set.
  • Your app does not use any volumes.
  • You run fly scale memory <some amount> (a minimal sketch of these steps follows below).
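
For concreteness, a minimal sketch of the steps involved (the app has already been launched with a stock fly.toml; 512 MB is just an example value):

    # Starting point: one running VM, no volumes, stock fly.toml.
    $ fly status               # shows a single running instance

    # Resize the VM's memory:
    $ fly scale memory 512

    # Checking again right away shows the old instance already stopping,
    # while its replacement is still booting:
    $ fly status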

Expectation

A new instance of the app is started in the background.
Once its health checks pass, traffic is no longer forwarded to the old instance but to the new one instead.

Only now is the old instance stopped.

In other words: I expect a ‘bluegreen’/‘canary’-style deploy (with only a single VM, I believe these strategies are effectively the same).

What actually happens

The current (old) instance is immediately instructed to stop.

In other words, Fly opts for a ‘rolling’/‘immediate’-style deploy instead (with only a single VM, I believe these strategies are effectively the same).

This results in downtime. The downtime is more noticeable if the new VM takes a while to start up.


The weird thing here is that this is very different from what happens when you have two or more VMs running. In that case, Fly opts for a nice ‘canary’-style release where the existing VMs are only stopped once their replacements are ready.

But when you only have a single machine, Fly will immediately stop the single currently running VM.

Is this intentional behaviour, or is this a bug?
It seems to me that especially people new to Fly can easily get burned by this and cause unintended downtime for their freshly-migrated-to-Fly apps.

@qqwy Just to clarify, are you specifying --strategy=canary when you issue your deploy?

@shaun No, this is after an app is deployed using a plain fly deploy without any extra parameters.
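
For reference, an explicit strategy would look something like the sketch below; we did not pass any such flag, nor is a strategy pinned in our fly.toml:

    # What an explicit strategy would look like on a regular deploy
    # (we did not use this; shown only for comparison):
    $ fly deploy --strategy canary

    # Or pinned in fly.toml (again, not set in our case):
    #   [deploy]
    #     strategy = "canary"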

This behaviour is reproducible by taking any of the example apps from the documentation, with their default fly.toml, after the first fly launch / fly deploy. (You then have a cluster with a single node and all other settings at their stock values.)
Calling fly scale memory 512 at that point shuts down the running app before the new instance has finished starting up.
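
One way to observe the gap, assuming the example app answers plain HTTP GETs on its public hostname (my-example-app is a placeholder for your app name):

    # Terminal 1: poll the app once a second; failed connections print 000.
    $ while true; do
        curl -s -o /dev/null -w '%{http_code}\n' https://my-example-app.fly.dev/
        sleep 1
      done

    # Terminal 2: trigger the memory scale.
    $ fly scale memory 512

    # Terminal 1 shows non-200 responses (or 000) until the new VM is up.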

Got it. Are there volumes tied to this app?

Canary-based deploys are only supported for apps without volumes and should result in zero downtime. Apps with volumes deploy in a “rolling” fashion and will see downtime when there’s only a single VM.
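
You can double-check with something like:

    # An empty volume list (and no [mounts] section in fly.toml) means the
    # app is not volume-based and should get the canary-style deploy:
    $ fly volumes list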

I hope this clears things up!

No, there are no volumes.

After seeing this with our production app, we were able to reproduce it using the unchanged example Rails, Laravel, Crystal, and Static Website apps. (Those are the ones we tried, but we assume all of them show this behaviour.)

None of them have volumes.

I was able to reproduce the issue: VM scale operations do seem to force a rolling deploy rather than a canary one. While the rolling deploy should still be very fast in this case, it’s not zero-downtime. This is certainly something that could be improved on our end.
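
In the meantime, an untested possible workaround is to temporarily run a second VM around the scale operation, assuming your app can tolerate two instances for a short time:

    # Untested sketch of an interim workaround; assumes the app can briefly
    # run two instances side by side.
    $ fly scale count 2        # add a second VM first
    $ fly scale memory 512     # resize; the roll should leave one VM serving
    $ fly scale count 1        # return to a single VM afterwards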


Thank you for confirming that this is not intentional behaviour.

I have created this issue for it.