We’ve been running bluegreen deploys successfully for a while on an app that’s deployed as 3 separate Fly apps: prod, and two sandboxes. Since we only need the production instance to be available at all times, we’ve recently turned on `auto_stop_machines = "suspend"` and `min_machines_running = 0` on the other two. We deploy frequently during the day through GitHub Actions.
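For reference, the relevant part of the sandbox apps’ `fly.toml` looks roughly like this (the section layout and port are illustrative; the two settings named above are the ones that matter):

```toml
# fly.toml for the two sandbox apps (sketch; port is a placeholder)
[http_service]
  internal_port = 3000             # placeholder, not the real port
  auto_stop_machines = "suspend"   # suspend idle machines instead of stopping them
  min_machines_running = 0         # allow every machine to be suspended
```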
Reading Snapshot behavior with suspend, we expected it to be safe to deploy when all machines for an app are suspended (there would be new `stopped` machines instead). However, these two apps (`prereview-sandbox` and `prereview-translate`; both with 2 machines in `iad`) are now frequently seeing broken deploys:
```
Creating green machines
  Created machine 0802430f5791d8 [app]
  Created machine e82547ef303668 [app]
Waiting for all green machines to start
  Machine 0802430f5791d8 [app] - created
  Machine e82547ef303668 [app] - created
  Machine 0802430f5791d8 [app] - started
  Machine e82547ef303668 [app] - created
  Machine 0802430f5791d8 [app] - started
  Machine e82547ef303668 [app] - started
Waiting for all green machines to be healthy
  Machine 0802430f5791d8 [app] - unchecked
  Machine e82547ef303668 [app] - unchecked
  Machine 0802430f5791d8 [app] - 0/1 passing
  Machine e82547ef303668 [app] - 0/1 passing
  Machine 0802430f5791d8 [app] - 1/1 passing
  Machine e82547ef303668 [app] - 0/1 passing
  Machine 0802430f5791d8 [app] - 1/1 passing
  Machine e82547ef303668 [app] - 1/1 passing
Marking green machines as ready
  Machine 0802430f5791d8 [app] now ready
  Machine e82547ef303668 [app] now ready
Checkpointing deployment, this may take a few seconds...
Waiting before cordoning all blue machines
  Failed to cordon machine 2860d40fe19598 [app]: failed to cordon VM: aborted: machine not in proper state to perform cordon operation
  Failed to cordon machine 6839523a736d98 [app]: failed to cordon VM: aborted: machine not in proper state to perform cordon operation
Waiting before stopping all blue machines
Stopping all blue machines
  Failed to stop machine 2860d40fe19598 [app]: failed to stop VM 2860d40fe19598: aborted: unable to stop machine, current state invalid, starting
  Failed to stop machine 6839523a736d98 [app]: failed to stop VM 6839523a736d98: aborted: unable to stop machine, current state invalid, starting
Waiting for all blue machines to stop
  Machine 2860d40fe19598 [app] - started
  Machine 6839523a736d98 [app] - started
Error: wait timeout
could not get all blue machines into stopped state
```
This leaves both the blue and green machines running, and manual intervention is then needed before deploys will work again:
```
Found 2 different images in your app (for bluegreen to work, all machines need to run a single image)
  [x] prereview:[green-image-tag] - 2 machine(s) (e82547ef303668,0802430f5791d8)
  [x] prereview:[blue-image-tag] - 2 machine(s) (2860d40fe19598,6839523a736d98)

These image(s) can be safely destroyed:
  [x] prereview:[blue-image-tag] - 2 machine(s) ('fly machines destroy --force --image=prereview:[blue-image-tag]')

Here's how to fix your app so deployments can go through:
  1. Find all the unwanted image versions from the list above.
     Use 'fly machines list' and 'fly releases --image' to help determine unwanted images.
  2. For each unwanted image version, run 'fly machines destroy --force --image=<insert-image-version>'
  3. Retry the deployment with 'fly deploy'
Error: found multiple image versions
```
Looking at the timestamps, I think it happens when all the blue machines are `suspended`.
Have I misunderstood something, or is this a bug?
I’ve been tempted to add a step beforehand to ensure at least one blue machine is started (so it becomes `suspended` after the deploy rather than `stopped`). I believe this would also mitigate the problem.