Bluegreen deploys seem to break when machines are starting

We’ve been running bluegreen deploys successfully for a while on an app that’s deployed as 3 separate Fly apps: prod and two sandboxes. Since only the production instance needs to be available at all times, we’ve recently set auto_stop_machines = "suspend" and min_machines_running = 0 on the other two. We deploy frequently during the day through GitHub Actions.
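
i.e. roughly this in each sandbox’s fly.toml (the exact section placement here is from memory; the values are as above):

  [http_service]
    # sandbox apps only; prod stays always-on
    auto_stop_machines = "suspend"
    min_machines_running = 0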

Having read Snapshot behavior with suspend, we expected it to be safe to deploy while all of an app’s machines are suspended (we’d just end up with new stopped machines instead).

However, these two apps (prereview-sandbox and prereview-translate; both with 2 machines in iad) are now frequently seeing broken deploys:

Creating green machines
  Created machine 0802430f5791d8 [app]
  Created machine e82547ef303668 [app]

Waiting for all green machines to start
  Machine 0802430f5791d8 [app] - created
  Machine e82547ef303668 [app] - created
  Machine 0802430f5791d8 [app] - started
  Machine e82547ef303668 [app] - created
  Machine 0802430f5791d8 [app] - started
  Machine e82547ef303668 [app] - started

Waiting for all green machines to be healthy
  Machine 0802430f5791d8 [app] - unchecked
  Machine e82547ef303668 [app] - unchecked
  Machine 0802430f5791d8 [app] - 0/1 passing
  Machine e82547ef303668 [app] - 0/1 passing
  Machine 0802430f5791d8 [app] - 1/1 passing
  Machine e82547ef303668 [app] - 0/1 passing
  Machine 0802430f5791d8 [app] - 1/1 passing
  Machine e82547ef303668 [app] - 1/1 passing

Marking green machines as ready
  Machine 0802430f5791d8 [app] now ready
  Machine e82547ef303668 [app] now ready

Checkpointing deployment, this may take a few seconds...

Waiting before cordoning all blue machines
  Failed to cordon machine 2860d40fe19598 [app]: failed to cordon VM: aborted: machine not in proper state to perform cordon operation
  Failed to cordon machine 6839523a736d98 [app]: failed to cordon VM: aborted: machine not in proper state to perform cordon operation

Waiting before stopping all blue machines

Stopping all blue machines
  Failed to stop machine 2860d40fe19598 [app]: failed to stop VM 2860d40fe19598: aborted: unable to stop machine, current state invalid, starting
  Failed to stop machine 6839523a736d98 [app]: failed to stop VM 6839523a736d98: aborted: unable to stop machine, current state invalid, starting

Waiting for all blue machines to stop
  Machine 2860d40fe19598 [app] - started
  Machine 6839523a736d98 [app] - started
Error: wait timeout
could not get all blue machines into stopped state

This leaves both the blue and the green machines running, and manual intervention is then needed before deploys work again:

  Found 2 different images in your app (for bluegreen to work, all machines need to run a single image)
    [x] prereview:[green-image-tag] - 2 machine(s) (e82547ef303668,0802430f5791d8)
    [x] prereview:[blue-image-tag] - 2 machine(s) (2860d40fe19598,6839523a736d98)

  These image(s) can be safely destroyed:
    [x] prereview:[blue-image-tag] - 2 machine(s) ('fly machines destroy --force --image=prereview:[blue-image-tag]')

  Here's how to fix your app so deployments can go through:
    1. Find all the unwanted image versions from the list above.
       Use 'fly machines list' and 'fly releases --image' to help determine unwanted images.
    2. For each unwanted image version, run 'fly machines destroy --force --image=<insert-image-version>'
    3. Retry the deployment with 'fly deploy'

Error: found multiple image versions

Looking at the timestamps, I think it happens when all the blue machines are suspended.

Have I misunderstood something, or is this a bug?

I’ve been tempted to add a step beforehand to ensure at least one blue machine is started (so it becomes suspended after the deploy rather than stopped); something like the sketch below. I believe this would also mitigate this problem.
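
Something like this as a step in the workflow before fly deploy (a rough sketch; I’m assuming the Machines API JSON exposes id and state fields, and the app name would come from the environment):

  # Wake a machine if none are currently running, so the deploy never starts
  # from an all-suspended app.
  app=prereview-sandbox
  running=$(fly machines list --app "$app" --json | jq '[.[] | select(.state == "started")] | length')
  if [ "$running" -eq 0 ]; then
    fly machine start "$(fly machines list --app "$app" --json | jq -r '.[0].id')" --app "$app"
  fi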

Hey @thewilkybarkid1, that’s definitely unusual, as deploying an app with machines in a suspended state is supported. As you mentioned, it should deploy the app and create the new machines in a stopped state, with the snapshot being discarded.

Taking a look at your deployment logs, it looks like it’s actually not the suspended state that’s causing the problem. From the Stop operation logs, the machines are actually in a starting state:

Failed to stop machine 2860d40fe19598 [app]: failed to stop VM 2860d40fe19598: aborted: unable to stop machine, current state invalid, starting

The Cordon operation doesn’t log the current state, but the error it returns is consistent with what it would return for a machine in a starting state. The starting state is transitional, so a machine should only be in it for a second or two at most.

Are you still seeing these errors with the new machines? If you do see the error again, check the machine events in fly machine status <machine-id>. Do they show the machine sitting in a starting state for a long time?
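
For example, something like this will show the event history for one of the machines from your deploy output (swap in whichever app it actually belongs to):

  fly machine status 2860d40fe19598 --app prereview-translate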

@thewilkybarkid1 It looks like one machine in each app got wedged in a starting state. If you query your app logs, you’ll see a bunch of messages like this from one machine in each:

  [PM01] machines API returned an error: “machine still attempting to start”

That would indicate the machine is starting, but not able to transition out. It looks like all the machines in your apps that threw these have been destroyed, so I’d expect a new deploy to work.

The logs on our end for the last time those machines were stopped have rotated out so I can’t see if anything odd happened when those machines last stopped. I’m speculating, but given it was a suspended machine, I wonder if it had some kind of unclean exit, and the unhealthy state got snapshotted, blocking the start.

If you see it again, running a manual fly machine stop against those machines might get them out of that state; otherwise, use fly machine destroy (or fly machine destroy --force) to destroy the wedged machine, then clone a new one from an existing machine.
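
Roughly, using the IDs and app name from this thread (adjust to whichever machine is actually wedged):

  # Try a plain stop first; it may be enough to get the machine out of "starting".
  fly machine stop 2860d40fe19598 --app prereview-translate

  # If that doesn't work, destroy the wedged machine and clone a healthy one to replace it.
  fly machine destroy 2860d40fe19598 --force --app prereview-translate
  fly machine clone 6839523a736d98 --region iad --app prereview-translate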

Thanks for looking into it, @Sam-Fly!

Taking a look at your deployment logs, it looks like it’s actually not the suspended state that’s causing the problem. From the Stop operation logs, the machines are actually in a starting state

:person_facepalming: for me misreading the output. :sweat_smile:

The logs on our end for the last time those machines were stopped have rotated out so I can’t see if anything odd happened when those machines last stopped. I’m speculating, but given it was a suspended machine, I wonder if it had some kind of unclean exit, and the unhealthy state got snapshotted, blocking the start.

Hurray for fly-log-shipper, as the logs are still in our Grafana Cloud.

I’ve searched for messages related to 2860d40fe19598, and here’s what I could find that looks relevant (all times BST):

On Sunday the proxy tried to start the machine:

2025-09-14 03:26:10.699 {"event":{"provider":"proxy"}, "message":"Starting machine"}

Other times it has started within 200–300ms, but this time:

2025-09-14 03:26:15.700 {"event":{"provider":"proxy"}, "message":"[PM03] could not wake up machine due to a timeout requesting from the machines API"}

But the app has already logged to suggest it’s working:

2025-09-14 03:26:13.581 {"event":{"provider":"app"}, "message":"Redis connected"}

It eventually recognises it has started:

2025-09-14 03:28:15.601 {"event":{"provider":"runner"}, "message":"Machine started in 2m4.892s"}

And the app logs show it’s handling HTTP requests:

2025-09-14 03:32:05.738 {"event":{"provider":"app"}, "message":"Received HTTP request"}

It’s then suspended ok:

2025-09-14 03:35:01.133 {"event":{"provider":"proxy"},"message":"App prereview-translate has excess capacity, autosuspending machine 2860d40fe19598. 1 out of 2 machines left running (region=iad, process group=app)"}

2025-09-14 03:35:02.493 {"event":{"provider":"app"},"message":"Virtual machine has been suspended"}

The next time it starts up, the app logs again shortly after:

2025-09-14 03:41:40.678 {"event":{"provider":"proxy"},"message":"Starting machine"}

2025-09-14 03:41:41.925 {"event":{"provider":"app"},"message":"Redis connected"}

However, it doesn’t seem to be recognised as started:

2025-09-14 03:41:45.679 {"event":{"provider":"proxy"},"message":"[PM03] could not wake up machine due to a timeout requesting from the machines API"}

The proxy keeps retrying the start:

2025-09-14 03:41:50.898 {"event":{"provider":"proxy"},"message":"Starting machine"}

2025-09-14 03:48:56.028 {"event":{"provider":"proxy"},"message":"Starting machine"}

2025-09-14 03:57:39.928 {"event":{"provider":"proxy"},"message":"Starting machine"}

2025-09-14 03:57:39.929 {"event":{"provider":"proxy"},"message":"[PM01] machines API returned an error: \"machine still attempting to start\""}

It keeps repeating this, occasionally seeing:

2025-09-14 04:22:21.753 {"event":{"provider":"proxy"},"message":"[PM01] machines API returned an error: \"rate limit exceeded\""}

and also log messages from the app showing it’s still running:

2025-09-14 05:31:35.492 {"event":{"provider":"app"},"message":"Redis connection error"}

2025-09-14 05:31:35.545 {"event":{"provider":"app"},"message":"Redis connected"}

There are no HTTP requests logged by the app, presumably because it isn’t being routed any:

2025-09-14 07:51:49.090 {"event":{"provider":"proxy"},"message":"[PR03] could not find a good candidate within 2 attempts at load balancing. last error: [PM01] machines API returned an error: \"machine still attempting to start\""}

There are no log messages relating to the original bluegreen deploy; eventually my fly machine destroy causes the app to be stopped and the machine destroyed:

2025-09-15 10:24:17.152 {"event":{"provider":"app"},"message":"\u001b[32m INFO\u001b[0m Sending signal SIGINT to main child process w/ PID 637"}

So there is no Machine started in ... message for the start at 2025-09-14 03:41:40.678 (the one after the earlier timed-out attempt, which took 2m4.892s). It’s stuck in the starting state, even though the app seems to be running OK inside.

Hope that’s useful. :sweat_smile:

@Sam-Fly Looks like it’s happened again: machine 6e82744dc65638 has been stuck starting since it was woken up at 2025-09-20 04:43:52 UTC, and a bluegreen deployment has now left both sets of machines running. I’ve left them for now since it’s non-production, but I’ll have to destroy them tomorrow morning UK time to unblock deployments.