bluegreen deploy reported success but destroyed BOTH green and blue machines → 0 machines

Summary: A flyctl deploy with the bluegreen strategy ran a textbook-looking deployment (created green machines, marked them healthy/ready, then “Destroying all blue machines”), yet the cluster ended with 0 machines. flyctl exited 0 (reported success); our own post-deploy machine-count check was the only thing that caught it. The green machines flyctl had just created and marked ready were destroyed during the same run as blue. The site was down until we manually ran fly scale count 2.

It’s intermittent: we’ve since tried hard to reproduce it on a disposable test app (slow-to-become-healthy green, stacked deploys, killed-mid-bluegreen deploys, post-ready health flapping) — ~7 variants — and could not trigger this specific “both colours destroyed” outcome. So we’re hoping you can correlate it server-side.

App: growthnation
Config: region lhr, [deploy] strategy = "bluegreen", min_machines_running = 2, auto_stop_machines = "off", kill_timeout = "180s", one HTTP health check on /api/health (grace_period = "15s", interval = "10s")
flyctl: v0.4.x (CI installs latest via superfly/flyctl-actions/setup-flyctl@master)
When: 2026-05-26, ~23:33–23:38 UTC
Deployment image: deployment-01KSK9WNHGNF5NFGWJNKNH0896

Machine event timeline (from fly machine status on each machine):

GREEN — created by THIS deploy, marked ready, then destroyed:
  18592e1c6d4708   launch 23:33:22 → started 23:33:59 → uncordon/ready 23:34:17 → DESTROYED 23:37:20
  d896d97c09e938   launch 23:33:23 → started 23:34:00 → uncordon/ready 23:34:17 → DESTROYED 23:37:13

BLUE — pre-existing, destroyed by "Destroying all blue machines":
  2873519a590368   destroyed 23:37:42
  7849201c945948   destroyed 23:37:42

flyctl deploy log (key lines):

23:33:20  Updating existing machines in 'growthnation' with bluegreen strategy
23:33:20  Creating green machines
23:34:05  Waiting for all green machines to be healthy
23:34:17  Marking green machines as ready
23:37:41  Destroying all blue machines
23:37:42    Machine 2873519a590368 [app] destroyed
23:37:42    Machine 7849201c945948 [app] destroyed
          (deploy step exits 0 — reported success)
23:38:45  [our post-deploy check] "No machines are available on this app growthnation"

The puzzle: the green machines were marked ready at 23:34:17, then destroyed at 23:37:13–20, before the “Destroying all blue machines” step at 23:37:42. So both the freshly-deployed green AND the old blue were torn down, leaving zero.

We have ruled out our own CI as the cause:

  • We use GitHub Actions (push: main) with concurrency: { group: deploy-main, cancel-in-progress: false }, so deploy jobs serialise. This deploy’s job ran 23:25–23:38; the next deploy’s job didn’t start until 23:38:51 — no overlap.
  • No deploy run was cancelled while running. (We audited every cancelled deploy run in our history — all were cancelled while pending, before the job started, so none ever interrupted a running flyctl deploy.)
  • So this deploy ran to completion, uninterrupted, on the latest flyctl — and still self-destructed the cluster.

Questions:

  1. Under what conditions can a bluegreen deploy destroy the green machines it has just marked ready, in the same run, alongside blue?
  2. Is there a known race around the “marking green ready” → “destroying blue” transition where green can be torn down — e.g. if a green health check flaps unhealthy in that window?
  3. Can you correlate this server-side from the app name + deployment image + machine IDs + timestamps above? flyctl printed no Trace ID (the run “succeeded”), so we don’t have one to give you.

Happy to provide anything else. We’ve added post-deploy machine-count auto-recovery on our side so it self-heals now, but we’d like to understand the root cause so we can stop papering over it.
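
For anyone wanting a similar safety net, here is a minimal sketch of the kind of post-deploy machine-count check described above. It is not our exact script. It assumes the machine list is available as a JSON array of objects with a "state" field (for example from fly machine list --json), and the helper names count_started and recovery_command are hypothetical. The --yes flag on fly scale count is also an assumption, so check it against your flyctl version.

```python
import json


def count_started(machines_json: str) -> int:
    """Count machines whose state is 'started'.

    Assumes the input is a JSON array of machine objects that each carry
    a "state" field (an assumption about the CLI's JSON output shape).
    """
    machines = json.loads(machines_json)
    return sum(1 for m in machines if m.get("state") == "started")


def recovery_command(app: str, started: int, minimum: int):
    """Return the scale command to run if too few machines are up, else None."""
    if started >= minimum:
        return None
    # Mirrors the manual fix from this incident: fly scale count 2.
    # "--yes" (skip confirmation) is assumed to exist on this subcommand.
    return ["fly", "scale", "count", str(minimum), "--app", app, "--yes"]
```

In CI, the JSON string would come from the CLI and a non-None result would be passed to something like subprocess.run. Keeping the decision logic separate from the CLI call makes it easy to test without touching a real app.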

Hi Ben,

Under what conditions can a bluegreen deploy destroy the green machines it has just marked ready, in the same run, alongside blue?

This shouldn’t happen under any circumstances. If the subsequent operations with blue machines (cordon, destroy) fail, flyctl should just exit and leave all machines in place.

Is there a known race around the “marking green ready” → “destroying blue” transition where green can be torn down — e.g. if a green health check flaps unhealthy in that window?

No. Also, given the time that passed (3 minutes), it doesn’t sound like a race condition, which usually exhibits much closer timing. That 3-minute gap matches the value you configured for kill_timeout (180s), and the logs for one of the destroyed machines show that it was gracefully stopped and then destroyed, which is not something that should happen as a result of a bg deploy.
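
The timing observation can be checked with a little arithmetic on the timestamps from the original post. The sketch below subtracts the configured kill_timeout from each green machine's destroy time to get an implied stop-request time, then compares it with the ready mark. It assumes each machine waited out the full kill_timeout before being destroyed, which the quoted timeline alone does not prove. The function name implied_stop_offsets is hypothetical.

```python
from datetime import datetime, timedelta

KILL_TIMEOUT = timedelta(seconds=180)  # kill_timeout = "180s" from the app config

# Timestamps (UTC) copied from the machine event timeline in the original post.
ready = datetime(2026, 5, 26, 23, 34, 17)
destroyed = {
    "18592e1c6d4708": datetime(2026, 5, 26, 23, 37, 20),
    "d896d97c09e938": datetime(2026, 5, 26, 23, 37, 13),
}


def implied_stop_offsets(ready, destroyed, timeout):
    """Seconds from 'ready' to the implied stop request (destroy time minus timeout).

    Assumption: the machine used the whole timeout before it was destroyed.
    """
    return {
        machine_id: int(((when - timeout) - ready).total_seconds())
        for machine_id, when in destroyed.items()
    }


offsets = implied_stop_offsets(ready, destroyed, KILL_TIMEOUT)
print(offsets)  # {'18592e1c6d4708': 3, 'd896d97c09e938': -4}
```

Under that assumption, the stop requests would have landed within about four seconds of the moment the green machines were marked ready (23:34:17), which may help narrow down what to look for around that timestamp.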

Can you correlate this server-side from the app name + deployment image + machine IDs + timestamps above? flyctl printed no Trace ID (the run “succeeded”), so we don’t have one to give you.

The logs I have show what happened (the machine stopped, then was destroyed) but don’t show who triggered this action or why. It’s not the Fly platform: the bg deployment is entirely driven by flyctl client-side, so that won’t be logged. The most likely explanation is that something else stopped/destroyed these machines. Are you running parallel deploys (e.g. conflicting CI jobs)? Did someone else run a manual deploy? (Both are unlikely; the symptoms would be different.) Do you use some kind of auto-scaler process that creates/destroys machines based on metrics, request load, etc.?

Thanks for your reply!

Very useful context.

We have a number of angles we are investigating based on this:

  • Graceful shutdown handler / signal / timeout
  • GitHub Actions concurrency had ‘cancel in progress’ set, which may have led to a run being cancelled and the next one executing immediately
  • etc
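
For the signal angle, one way to see whether and when a machine was asked to stop is to log every stop signal the app process receives. Below is a minimal sketch in Python, purely as an illustration: the app's actual language and shutdown code are not shown in this thread, and the names received, record_signal and install_signal_logging are hypothetical. It assumes the stop request reaches the process as SIGINT or SIGTERM.

```python
import signal
import time

received = []  # (signal name, unix timestamp) pairs, in arrival order


def record_signal(signum, frame):
    """Record which signal arrived and when, then log it with a UTC time."""
    name = signal.Signals(signum).name
    now = time.time()
    received.append((name, now))
    stamp = time.strftime("%H:%M:%S", time.gmtime(now))
    print(f"received {name} at {stamp} UTC", flush=True)


def install_signal_logging():
    """Install the logging handler for the signals assumed to mean 'stop'."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, record_signal)


install_signal_logging()
```

A real handler would start a graceful shutdown after logging. This sketch only records the signal, so the timestamps can be lined up against the machine event timeline.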