Summary: A flyctl deploy with the bluegreen strategy ran what looked like a textbook deployment: it created green machines, marked them healthy/ready, then printed “Destroying all blue machines”. Yet the app ended with 0 machines. flyctl exited 0 (reported success); our own post-deploy machine-count check was the only thing that caught it. The green machines flyctl had just created and marked ready were destroyed in the same run as blue. The site was down until we manually ran fly scale count 2.
It’s intermittent: we’ve since tried hard to reproduce it on a disposable test app (slow-to-become-healthy green, stacked deploys, killed-mid-bluegreen deploys, post-ready health flapping) — ~7 variants — and could not trigger this specific “both colours destroyed” outcome. So we’re hoping you can correlate it server-side.
App: growthnation
Config: region lhr, [deploy] strategy = "bluegreen", min_machines_running = 2, auto_stop_machines = "off", kill_timeout = "180s", one HTTP health check on /api/health (grace_period = "15s", interval = "10s")
flyctl: v0.4.x (CI installs latest via superfly/flyctl-actions/setup-flyctl@master)
When: 2026-05-26, ~23:33–23:38 UTC
Deployment image: deployment-01KSK9WNHGNF5NFGWJNKNH0896
Machine event timeline (from fly machine status on each machine):
GREEN — created by THIS deploy, marked ready, then destroyed:
18592e1c6d4708 launch 23:33:22 → started 23:33:59 → uncordon/ready 23:34:17 → DESTROYED 23:37:20
d896d97c09e938 launch 23:33:23 → started 23:34:00 → uncordon/ready 23:34:17 → DESTROYED 23:37:13
BLUE — pre-existing, destroyed by "Destroying all blue machines":
2873519a590368 destroyed 23:37:42
7849201c945948 destroyed 23:37:42
flyctl deploy log (key lines):
23:33:20 Updating existing machines in 'growthnation' with bluegreen strategy
23:33:20 Creating green machines
23:34:05 Waiting for all green machines to be healthy
23:34:17 Marking green machines as ready
23:37:41 Destroying all blue machines
23:37:42 Machine 2873519a590368 [app] destroyed
23:37:42 Machine 7849201c945948 [app] destroyed
(deploy step exits 0 — reported success)
23:38:45 [our post-deploy check] "No machines are available on this app growthnation"
The puzzle: the green machines were marked ready at 23:34:17, then destroyed at 23:37:13 and 23:37:20, before the “Destroying all blue machines” step was logged at 23:37:41 (blue was destroyed at 23:37:42). So both the freshly-deployed green AND the old blue were torn down, leaving zero.
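To make the ordering explicit, here is a small Python sketch that recomputes the intervals from the timestamps quoted above. The helper names (secs, green_gaps) are ours and purely illustrative; the times are copied from the machine events and the deploy log, and all fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S"

def secs(later: str, earlier: str) -> int:
    """Whole seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return int(delta.total_seconds())

# Values copied from the timeline and deploy log in this report.
GREEN_READY = "23:34:17"
GREEN_DESTROYED = {
    "18592e1c6d4708": "23:37:20",
    "d896d97c09e938": "23:37:13",
}
BLUE_TEARDOWN_STEP = "23:37:41"  # "Destroying all blue machines"

def green_gaps() -> dict[str, tuple[int, int]]:
    """Per green machine: (seconds alive after ready, seconds destroyed before the blue step)."""
    return {
        machine_id: (secs(t, GREEN_READY), secs(BLUE_TEARDOWN_STEP, t))
        for machine_id, t in GREEN_DESTROYED.items()
    }

for machine_id, (after_ready, before_blue) in green_gaps().items():
    print(f"{machine_id}: destroyed {after_ready}s after ready, "
          f"{before_blue}s before the blue teardown step")
```

Both green machines were gone roughly three minutes after being marked ready, and 21 to 28 seconds before flyctl logged the blue teardown step.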
We have ruled out our own CI as the cause:
- We use GitHub Actions (push: main) with concurrency: { group: deploy-main, cancel-in-progress: false }, so deploy jobs serialise. This deploy’s job ran 23:25–23:38; the next deploy’s job didn’t start until 23:38:51, so there was no overlap.
- No deploy run was cancelled while running. (We audited every cancelled deploy run in our history: all were cancelled while pending, before the job started, so none ever interrupted a running flyctl deploy.)
- So this deploy ran to completion, uninterrupted, on the latest flyctl, and still self-destructed the cluster.
Questions:
- Under what conditions can a bluegreen deploy destroy the green machines it has just marked ready, in the same run, alongside blue?
- Is there a known race around the “marking green ready” → “destroying blue” transition where green can be torn down — e.g. if a green health check flaps unhealthy in that window?
- Can you correlate this server-side from the app name + deployment image + machine IDs + timestamps above? flyctl printed no Trace ID (the run “succeeded”), so we don’t have one to give you.
Happy to provide anything else. We’ve added post-deploy machine-count auto-recovery on our side so it self-heals now, but we’d like to understand the root cause so we can stop papering over it.
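For anyone who wants a similar safety net, a post-deploy check of this kind can be sketched roughly as follows. This is a simplified illustration, not our exact script. It assumes that fly machines list --json prints a JSON array of objects with a "state" field, that non-JSON output (such as the “No machines are available” message) means zero machines, and that fly scale count accepts --yes to skip the prompt. The function names are hypothetical.

```python
import json
import subprocess

def count_started(machines: list[dict]) -> int:
    """Count machines whose reported state is 'started'."""
    return sum(1 for m in machines if m.get("state") == "started")

def list_machines(app: str) -> list[dict]:
    """Ask flyctl for the app's machines; treat non-JSON output as 'no machines'."""
    result = subprocess.run(
        ["fly", "machines", "list", "--app", app, "--json"],
        capture_output=True, text=True, check=True,
    )
    try:
        parsed = json.loads(result.stdout)
    except json.JSONDecodeError:
        # Assumption: a plain-text message here means the app has no machines.
        return []
    return parsed if isinstance(parsed, list) else []

def recover_if_short(app: str, minimum: int = 2) -> bool:
    """Scale back up if fewer than `minimum` machines are started.

    Returns True if a recovery was triggered, False otherwise.
    """
    if count_started(list_machines(app)) >= minimum:
        return False
    subprocess.run(
        ["fly", "scale", "count", str(minimum), "--app", app, "--yes"],
        check=True,
    )
    return True
```

Calling recover_if_short("growthnation") as the last CI step, after flyctl deploy returns, is the intended use. It only masks the symptom, which is why we are still asking about the root cause.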