Machine stuck in "created" for 2h20m before booting (IAD), sibling machines started in seconds

Hi all, I'm looking for help understanding a Machine that sat in `created` for ~2h20m before it finally booted, while sibling Machines created later in the same window (same app, same region) started within ~14 seconds to ~2 minutes. I'd like to understand the cause and the right way to detect and recover from it.

## Environment
- **App:** `zeroclaw-executor-prod`
- **Region:** `iad`
- Machines are created on demand via the Machines API (`POST /v1/apps/{app}/machines`), one per job. We do **not** pass `skip_launch`, so we expect Fly to pull the image and auto-start the Machine on create.
- Per-machine config (env redacted):
  ```json
  {
    "region": "iad",
    "config": {
      "image": "<our executor image>",
      "guest": { "cpus": 4, "memory_mb": 32768, "cpu_kind": "performance" },
      "auto_destroy": true,
      "restart": { "policy": "no" },
      "env": { "...redacted..." }
    }
  }
  ```

## What happened
The Machine was created successfully (the API returned a machine id immediately and it sat in `created`), but it did **not** transition to `started` for ~2h20m. It then booted on its own and ran to completion normally.

- **Machine ID:** `85e209f4471518`
- **2026-06-22 14:58:49 UTC** — `POST /machines` succeeded; Machine in `created`.
- **2026-06-22 17:19:31 UTC** — Machine finally booted; our process started for the first time.
- **2026-06-22 18:14:22 UTC** — workload finished and the Machine `auto_destroy`ed cleanly (`exit_code=0`).

➡️ **~2h20m42s stuck in `created`** before the first boot. Once it booted, it worked fine.

## Control group (same app + region, around the same time)
These started normally, which is why I believe the problem was isolated to that one Machine rather than an app/region/account-wide issue:

| Machine (job) | Created (UTC) | First boot (UTC) | Time to boot |
|---|---|---|---|
| `85e209f4471518` | 14:58:49 | **17:19:31** | **~2h20m** |
| sibling A | 15:19:35 | 15:21:34 | ~2m |
| sibling B | 16:09:31 | 16:09:45 | **~14s** |
| sibling C (deploy) | 16:51:57 | 16:53:24 | ~1.5m |

Same region (`iad`), same app, same image, same guest size; only `85e209f4471518` stalled.

## Logs (application side)
```
14:58:37  Creating executor Machine for project=… (12 issues)
14:58:49  Created executor Machine 85e209f4471518 (exec-…-40323) — expecting Fly to auto-start after image pull
          … no boot for ~2h20m …
17:19:31  Executor process starting (first start inside the machine)
18:14:22  Fly Machine 85e209f4471518 reached terminal state=destroyed exit_code=0
```
We only have application-side logs; we don't have the Machine's internal state-transition events because it was `auto_destroy`ed. The machine id + app + timestamps above should let you look up the host/placement events on your side.

## Questions
1. What can cause a Machine to remain in `created` for 2h+ before booting, then start on its own? Stuck image pull, placement/capacity wait, or a host-level issue?
2. Could anything in our create call (`auto_destroy: true`, `restart.policy: "no"`, performance-4x / 32 GB guest) delay placement?
3. Is there a supported way to detect that a `created → started` transition is stalling (an event/state we can poll), so we can destroy + recreate instead of waiting? Today we poll `GET /machines/{id}` `state`, but a Machine stuck in `created` looks "in progress," so we can't tell "still pulling image" from "stuck."
4. Was there any known placement/capacity event in `iad` around **2026-06-22 14:58–17:19 UTC**?

Thanks in advance — happy to provide more detail.

Hi! The Machine gets created but isn't ready to boot until its rootfs is prepared, which requires pulling the image from the registry and unpacking it on disk. If these operations are slow, that can explain the 2h delay, although 2h is abnormal and excessive. We're looking at this and a couple of other reports of Machine creation slowness in iad today so we can fix it. In the meantime, you can use a different region, or add a bit of logic to time out more quickly and fall back to creating another Machine.

Nothing in the create call would have caused this delay - it was 100% slow machine preparation.

There's no real way to detect a stall here. Instead, you can call the wait API endpoint with the time you're willing to wait (https://fly.io/docs/machines/api/machines-resource/#wait-for-a-machine-to-reach-a-specified-state). If it times out, run `fly machine destroy --force` on the Machine and create a new one. If your Machine creates are time-sensitive, you can combine this logic with a pool of pre-created Machines so the impact of slow creates falls outside your critical path.
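
The timeout-and-recreate approach described above could be sketched roughly as follows. This is a minimal illustration, not an official client. It assumes the public base URL `https://api.machines.dev/v1`, that the wait endpoint takes a `timeout` in seconds (capped at about 60 per call, hence the loop) and answers `408` when the state is not reached, and that `DELETE ...?force=true` is the HTTP equivalent of `fly machine destroy --force`. The function names, the 120-second budget, and the retry count are all hypothetical; check the linked docs before relying on any of them.

```python
import json
import time
import urllib.error
import urllib.request

# Assumption: public Machines API base URL.
API = "https://api.machines.dev/v1"


def api_call(method, path, token, body=None, timeout=90):
    """Send one request; return (HTTP status, parsed JSON body or None)."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(API + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            raw = resp.read()
            return resp.status, (json.loads(raw) if raw else None)
    except urllib.error.HTTPError as err:
        return err.code, None


def wait_started(app, machine_id, token, budget_s, slice_s=60,
                 call=api_call, clock=time.monotonic, sleep=time.sleep):
    """Call the wait endpoint in slices until `started` or budget_s elapses."""
    deadline = clock() + budget_s
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        t = max(1, min(slice_s, int(remaining)))
        path = f"/apps/{app}/machines/{machine_id}/wait?state=started&timeout={t}"
        status, _ = call("GET", path, token)
        if status == 200:
            return True
        if status != 408:
            sleep(1)  # unexpected status: pause briefly so the loop does not spin


def create_with_fallback(app, config, token, budget_s=120, max_attempts=3,
                         call=api_call, clock=time.monotonic, sleep=time.sleep):
    """Create a Machine; if it does not start within budget_s, destroy and retry."""
    for _ in range(max_attempts):
        status, machine = call("POST", f"/apps/{app}/machines", token, body=config)
        if not (200 <= status < 300) or not machine:
            raise RuntimeError(f"create failed with HTTP {status}")
        machine_id = machine["id"]
        if wait_started(app, machine_id, token, budget_s,
                        call=call, clock=clock, sleep=sleep):
            return machine_id
        # Stalled in `created`: force-destroy it and try a fresh Machine.
        call("DELETE", f"/apps/{app}/machines/{machine_id}?force=true", token)
    raise TimeoutError(f"no Machine started after {max_attempts} attempts")
```

The `call`, `clock`, and `sleep` parameters are injectable only so the retry logic can be exercised without touching the network. A region switch could be layered on by changing `config["region"]` between attempts.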

Let me know if this helps!

Thanks for confirming — good to know it’s a broader IAD issue you’re already tracking, and that nothing in our create call was at fault.

On our side we'll add a timeout + fallback: use the Machines `wait?state=started` endpoint with a bounded timeout, and if it's exceeded, `destroy --force` and recreate (our sibling Machines booted in ~14s to ~2m, so a recreate should recover quickly). We'll keep a region switch as a secondary fallback for time-sensitive jobs.

If it helps correlate the IAD investigation, here are the relevant machines in zeroclaw-executor-prod from that window (all iad, same image/guest):

- Stuck: `85e209f4471518`, created 2026-06-22 14:58:49 UTC, first boot 17:19:31 UTC (~2h20m).
- Healthy siblings created around the same time booted in ~14s to ~2m.

Happy to share more machine IDs / timestamps if useful. Would appreciate a note here once the IAD fix lands. Thanks again for the quick response!