`state=started` from /start, exec immediately fails with `failed_precondition: machine not running` / control.sock connection refused

Hi

We’re seeing a persistent issue on our Machines API app (org fremkit-dev) that looks related to the threads “Sprites becoming non-responsive after usage” and “Checkpoint restore causes sprite to vanish”, but at the raw Machines API layer rather than the Sprites wrapper. Posting separately because the error signatures and scale differ.

Reproducible pattern

  1. POST /v1/apps/<app>/machines/{id}/start returns 200 OK with state=started
  2. POST /v1/apps/<app>/machines/{id}/exec immediately after fails with one of three errors (below)

Error signatures (all 3 seen in last 24h on different machines)

failed_precondition: machine not running
failed_precondition: exec request failed, VMM not running?
internal: Post "http://unix/v1/exec": could not dial /opt/flyd/firefly/<ULID>/root/control.sock: dial unix /opt/flyd/firefly/<ULID>/root/control.sock: connect: connection refused

The third one is the clearest signal: control plane says started, but the firefly control socket inside the VM never binds (or binds then closes).

Scale

  • 269 occurrences in 24h as of 2026-04-16 15:30 UTC
  • 10+ distinct machine IDs, multiple regions
  • Our end-to-end smoke pass rate dropped from baseline to 0-1/20

What we’ve already tried (based on flyctl’s canonical pattern)

  • Wait(state=started, timeout=60) with 500ms→2s exp backoff (flyctl’s pattern from internal/machine/wait.go)
  • Exec-ready probe via GetProcesses (/ps), per-machine singleflight + TTL cache
  • 6-attempt retry on failed_precondition: machine not running with exp backoff 250ms→8s (~16s budget)
  • Recovery: if state flipped to stopped/suspended, re-Start; if starting, extend wait
  • Disabled rehttp retry on POST /exec specifically
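
For reference, the exec retry budget in the list above can be sketched roughly as follows. This is a minimal illustration, not our production client: `backoffSchedule`, `retryExec`, and `errNotRunning` are hypothetical names, the injected `sleep` function exists only so the loop can be exercised without real waiting, and it reads "6-attempt retry" as six retries after the initial call. Under those assumptions the delays are 250ms, 500ms, 1s, 2s, 4s, 8s, which sum to 15.75s (the ~16s budget).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotRunning is a hypothetical stand-in for the API's
// "failed_precondition: machine not running" response.
var errNotRunning = errors.New("failed_precondition: machine not running")

// backoffSchedule returns n delays that start at base, double each
// step, and are capped at max.
func backoffSchedule(base, max time.Duration, n int) []time.Duration {
	delays := make([]time.Duration, 0, n)
	d := base
	for i := 0; i < n; i++ {
		delays = append(delays, d)
		d *= 2
		if d > max {
			d = max
		}
	}
	return delays
}

// retryExec calls exec once, then retries after each delay for as long
// as the error is errNotRunning. It returns the number of calls made
// and the last error. sleep is injected so callers can fake the wait.
func retryExec(exec func() error, delays []time.Duration, sleep func(time.Duration)) (int, error) {
	calls := 1
	err := exec()
	for _, d := range delays {
		// Stop on success (nil) or on any error we do not retry.
		if !errors.Is(err, errNotRunning) {
			break
		}
		sleep(d)
		err = exec()
		calls++
	}
	return calls, err
}

func main() {
	delays := backoffSchedule(250*time.Millisecond, 8*time.Second, 6)

	// Simulated exec that never becomes ready, mirroring the stuck case.
	stuck := func() error { return errNotRunning }

	// Record the waits instead of sleeping so the demo finishes instantly.
	var waited time.Duration
	calls, err := retryExec(stuck, delays, func(d time.Duration) { waited += d })

	fmt.Println("delays:", delays)
	fmt.Println("total wait:", waited, "calls:", calls, "last error:", err)
}
```

In the real client `sleep` would be `time.Sleep` and `exec` would wrap the POST to /exec. The point of the sketch is that the whole budget is spent while /machines/{id} still reports state=started.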

Result: /machines/{id} keeps reporting state=started across the full budget, but exec keeps failing. Re-Start during recovery also returns success, but the next exec still fails. Not a client-side race we can retry around — the control plane is reporting ready while the VM/firefly agent underneath isn’t.

Questions

  1. When /start or /machines/{id} returns state=started, what guarantee (if any) exists about firefly/control-socket readiness? Docs say /start blocks until “VM init responsive” — we see cases where that’s reported true but exec still fails 30+ seconds later.
  2. Is there an observable post-started state/event from our side that indicates control-socket bind completion?
  3. For the control.sock: connection refused path: is there an API-side observable for this, or does it only surface through exec attempts?
  4. Is there a force-reboot/force-recreate endpoint for machines in this stuck state? (same gap requested for sprites in 27137)

Happy to share full log windows, specific machine IDs, or x-fly-request-id values privately if that’s more useful than posting them here. Can also reproduce on-demand via a create/pause/resume cycle.

Thanks!

Can you give an example app / machine ID that’s seeing this problem? There can be occasional inconsistencies in the Machines API, but they are only superficial and won’t manifest as the machine in question actually being stopped or suspended. I’m wondering if this is related to autostop / autostart (i.e., if there is no load on your app and you have autostop configured, the machine may be stopped almost immediately after it starts).

No, it seems nothing failed for this machine. Rather, at the timestamps provided, the main process inside the machine exited, and the entire machine restarted as a result. Your machine logs seem to contain some of your implementation details, so I cannot share them on the forum, but you can check `fly logs` from your side for this machine around the posted timestamps.