Hi
We’re seeing a persistent issue on our Machines API app (org fremkit-dev) that looks related to the threads “Sprites becoming non-responsive after usage” and “Checkpoint restore causes sprite to vanish”, but at the raw Machines API layer rather than the Sprites wrapper. Posting separately because the error signatures and scale differ.
Reproducible pattern
- POST /v1/apps/<app>/machines/{id}/start returns 200 OK with state=started
- POST /v1/apps/<app>/machines/{id}/exec immediately after fails with one of three errors (below)
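
For anyone who wants to replay the two calls, here is a minimal Go sketch of the sequence. It is illustrative only, not our client code. The endpoint paths come from the steps above; the exec request body, the bearer-token header, and the helper names (machinePath, post, startThenExec) are assumptions. The stub server in main is a local stand-in so the sketch runs offline, and its 412 status and error JSON shape are made up to imitate the failure.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// machinePath builds the per-machine endpoint path used in the steps above.
func machinePath(app, id, action string) string {
	return fmt.Sprintf("/v1/apps/%s/machines/%s/%s", app, id, action)
}

// post sends a POST with a JSON content type and returns status code and body.
func post(c *http.Client, url, token, body string) (int, string, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewBufferString(body))
	if err != nil {
		return 0, "", err
	}
	req.Header.Set("Authorization", "Bearer "+token) // assumed auth scheme
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(b), err
}

// startThenExec issues start, then exec right away, and reports both results.
func startThenExec(c *http.Client, base, token, app, id string) (startStatus, execStatus int, execBody string, err error) {
	startStatus, _, err = post(c, base+machinePath(app, id, "start"), token, "")
	if err != nil {
		return
	}
	// The exec body shape below is an assumption; check the API reference.
	execStatus, execBody, err = post(c, base+machinePath(app, id, "exec"), token, `{"command":["true"]}`)
	return
}

func main() {
	// Local stub: start succeeds, exec fails. Status and JSON shape are made up.
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.HasSuffix(r.URL.Path, "/start") {
			fmt.Fprint(w, `{"state":"started"}`)
			return
		}
		w.WriteHeader(http.StatusPreconditionFailed)
		fmt.Fprint(w, `{"error":"failed_precondition: machine not running"}`)
	}))
	defer stub.Close()

	s, e, body, err := startThenExec(stub.Client(), stub.URL, "test-token", "my-app", "machine-id")
	fmt.Println("start:", s, "exec:", e, "body:", body, "err:", err)
}
```

Pointing base at the real API host and passing a real token, app name, and machine ID would issue the same two requests against a live machine.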
Error signatures (all 3 seen in last 24h on different machines)
failed_precondition: machine not running
failed_precondition: exec request failed, VMM not running?
internal: Post "http://unix/v1/exec": could not dial /opt/flyd/firefly/<ULID>/root/control.sock: dial unix /opt/flyd/firefly/<ULID>/root/control.sock: connect: connection refused
The third one is the clearest signal: control plane says started, but the firefly control socket inside the VM never binds (or binds then closes).
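
To count the three signatures separately in logs, a small classifier along these lines may help. This is a sketch, not our production code: the type, constant, and function names are hypothetical, and the matching is plain substring search on the error text quoted above.

```go
package main

import (
	"fmt"
	"strings"
)

// ExecFailure is a hypothetical label for the exec error signatures above.
type ExecFailure string

const (
	MachineNotRunning  ExecFailure = "machine_not_running"
	VMMNotRunning      ExecFailure = "vmm_not_running"
	ControlSockRefused ExecFailure = "control_sock_refused"
	UnknownFailure     ExecFailure = "unknown"
)

// classifyExecError maps an error message to one of the three signatures.
// The control-socket case is checked first because it is the most specific.
func classifyExecError(msg string) ExecFailure {
	switch {
	case strings.Contains(msg, "control.sock") && strings.Contains(msg, "connection refused"):
		return ControlSockRefused
	case strings.Contains(msg, "VMM not running"):
		return VMMNotRunning
	case strings.Contains(msg, "machine not running"):
		return MachineNotRunning
	default:
		return UnknownFailure
	}
}

func main() {
	samples := []string{
		"failed_precondition: machine not running",
		"failed_precondition: exec request failed, VMM not running?",
		`internal: Post "http://unix/v1/exec": could not dial /opt/flyd/firefly/<ULID>/root/control.sock: dial unix /opt/flyd/firefly/<ULID>/root/control.sock: connect: connection refused`,
	}
	counts := map[ExecFailure]int{}
	for _, s := range samples {
		counts[classifyExecError(s)]++
	}
	fmt.Println(counts)
}
```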
Scale
- 269 occurrences in 24h as of 2026-04-16 15:30 UTC
- 10+ distinct machine IDs, multiple regions
- Our end-to-end smoke pass rate dropped from baseline to 0-1/20
What we’ve already tried (based on flyctl’s canonical pattern)
- Wait(state=started, timeout=60) with 500ms→2s exp backoff (flyctl’s pattern from internal/machine/wait.go)
- Exec-ready probe via GetProcesses (/ps), per-machine singleflight + TTL cache
- 6-attempt retry on failed_precondition: machine not running with exp backoff 250ms→8s (~16s budget)
- Recovery: if state flipped to stopped/suspended, re-Start; if starting, extend wait
- Disabled rehttp retry on POST /exec specifically
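
For reference, a minimal Go sketch of the kind of retry schedule described in the list above. It is illustrative rather than our actual client code. It assumes the delays double from 250ms and cap at 8s, and that six delays are spent (250ms, 500ms, 1s, 2s, 4s, 8s, which sum to 15.75s and match the ~16s budget). The names backoffDelay, totalBudget, retryable, and retryExec are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// backoffDelay returns base doubled `attempt` times, capped at max.
// attempt is zero-based.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// totalBudget sums the delays spent across the given number of retries.
func totalBudget(retries int, base, max time.Duration) time.Duration {
	var sum time.Duration
	for i := 0; i < retries; i++ {
		sum += backoffDelay(i, base, max)
	}
	return sum
}

// retryable reports whether err carries the signature we retry on.
func retryable(err error) bool {
	return err != nil && strings.Contains(err.Error(), "failed_precondition: machine not running")
}

// retryExec calls exec once, then up to `retries` more times while the error
// is retryable, sleeping with exponential backoff before each retry.
// It returns the number of calls made and the last error.
func retryExec(retries int, sleep func(time.Duration), exec func() error) (int, error) {
	for i := 0; ; i++ {
		err := exec()
		if !retryable(err) || i == retries {
			return i + 1, err
		}
		sleep(backoffDelay(i, 250*time.Millisecond, 8*time.Second))
	}
}

func main() {
	// Fake exec that always fails, and a fake sleep that only records time,
	// so the sketch finishes instantly.
	var slept time.Duration
	calls, err := retryExec(6,
		func(d time.Duration) { slept += d },
		func() error { return errors.New("failed_precondition: machine not running") },
	)
	fmt.Println("calls:", calls, "slept:", slept, "err:", err)
}
```

Injecting the sleep function keeps the loop testable without real waiting; in real use it would be time.Sleep or a context-aware timer.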
Result: /machines/{id} keeps reporting state=started across the full budget, but exec keeps failing. Re-Start during recovery also returns success, but the next exec still fails. Not a client-side race we can retry around — the control plane is reporting ready while the VM/firefly agent underneath isn’t.
Questions
- When /start or /machines/{id} returns state=started, what guarantee (if any) exists about firefly/control-socket readiness? Docs say /start blocks until “VM init responsive”; we see cases where that’s reported true but exec still fails 30+ seconds later.
- Is there an observable post-started state/event from our side that indicates control-socket bind completion?
- For the control.sock: connection refused path: is there an API-side observable for this, or does it only surface through exec attempts?
- Is there a force-reboot/force-recreate endpoint for machines in this stuck state? (same gap requested for sprites in 27137)
Happy to share full log windows, specific machine IDs, or x-fly-request-id values privately if that’s more useful than posting them here. We can also reproduce this on demand via a create/pause/resume cycle.
Thanks!