What happened:
I deployed an update earlier today. During the rolling update, the existing machine entered a `replacing` state and never completed. I destroyed the stuck machine and attempted to redeploy, but every new machine Fly creates gets stuck in the `created` state and never reaches `running`.
Every deploy attempt ends with:
```
Failed to update machines: failed to update machine <id>: failed to update VM <id>: request returned non-2xx status: 408
```
I cannot switch regions because my media volume (<vol id>) is locked to DFW.
**What I’ve tried:**
- Destroying stuck machines and redeploying multiple times
- `fly machine start <id>` — returns `error: unable to start machine from current state: 'created'`
- Rolling back to the last known-good deployment — same result
- Deploying to ORD — blocked by the volume region lock

**Current state:**

- Volume <vol id> is attached to a machine stuck in `created`
- Every deploy creates a new machine that immediately gets stuck in `created`
The rolling deployment strategy is supposed to keep the old machine running until the new one is healthy — that didn’t happen here, leaving me with no running instance and no way to recover without intervention on the DFW infrastructure side.
Any help getting this unstuck would be appreciated.
I am also having issues with a machine in dfw. It failed overnight and I am not able to recreate it in that region. I am now trying a different region (iad), but the machine is stuck in the `starting` state and I am not able to start it at all. The image I am using is flyio/postgres-flex:17.2.
@caike Unrelated to the deployment issues, but worthy of note: in case you are thinking of running a single node of unmanaged Postgres, this is a configuration that Fly recommends against. We’ve had a smattering of users in this forum who run single-node with no backups (and sometimes not even snapshots), and in some cases they’ve lost everything. Unfortunately, NVMe failure does sometimes occur on the underlying hosts.
The volume is tied to a specific physical host machine, not just to the region as a whole, so when that one unit has capacity problems, your VM won’t start.
If you look at the logs (fly logs) the next time this happens, you will probably see the detailed complaint: not enough RAM, CPU, etc.
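A quick way to check, sketched here with a placeholder app name (`my-app` is an assumption; substitute your own):

```shell
# Tail the app's logs; host capacity complaints (not enough RAM,
# CPU, etc. to place the machine) usually show up here.
fly logs -a my-app

# Inspect the stuck machine's state and its most recent events.
fly machine status <id> -a my-app
```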
Copying the volume put it on a different physical host machine, one which had more free space.
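For reference, that copy can be done with `fly volumes fork`, which clones a volume and (in my experience) tends to place the fork on a different physical host with free capacity. A rough sketch, with the app name as a placeholder:

```shell
# Fork the volume; Fly picks a host with available capacity.
fly volumes fork <vol id> -a my-app

# List volumes to find the new fork's ID and confirm where it landed.
fly volumes list -a my-app
```

You would then attach the forked volume to a fresh machine (and eventually destroy the old volume once you've verified the data).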
Generally speaking, you want to avoid having just a single volume on the Fly.io platform. These temporary downtimes are only the beginning of what could go wrong. When the underlying physical host machine fails completely (which it will someday), you will have permanent data loss.
You mentioned that it was a “media volume” up above, so it would probably be worth looking into whether you could use Tigris instead. If it’s just a place to store uploaded images, etc., then Tigris would be much safer, and the Fly.io platform would then also be free to auto-migrate your Machine, instead of it being a manual-intervention emergency each time…
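If you do go that route, Tigris buckets can be provisioned straight from flyctl; a sketch (run from your app's directory, or pass `-a`):

```shell
# Create a Tigris object-storage bucket associated with the app.
# This should set S3-compatible credentials as app secrets, so the
# app can talk to the bucket via any S3 client library.
fly storage create
```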
(I like Fly.io’s volumes myself, but they’re really better suited to those who want to experiment with multi-Machine distributed databases, etc.; they’ve always been a poor fit for less experienced users, in my opinion.)