App down in DFW: machine stuck in `created` state, 408 timeouts on every deploy attempt

App name: ex-libris-latti
Region: DFW

What happened:
I deployed an update earlier today. During the rolling update, the existing machine entered the `replacing` state and never completed. I destroyed the stuck machine and attempted to redeploy, but every new machine Fly creates gets stuck in the `created` state and never reaches `running`.

Every deploy attempt ends with:

Failed to update machines: failed to update machine <id>: failed to update VM <id>: request returned non-2xx status: 408

I cannot switch regions because my media volume (<vol id>) is locked to DFW.

**What I’ve tried:**

  • Destroying stuck machines and redeploying multiple times (rough command sequence sketched after this list)
  • `fly machine start <id>` — returns error: `unable to start machine from current state: 'created'`
  • Rolling back to the last known-good deployment — same result
  • Deploying to ORD — blocked by volume region lock
  • Checked status.fly.io — no reported incidents
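
For reference, the sequence looked roughly like this (machine IDs are placeholders):

```sh
# Destroy the machine stuck in `created`, then redeploy:
fly machine destroy <id>
fly deploy              # the new machine never leaves `created`

# Manually starting it fails, as noted above:
fly machine start <id>  # => unable to start machine from current state: 'created'
```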

Current state:

  • No machines running — app is completely down
  • Volume <vol id> attached to a machine stuck in `created`
  • Every deploy creates a new machine that immediately gets stuck in `created`

The rolling deployment strategy is supposed to keep the old machine running until the new one is healthy — that didn’t happen here, leaving me with no running instance and no way to recover without intervention on the DFW infrastructure side.

Any help getting this unstuck would be appreciated.

Hi… Forking (i.e., copying) the volume is the recommended workaround in this situation:

https://community.fly.io/t/insufficient-capacity-in-iad/27499/3

That would be `fly vol fork <vol id> --region ord`, if you wanted to give Chicago a try again…
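
Expanding that into a full sketch (same `<vol id>` placeholder as above; assumes current flyctl syntax):

```sh
# Copy the DFW volume onto a host in ORD:
fly vol fork <vol id> --region ord

# Confirm the forked copy shows up with region ord:
fly vol list

# Redeploy; you may also need to point primary_region in fly.toml at "ord":
fly deploy
```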

I am also having issues with a machine in dfw. It failed overnight, and I have not been able to recreate it in this region. I am now trying a different region (iad), but the machine is stuck there too, this time in the `starting` state. I am not able to start it at all. The image I am using is `flyio/postgres-flex:17.2`.

@caike Unrelated to the deployment issues, but worthy of note: in case you are thinking of running a single node of unmanaged Postgres, that is a configuration Fly recommends against. We’ve seen a smattering of users in this forum who run single-node with no backups (and sometimes not even snapshots), and in some cases they’ve lost everything. Unfortunately, NVMe failure does sometimes occur on the underlying hosts.
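
As a quick sanity check, you can at least confirm that automatic snapshots exist for a volume. A sketch, reusing the `<vol id>` placeholder from above:

```sh
# List the snapshots Fly keeps for a volume; an empty list here means
# a failed host would take the data with it:
fly vol snapshots list <vol id>
```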

Thanks. Looks like my machine issue was indeed related to a volume failure. Creating a fresh volume appears to have solved it.

Forked volume to ord and then deployed to ord. Success!
Many thanks!

Any idea why this worked? I am, at best, a hobbyist, and DevOps is definitely my biggest weakness.

Glad to hear that it worked!

The volume is tied to a specific physical host machine, not just to the region as a whole, so when that one unit has capacity problems, your VM won’t start.

If you look at the logs (`fly logs`) the next time this happens, you will probably see the detailed complaint: not enough RAM, CPU, etc.

Copying the volume put it on a different physical host machine, one with capacity to spare.
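
You can see part of this from the CLI; a sketch (the zone column is, as far as I know, how flyctl surfaces host placement):

```sh
# Each volume row includes a zone identifying the hardware it lives on:
fly vol list

# Scheduler capacity complaints show up alongside the app logs:
fly logs
```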

Generally speaking, you want to avoid having just a single volume on the Fly.io platform. These temporary downtimes are only the beginning of what could go wrong, :dragon:. When the underlying physical host machine fails completely (which it will someday), then you will have permanent data loss.

You mentioned that it was a “media volume” up above, so it would probably be worth looking into whether you could use Tigris instead. If it’s just a place to store uploaded images, etc., then Tigris would be much safer, and the Fly.io platform would then also be free to auto-migrate your Machine, instead of it being a manual-intervention emergency each time…
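
If you want to try that route, Tigris is wired into flyctl; a minimal sketch, assuming the current `fly storage` integration (the injected variable names come from Fly’s docs, not this thread):

```sh
# Provision an S3-compatible Tigris bucket associated with the app:
fly storage create

# The app then talks to the bucket with any S3 client, using the
# credentials that the command stores as app secrets
# (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT_URL_S3, BUCKET_NAME).
```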

(I like Fly.io’s volumes myself, but they’re really better suited to those who want to experiment with multi-Machine distributed databases, etc.; they’ve always been a poor fit for less experienced users, in my opinion.)
