Fly is stuck replacing VM

Hi,

Fly is constantly failing to replace one of the VMs during deployment. The first time we waited for almost an hour, but it never moved past the following output:

==> Creating release
--> release v241 created
Logs: https://fly.io/apps/obfuscated/monitoring

--> You can detach the terminal anytime without stopping the deployment
==> Monitoring deployment

v241 is being deployed
1b07a5a0: lhr pending
1b07a5a0: lhr pending
1b07a5a0: lhr pending
1b07a5a0: lhr running healthy [health checks: 1 total, 1 passing]

And on the dashboard I can see that the second VM is not replaced:

2 desired, 1 placed, 1 healthy, 0 unhealthy
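
To make the gap in that summary explicit, here is a minimal Go sketch that parses a line of this shape and reports how many desired VMs have not been placed. The helper names `parseStatus` and `unplaced` are hypothetical, and the line format is assumed only from the single example above, not from any documented Fly output format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStatus turns a summary such as
// "2 desired, 1 placed, 1 healthy, 0 unhealthy" into counts keyed by label.
// The format is an assumption based on the one line quoted in this thread.
func parseStatus(line string) (map[string]int, error) {
	counts := make(map[string]int)
	for _, part := range strings.Split(line, ",") {
		fields := strings.Fields(part)
		if len(fields) != 2 {
			return nil, fmt.Errorf("unexpected segment %q", part)
		}
		n, err := strconv.Atoi(fields[0])
		if err != nil {
			return nil, err
		}
		counts[fields[1]] = n
	}
	return counts, nil
}

// unplaced reports how many desired VMs have not been placed yet.
func unplaced(counts map[string]int) int {
	return counts["desired"] - counts["placed"]
}

func main() {
	counts, err := parseStatus("2 desired, 1 placed, 1 healthy, 0 unhealthy")
	if err != nil {
		panic(err)
	}
	fmt.Println("unplaced VMs:", unplaced(counts))
}
```

For the line above, one of the two desired VMs is unplaced, which matches the second VM never being replaced.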

I tried stopping it with flyctl vm stop 9808785f and redeploying, but that didn’t help: the process got stuck again.

==> Creating release
--> release v244 created

--> You can detach the terminal anytime without stopping the deployment
==> Monitoring deployment
Logs: https://fly.io/apps/obfuscated/monitoring

v244 is being deployed
panic: close of closed channel

goroutine 4485 [running]:
github.com/superfly/flyctl/internal/build/imgsrc.(*Resolver).StartHeartbeat.func1()
	github.com/superfly/flyctl/internal/build/imgsrc/resolver.go:488 +0x20
created by time.goFunc
	time/sleep.go:176 +0x38
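
For context on the panic message itself: in Go, calling close on a channel that is already closed panics with exactly this text. The sketch below is not flyctl's code and does not claim to show what happens in resolver.go; `heartbeat`, `stopUnsafe`, `stop`, and `panicMessage` are hypothetical names used only to reproduce the message and to show one common guard (sync.Once) against a double close.

```go
package main

import (
	"fmt"
	"sync"
)

// heartbeat is a hypothetical stand-in for a background task that is
// stopped by closing a channel. It is not flyctl's Resolver type.
type heartbeat struct {
	done chan struct{}
	once sync.Once
}

func newHeartbeat() *heartbeat {
	return &heartbeat{done: make(chan struct{})}
}

// stopUnsafe closes the channel directly; a second call panics.
func (h *heartbeat) stopUnsafe() {
	close(h.done)
}

// stop closes the channel at most once, so repeated calls are safe.
func (h *heartbeat) stop() {
	h.once.Do(func() { close(h.done) })
}

// panicMessage runs f and returns the recovered panic text, or "" if
// f returned normally.
func panicMessage(f func()) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	f()
	return ""
}

func main() {
	a := newHeartbeat()
	a.stopUnsafe()
	// The second close panics; the message is recovered and printed.
	fmt.Println(panicMessage(a.stopUnsafe))

	b := newHeartbeat()
	b.stop()
	// The guarded version does not panic on the second call.
	fmt.Printf("%q\n", panicMessage(b.stop))
}
```

A panic like this comes from the flyctl client rather than from the deployment, so it probably does not by itself explain why the VM is not being replaced.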

How do I fix this?

Are you using the max-per-region directive? If so, the deploys could hang if:

  1. Regions have been removed. fly scale count has to be re-run with the correct, up-to-date count: Unable to deploy to any region - deployment forever loop - #7 by kurt
  2. Deploying with a strategy other than rolling. rolling is the only deploy strategy that works when max-per-region is set: Deploy is not going through (pending automatic promotion) - #7 by jerome

If you’re looking to remove max-per-region, see: Deployment hangs forever - #12 by lewis


Other than that, it sounds like something else might have gone wrong. If you’re using release_command or volumes, or have hit “nomad” (Apps v1) bugs, Fly apps have historically been known to hang in limbo, which requires Fly engineers to intervene and fix things.


Thanks for the suggestions @ignoramous.
I never limited the number of instances per region and don’t remember touching that option at all.

flyctl scale show confirms that:

#...
          Count: 2
 Max Per Region: Not set

Due to regulatory obligations, we deployed only to one region - lhr.

Also, release_command is intact, but we use persistent volumes - one per VM.

Update: I scaled the instance count down to one and the deploy went through. Then I bumped it back up to two and it’s stuck again :angry: :anger:
