Deployment and rollback failed with: Failed due to unhealthy allocations

Hey,

Our app currently fails to deploy with the error “Failed due to unhealthy allocations”. The same error appeared when it tried to roll back. Can someone look into this ASAP? I assume this is a platform issue?

It’s the *-dev-backend machine.

Thanks,
Florian

Based on some more looking around, this looks very much like previously reported issues where a VM was not shut down correctly and, because of the attached volume, the new VM can’t start up.

I definitely need someone from Fly to take a look at this, as it is currently blocking dev deployments, which in turn blocks important prod deployments…

New update (not sure if someone from Fly did something): after multiple revert attempts, it finally seems to be up and running again.

There is definitely something strange going on here.

This looks like a delay in scheduling caused by temporary capacity issues. The host your volume is attached to had a burst of usage. When you deployed, the previous VM was stopped, but the host then couldn’t reserve space to start the new VM. After some time, the capacity pressure cleared and Nomad was able to start a new VM.

This is a rough edge case for Nomad apps. There are two ways you may be able to work around this problem:

  • Run two VMs + Volumes at all times. If you care about uptime for your application, you should run 2+ instances. If you can tolerate issues like this, one instance is fine.
  • Run a Machine-based app instead. The way Machines are architected mitigates this a little. Updating an existing Machine doesn’t do the whole capacity dance; it just restarts. In this particular situation, a Machine would have updated just fine. That’s not always the case, so the first bullet is still the most reliable.
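
To make the difference between these options concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the `Host` class, the slot counts, and the function names are all hypothetical, and it is not how Nomad or Fly’s scheduler is actually implemented. It only mirrors the explanation above: a stop-then-start deploy gives up its reservation and can lose it to a burst on the host, while an in-place restart never releases it.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host; the app's volume is pinned to it."""
    capacity: int  # total VM slots on the host (made-up unit)
    used: int      # slots currently reserved, including our own VM

    def try_reserve(self) -> bool:
        """Reserve one slot if any is free."""
        if self.used < self.capacity:
            self.used += 1
            return True
        return False

    def release(self) -> None:
        """Give back the slot held by our VM."""
        self.used -= 1

    def burst(self, slots: int) -> None:
        """Other workloads grab up to `slots` of the free slots."""
        self.used += min(slots, self.capacity - self.used)


def stop_then_start(host: Host, burst_slots: int) -> bool:
    """Deploy that stops the old VM first, then needs a fresh reservation."""
    host.release()
    host.burst(burst_slots)    # capacity pressure arrives in between
    return host.try_reserve()  # fails if the burst took the freed slot


def update_in_place(host: Host, burst_slots: int) -> bool:
    """Deploy that restarts the existing VM and keeps its reservation."""
    host.burst(burst_slots)    # the burst cannot take a slot we still hold
    return True


def instances_up_after_deploy(hosts, burst_slots, deploy) -> int:
    """Count how many instances are running once the deploy finishes."""
    return sum(deploy(h, b) for h, b in zip(hosts, burst_slots))


if __name__ == "__main__":
    # One instance, full host, burst of one slot: the new VM cannot start.
    print(instances_up_after_deploy([Host(4, 4)], [1], stop_then_start))  # 0
    # Two instances on separate hosts, burst on only one: one stays up.
    print(instances_up_after_deploy(
        [Host(4, 4), Host(4, 4)], [1, 0], stop_then_start))               # 1
    # One instance updated in place: the reservation is never released.
    print(instances_up_after_deploy([Host(4, 4)], [1], update_in_place))  # 1
```

The model leaves out everything that makes the real case messier (for example, an in-place update can also fail for other reasons), which is why running 2+ instances remains the more reliable option.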

As an aside, when you need help with a specific app, the forums may not work well. We don’t see every thread here. For support for apps you care about, the launch plan + email support will work a lot better.

Is there an easy way to migrate a Nomad app to a Machine app? Or would I have to delete the Nomad app and then re-create it as a new app?

It also looks like the issue is not gone. We tried to redeploy the new version and, by the looks of it, are stuck with the same error again.