Unable to deploy successfully, seeing some strange behaviour with flyctl

Hi @fly peeps!

I’m currently having trouble deploying with flyctl, and I’m on the latest version. When I try to deploy, I see something like:

==> Creating release
Release v7078 created

You can detach the terminal anytime without stopping the deployment
Monitoring Deployment

30 desired, 18 placed, 8 healthy, 1 unhealthy [health checks: 18 total, 18 passing]
v7078 failed - Failed due to unhealthy allocations - rolling back to job version 7077

23 desired, 5 placed, 0 healthy, 1 unhealthy [health checks: 2 total, 1 passing]
v7079 failed - Failed due to unhealthy allocations - not rolling back to stable job version 7079 as current job has same specification
Failed Instances

I’m not able to see why it is showing as unhealthy, so it’s difficult to fix.

The strange behaviour I’m seeing (via fly status --watch) is that the deploy fails, and then the number of running instances (target is 30) slowly drains down to 3 or 4. When I deploy again, the count shoots back up to ~30 instances of the previous version, which start to be replaced with the new version. That deploy fails too, and the instances start to drop off again.

Your help would be greatly appreciated, as right now the app/site is down.

Update: it’s now back up, but running version 7078, which showed as failed to deploy. Not sure if this is a bug with my code or something to do with the deploy itself.

These were likely transient failures on our end. We’ve had heisenbugs causing VM failures on busy, global apps this week, especially ones running in Chennai and Sydney.

You can add this to your config to prevent rollbacks, which should help somewhat:

[experimental]
auto_rollback = false

If a random VM failure happens with that set, the deploy will stop but remain staged. When that happens, you can run fly vm stop <id> for any older instances that are still running.

Ah, thanks for the heads-up. I was planning to do a number of deploys this week to test out various tuneable settings, but I’ll hold off. Sorry to ask the annoying question, but do you have any idea when things might get a little more stable on your side?

Oh, FYI, I tried scaling down from 30 to 20 with fly scale count 20, and it took the app offline again (release 7080). A redeploy from my side (7081) seems to have brought everything back up. This also seems to be the first “successful” release I’ve attempted:

20 desired, 20 placed, 20 healthy, 0 unhealthy [health checks: 40 total, 40 passing]
--> v7081 deployed successfully

The Docker images I’m pushing change fly.toml timeouts and some CSS, so they’re otherwise pretty consistent.

Don’t know if that info will help identify these bugs. I’ll try to leave things alone now till next week.
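
In case it helps anyone tracking the same pattern, here is a minimal sketch for pulling the counts out of the deployment status lines quoted above, so they can be logged or compared across releases. It assumes the line format stays exactly as pasted in this thread; the names STATUS_RE and parse_status are made up for illustration and are not part of flyctl.

```python
import re
from typing import Dict, Optional

# Matches status lines of the form pasted in this thread, e.g.
# "30 desired, 18 placed, 8 healthy, 1 unhealthy [health checks: 18 total, 18 passing]"
# Assumption: the format is stable; flyctl does not document it as an interface.
STATUS_RE = re.compile(
    r"(?P<desired>\d+) desired, (?P<placed>\d+) placed, "
    r"(?P<healthy>\d+) healthy, (?P<unhealthy>\d+) unhealthy "
    r"\[health checks: (?P<checks_total>\d+) total, "
    r"(?P<checks_passing>\d+) passing\]"
)


def parse_status(line: str) -> Optional[Dict[str, int]]:
    """Return the counts from one status line, or None if the line does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    sample = (
        "30 desired, 18 placed, 8 healthy, 1 unhealthy "
        "[health checks: 18 total, 18 passing]"
    )
    counts = parse_status(sample)
    print(counts)
    # All health checks pass here, yet only 8 of the 30 desired instances are healthy.
    print(counts["desired"] - counts["healthy"], "instances short of target")
```

Feeding it the v7078 line shows the mismatch I was confused by: every health check passes, but most of the desired instances never became healthy.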