Deployment Errors - Failed due to unhealthy allocations

Our app has been deploying fine, but today we made a minor change and it now refuses to deploy, failing with "Failed due to unhealthy allocations".

The error seems to happen straight away, before the app even has any time to boot, and I don’t see any logs at all.

I am now starting to think this might be a problem on the Fly end.
Is LHR down and causing issues, maybe?

fly status --all

App
  Name     = portal-site          
  Owner    = portal               
  Version  = 71                   
  Status   = running              
  Hostname = portal-site.fly.dev  

Deployment Status
  ID          = e8b2ad96-5f18-a6a9-0e29-4e7cc82654ab                                                                                   
  Version     = v71                                                                                                                    
  Status      = failed                                                                                                                 
  Description = Failed due to unhealthy allocations - not rolling back to stable job version 71 as current job has same specification  
  Instances   = 3 desired, 1 placed, 0 healthy, 1 unhealthy                                                                            

Instances
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED    
82f10aa0 app     71 ⇡    lhr    stop    failed                     0        1m51s ago  
a1af5cf2 app     70      lhr    stop    failed                     0        4m43s ago  
358df4a0 app     69      lhr    stop    failed                     0        4m51s ago  
de21f406 app     68      lhr    stop    failed                     0        16m54s ago 
27cab205 app     67      atl    run     running 1 total, 1 passing 0        12h20m ago 
3e7ecfe3 app     67      lax    run     running 1 total, 1 passing 0        12h21m ago 
ad26704d app     67      ams(B) run     running 1 total, 1 passing 0        12h22m ago 
e2a40443 app     66      lhr    stop    failed                     0        9h39m ago  

fly vm status 82f10aa0

Instance
  ID            = 82f10aa0   
  Process       =            
  Version       = 71         
  Region        = lhr        
  Desired       = stop       
  Status        = failed     
  Health Checks =            
  Restarts      = 0          
  Created       = 6m54s ago  

Recent Events
TIMESTAMP            TYPE            MESSAGE                                                                    
2021-10-08T22:55:26Z Received        Task received by client                                                    
2021-10-08T22:55:26Z Task Setup      Building Task Directory                                                    
2021-10-08T22:55:31Z Driver Failure  failed to start task after driver exited unexpectedly: plugin is shut down 
2021-10-08T22:55:31Z Not Restarting  Error was unrecoverable                                                    
2021-10-08T22:55:31Z Alloc Unhealthy Unhealthy because of failed task                                           
2021-10-08T22:55:31Z Killing         Sent interrupt. Waiting 5s before force killing                            

Checks
ID SERVICE STATE OUTPUT 

Recent Logs

flyctl regions list

Region Pool:
atl
lax
lhr
Backup Region:
ams
cdg
iad
sea
sjc
vin
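
To check whether the failures cluster in one region, as they appear to in the output above, the Instances table can be tallied by region. This is a minimal sketch that assumes the whitespace-separated layout shown in this post; `failed_by_region` and `SAMPLE` are hypothetical names, and the parser is not part of flyctl.

```python
from collections import Counter


def failed_by_region(table_text):
    """Count instances whose STATUS column is 'failed', grouped by region.

    Assumes the column order shown in the pasted `fly status --all` output:
    ID, PROCESS, VERSION, REGION, DESIRED, STATUS, ...
    """
    counts = Counter()
    for line in table_text.splitlines():
        # Drop the arrow marker that follows the version being deployed.
        tokens = [t for t in line.split() if t != "⇡"]
        # Skip blank lines, short lines, and the header row.
        if len(tokens) < 6 or tokens[0] == "ID":
            continue
        region = tokens[3].split("(")[0]  # "ams(B)" -> "ams"
        status = tokens[5]
        if status == "failed":
            counts[region] += 1
    return counts


# Rows copied from the Instances table in this post.
SAMPLE = """\
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS      RESTARTS CREATED
82f10aa0 app     71 ⇡    lhr    stop    failed                     0        1m51s ago
a1af5cf2 app     70      lhr    stop    failed                     0        4m43s ago
358df4a0 app     69      lhr    stop    failed                     0        4m51s ago
de21f406 app     68      lhr    stop    failed                     0        16m54s ago
27cab205 app     67      atl    run     running 1 total, 1 passing 0        12h20m ago
3e7ecfe3 app     67      lax    run     running 1 total, 1 passing 0        12h21m ago
ad26704d app     67      ams(B) run     running 1 total, 1 passing 0        12h22m ago
e2a40443 app     66      lhr    stop    failed                     0        9h39m ago
"""

if __name__ == "__main__":
    for region, count in failed_by_region(SAMPLE).items():
        print(region, count)
```

If every failed instance lands in the same region while other regions keep running, that points more toward a regional or host problem than toward the app itself.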

I’m not sure if this is due to the same underlying issue you’re seeing, but I’ve attempted two deploys in the past few hours via a GitHub Action and both hung for over an hour before I manually canceled.

==> Creating release
Release v144 created
Release command detected: this new release will not be available until the command succeeds.

You can detach the terminal anytime without stopping the deployment
==> Release command
Command: /app/bin/enaia eval Enaia.Release.migrate
[…]
Error: The operation was canceled.

I am running this locally and nothing is hanging, so I’m not sure whether it’s the same issue; it sounds different. Possibly the same root cause, though, I guess.

These are, indeed, failing to boot on a specific host in London. We’re looking into it!

@enaia is your app running in London? It shouldn’t hang; the deploy would probably have failed if the VM wouldn’t start.

We’re running only in IAD. Our deployment process was working earlier today and nothing about it has changed.

Thanks for the quick reply. Seems to have deployed successfully now. :crossed_fingers:

We pulled the misbehaving host out to see what’s up, thanks for letting us know!

@enaia, if you run flyctl status, do you see a new version deployed? Sometimes the CLI hangs while querying our API, but once the release command starts, you don’t need to stay connected for the deploy to continue.

OK, I just checked: it looks like version 144 deployed successfully, despite our CLI hanging. I see two more versions have been deployed since then as well. We have some issues filed for those kinds of CLI hangs; they should go away over time.
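
The check described above, comparing the release a deploy log created against what flyctl status reports, could be sketched roughly as follows. It assumes the text formats pasted in this thread ("Release v144 created" in the log, and a "Deployment Status" section of `key = value` lines in the status output). The function names are hypothetical, and the code only parses text that has already been captured; it does not call flyctl. Note that a bumped version alone is not proof of success: the first post shows version 71 with a failed deployment, so the sketch returns the reported status string rather than guessing.

```python
import re


def release_from_log(log_text):
    """Return the release number from a 'Release vN created' log line, or None."""
    match = re.search(r"Release v(\d+) created", log_text)
    return int(match.group(1)) if match else None


def deployment_fields(status_text):
    """Parse the 'key = value' lines under 'Deployment Status' into a dict."""
    fields = {}
    in_section = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped == "Deployment Status":
            in_section = True
            continue
        if in_section:
            if not stripped:
                break  # a blank line ends the section
            key, sep, value = stripped.partition("=")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


def release_outcome(log_text, status_text):
    """Return 'pending', 'unknown', or the status string the CLI reported."""
    wanted = release_from_log(log_text)
    fields = deployment_fields(status_text)
    match = re.fullmatch(r"v?(\d+)", fields.get("Version", ""))
    if wanted is None or match is None:
        return "unknown"
    if int(match.group(1)) < wanted:
        return "pending"  # status output still describes an older release
    return fields.get("Status", "unknown")


# Status text copied from the first post in this thread.
STATUS = """\
App
  Name     = portal-site
  Version  = 71

Deployment Status
  ID          = e8b2ad96-5f18-a6a9-0e29-4e7cc82654ab
  Version     = v71
  Status      = failed
  Instances   = 3 desired, 1 placed, 0 healthy, 1 unhealthy

Instances
"""

if __name__ == "__main__":
    print(release_outcome("Release v71 created", STATUS))
    print(release_outcome("Release v72 created", STATUS))
```

Returning the raw status string keeps the sketch from assuming which exact wording the CLI uses for a successful deployment.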

I am also having this issue, but for the Sydney region. How can I troubleshoot?

Deployment Status
  ID          = a41e5f11-0058-baa9-7063-6be8b63f161c                                           
  Version     = v1                                                                             
  Status      = failed                                                                         
  Description = Failed due to unhealthy allocations - no stable job version to auto revert to  
  Instances   = 1 desired, 1 placed, 0 healthy, 1 unhealthy

Found the issue: there was a problem with my build. I had an unexpected change I didn’t account for.
