Fly deploy failed due to unhealthy allocations

victor.degliame · October 8, 2021, 12:05am

I can’t figure out what is wrong with my Elixir deploy.

These two issues (first, second) seemed similar, but unfortunately they didn’t help. My app does listen on port 4000 (the same port as in the fly.toml config), and I don’t believe it takes long to start listening on it.

The deploy succeeded once, but then the app failed right away and apparently kept restarting over and over.

Here are the logs I get upon running fly deploy (I was getting the same thing when the app was continuously restarting and also when running fly vm status <id>):

Preparing kernel init
Configuring firecracker
Starting virtual machine
Starting init (commit: 50ffe20)...
Preparing to run: `/app/entrypoint.sh /app/bin/my_app eval MyApp.Release.migrate` as root
2021/10/07 23:47:35 listening on [fdaa:0:357c:a7b:2203:4241:e3ff:2]:22 (DNS: [fdaa::3]:53)
Reaped child process with pid: 563 and signal: SIGUSR1, core dumped? false
23:47:39.497 [info] Migrations already up
Reaped child process with pid: 565 and signal: SIGUSR1, core dumped? false
Reaped child process with pid: 612 and signal: SIGUSR1, core dumped? false
23:47:41.677 [info] Migrations already up
Main child exited normally with code: 0
Reaped child process with pid: 614 and signal: SIGUSR1, core dumped? false
Starting clean up.
...

The final log lines are:

[error] Health check status changed 'warning' => 'critical'
***v8 failed - Failed due to unhealthy allocations - not rolling back to stable job version 8 as current job has same specification and deploying as v9

Here is my fly.toml:

app = "my-app"

kill_signal = "SIGTERM"
kill_timeout = 5
processes = []

[deploy]
  release_command = "/app/bin/my_app eval MyApp.Release.migrate"

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = true
  private_network=true

[[services]]
  http_checks = []
  internal_port = 4000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "30s"
    interval = "15s"
    restart_limit = 6
    timeout = "2s"
    port = "4000"

In case it helps: my app has large files in the priv folder (about 100MB in total), which I load in a GenServer’s handle_continue, meaning it first starts listening on port 4000 and then loads the data.

kurt · October 8, 2021, 12:07am

Will you paste the output of fly vm status <id> here? The top section shows an event log; in it you’ll see either an exit with a code or a healthcheck failure.

victor.degliame · October 8, 2021, 12:10am

I picked one of the failing instances and ran fly vm status 083bf3e1:

Instance
  ID            = 083bf3e1             
  Task          =                      
  Version       = 8                    
  Region        = cdg                  
  Desired       = stop                 
  Status        = complete             
  Health Checks = 1 total, 1 critical  
  Restarts      = 2                    
  Created       = 21m38s ago           

Recent Events
TIMESTAMP            TYPE             MESSAGE                                                         
2021-10-07T23:47:50Z Received         Task received by client                                         
2021-10-07T23:47:50Z Task Setup       Building Task Directory                                         
2021-10-07T23:48:01Z Started          Task started by client                                          
2021-10-07T23:49:46Z Restart Signaled healthcheck: check "a61773ab9e61f7afdefca4f759fca6f9" unhealthy 
2021-10-07T23:49:57Z Terminated       Exit Code: 0                                                    
2021-10-07T23:49:57Z Restarting       Task restarting in 1.165105052s                                 
2021-10-07T23:50:04Z Started          Task started by client                                          
2021-10-07T23:51:51Z Restart Signaled healthcheck: check "a61773ab9e61f7afdefca4f759fca6f9" unhealthy 
2021-10-07T23:52:00Z Terminated       Exit Code: 0                                                    
2021-10-07T23:52:00Z Restarting       Task restarting in 1.030397041s                                 
2021-10-07T23:52:08Z Started          Task started by client                                          
2021-10-07T23:52:50Z Alloc Unhealthy  Task not running for min_healthy_time of 10s by deadline        
2021-10-07T23:52:51Z Killing          Sent interrupt. Waiting 5s before force killing                 
2021-10-07T23:53:14Z Terminated       Exit Code: 0                                                    
2021-10-07T23:53:14Z Killed           Task successfully killed                                        

Checks
ID                               SERVICE  STATE    OUTPUT                                                 
a61773ab9e61f7afdefca4f759fca6f9 tcp-4000 critical dial tcp 172.19.3.18:4000: connect: connection refused

kurt · October 8, 2021, 12:16am

So that’s saying it starts, and then 1m45s later the healthcheck hasn’t passed. The checks output is showing that it can’t connect to port 4000.
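As a rough check on that 1m45s figure, the gap between each "Started" event and the following "Restart Signaled" event can be computed from the event log above. This is only an illustrative Python sketch. The comparison with the fly.toml values at the end assumes the first check fires when the grace period ends and that each further failure arrives one interval later; that is a guess that happens to match the numbers, not documented behaviour.

```python
from datetime import datetime

# (Started, Restart Signaled) timestamp pairs copied from the event log above.
EVENTS = [
    ("2021-10-07T23:48:01Z", "2021-10-07T23:49:46Z"),
    ("2021-10-07T23:50:04Z", "2021-10-07T23:51:51Z"),
]

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def seconds_between(start: str, end: str) -> int:
    """Return the whole number of seconds from start to end."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds())


elapsed = [seconds_between(start, end) for start, end in EVENTS]
print(elapsed)  # [105, 107]

# Values taken from the [[services.tcp_checks]] block in fly.toml.
GRACE_PERIOD = 30   # seconds
INTERVAL = 15       # seconds
RESTART_LIMIT = 6   # failed checks before a restart

# ASSUMPTION: first failure at the end of the grace period, then one
# further failure per interval until restart_limit is reached.
estimate = GRACE_PERIOD + (RESTART_LIMIT - 1) * INTERVAL
print(estimate)  # 105
```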

This could mean one of two things: either it’s not actually listening on port 4000, or it’s not listening on the right set of IP addresses.

One way you can troubleshoot this is to remove the [[services]] block entirely, deploy the app, and then fly ssh console to it and see what you can connect to. Removing the services block will make it inaccessible from outside, but it will let it run so you can prod at it.
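To make "see what you can connect to" concrete, here is a minimal sketch of a TCP probe that does roughly what a TCP health check does: try to open a connection and report whether it succeeded. It is written in Python purely for illustration. The helper name tcp_check is hypothetical, and an Elixir release image may well not ship Python, in which case the same idea applies with whatever tooling is available on the VM.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only mimics the idea of a TCP health check and
    is not the checker the platform actually runs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable addresses.
        return False


if __name__ == "__main__":
    # Probe the port from fly.toml on the IPv4 and IPv6 loopback addresses.
    for host in ("127.0.0.1", "::1"):
        print(host, tcp_check(host, 4000))
```

If every address is refused, nothing is likely listening on that port at all. If loopback succeeds while the check in fly vm status still fails, the app may be bound to the wrong set of addresses.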

victor.degliame · October 8, 2021, 12:57am

I had a typo in a config file and wasn’t listening on port 4000 after all… Thanks for helping me figure it out Kurt!

Now the app is running, and I was able to confirm that through the ssh console. However, I cannot access it via https://my-app.fly.dev: the browser shows "This site can’t be reached", and the logs don’t show any connection attempt.

I’m wondering if it is an issue with HTTPS. I’ll try a few different configs and open a different issue if I cannot figure it out.

kurt · October 8, 2021, 12:59am

Nice! You added the [[services]] block back? That public https URL will only work with services configured.

victor.degliame · October 8, 2021, 1:07am

I never removed it, because I found the problem before that was needed.

kurt · October 8, 2021, 3:11am

Oh! It seems like your app didn’t get IPs allocated. I’ve fixed that up, you should be getting responses now.

victor.degliame · October 8, 2021, 9:57am

Thank you so much! I was wondering why it had started working when I hadn’t changed anything.

On to coding more stuff now!
