Fly deploy failed due to unhealthy allocations

I am unable to understand what might be wrong with my Elixir deploy.

These 2 issues (first, second) seemed similar but they unfortunately didn’t help. My app does listen on port 4000 (same as the fly.toml config) and I do not believe it takes long for it to start listening on it.

The deploy succeeded once, but the app then failed right away and apparently kept restarting over and over.

Here are the logs I get upon running fly deploy (I was getting the same thing when the app was continuously restarting and also when running fly vm status <id>):

Preparing kernel init
Configuring firecracker
Starting virtual machine
Starting init (commit: 50ffe20)...
Preparing to run: `/app/ /app/bin/my_app eval MyApp.Release.migrate` as root
2021/10/07 23:47:35 listening on [fdaa:0:357c:a7b:2203:4241:e3ff:2]:22 (DNS: [fdaa::3]:53)
Reaped child process with pid: 563 and signal: SIGUSR1, core dumped? false
23:47:39.497 [info] Migrations already up
Reaped child process with pid: 565 and signal: SIGUSR1, core dumped? false
Reaped child process with pid: 612 and signal: SIGUSR1, core dumped? false
23:47:41.677 [info] Migrations already up
Main child exited normally with code: 0
Reaped child process with pid: 614 and signal: SIGUSR1, core dumped? false
Starting clean up.

The final log is

[error] Health check status changed 'warning' => 'critical'
***v8 failed - Failed due to unhealthy allocations - not rolling back to stable job version 8 as current job has same specification and deploying as v9 

Here is my fly.toml:

app = "my-app"

kill_signal = "SIGTERM"
kill_timeout = 5
processes = []

[deploy]
  release_command = "/app/bin/my_app eval MyApp.Release.migrate"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "30s"
    interval = "15s"
    restart_limit = 6
    timeout = "2s"
    port = "4000"

In case it helps: my app contains big files in the priv folder (about 100MB in total), which I load in a GenServer’s handle_continue, meaning the app first starts listening on port 4000 and only then loads the data.
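For readers unfamiliar with the pattern described here, this is a minimal sketch of deferring a slow load to handle_continue. The module name MyApp.DataLoader, the :path option and the size/0 function are hypothetical; the original post does not show its code.

```elixir
defmodule MyApp.DataLoader do
  # Hypothetical module: loads one file into memory after init/1 has returned.
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Returns the size in bytes of the loaded data.
  # A call sent while the load is still running waits in the mailbox
  # until handle_continue/2 has finished.
  def size do
    GenServer.call(__MODULE__, :size)
  end

  @impl true
  def init(opts) do
    # Return immediately so the supervisor can go on to start its next
    # child; the slow work is pushed into handle_continue/2.
    {:ok, %{path: Keyword.fetch!(opts, :path), data: nil}, {:continue, :load}}
  end

  @impl true
  def handle_continue(:load, state) do
    # Runs right after init/1, before any other message is handled.
    {:noreply, %{state | data: File.read!(state.path)}}
  end

  @impl true
  def handle_call(:size, _from, state) do
    {:reply, byte_size(state.data), state}
  end
end
```

In a real release the path would probably be built with something like Application.app_dir(:my_app, "priv/some_file"), where both the app name and the file name are placeholders. Note that this only keeps the load from blocking startup; it has no effect on which port the app listens on.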

Will you paste the output of fly vm status <id> here? The top section shows an event log; what you’ll see in there is either an exit with a code or a healthcheck failure.

I picked one of the failing instances and ran fly vm status 083bf3e1

  ID            = 083bf3e1             
  Task          =                      
  Version       = 8                    
  Region        = cdg                  
  Desired       = stop                 
  Status        = complete             
  Health Checks = 1 total, 1 critical  
  Restarts      = 2                    
  Created       = 21m38s ago           

Recent Events
TIMESTAMP            TYPE             MESSAGE                                                         
2021-10-07T23:47:50Z Received         Task received by client                                         
2021-10-07T23:47:50Z Task Setup       Building Task Directory                                         
2021-10-07T23:48:01Z Started          Task started by client                                          
2021-10-07T23:49:46Z Restart Signaled healthcheck: check "a61773ab9e61f7afdefca4f759fca6f9" unhealthy 
2021-10-07T23:49:57Z Terminated       Exit Code: 0                                                    
2021-10-07T23:49:57Z Restarting       Task restarting in 1.165105052s                                 
2021-10-07T23:50:04Z Started          Task started by client                                          
2021-10-07T23:51:51Z Restart Signaled healthcheck: check "a61773ab9e61f7afdefca4f759fca6f9" unhealthy 
2021-10-07T23:52:00Z Terminated       Exit Code: 0                                                    
2021-10-07T23:52:00Z Restarting       Task restarting in 1.030397041s                                 
2021-10-07T23:52:08Z Started          Task started by client                                          
2021-10-07T23:52:50Z Alloc Unhealthy  Task not running for min_healthy_time of 10s by deadline        
2021-10-07T23:52:51Z Killing          Sent interrupt. Waiting 5s before force killing                 
2021-10-07T23:53:14Z Terminated       Exit Code: 0                                                    
2021-10-07T23:53:14Z Killed           Task successfully killed                                        

ID                               SERVICE  STATE    OUTPUT                                                 
a61773ab9e61f7afdefca4f759fca6f9 tcp-4000 critical dial tcp connect: connection refused 

So that’s saying it starts, and then 1m45s later the healthcheck hasn’t passed. The checks output is showing that it can’t connect to port 4000.

This could mean a number of things: either it’s not actually listening on port 4000, or it’s not listening on the right set of IP addresses.
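To illustrate the "right set of IP addresses" case, here is a sketch of what the listener config might look like, assuming the app is a Phoenix app. The names :my_app and MyAppWeb.Endpoint are placeholders, and the thread does not show the actual config file.

```elixir
# config/runtime.exs or config/prod.exs (assumed location)
import Config

config :my_app, MyAppWeb.Endpoint,
  # Bind to the IPv6 "any" address instead of loopback only, so that a
  # check coming from outside the process's own loopback can connect.
  # The port must match internal_port in fly.toml.
  http: [ip: {0, 0, 0, 0, 0, 0, 0, 0}, port: 4000]
```

If the endpoint were bound to {127, 0, 0, 1} only, or to a different port, a TCP check on port 4000 could plausibly report "connection refused" like the one above.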

One way you can troubleshoot this is to remove the [[services]] block entirely, deploy the app, and then fly ssh console to it and see what you can connect to. Removing the services block will make it inaccessible from outside, but it will let it run so you can prod at it.
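Once inside the VM, one way to "see what you can connect to" without relying on tools that may be missing from the image is to probe the port from Elixir itself. This is only a sketch: PortProbe is a hypothetical helper, not part of Fly or of the app.

```elixir
defmodule PortProbe do
  # Hypothetical helper: tries to open a TCP connection and reports the result.
  def check(host, port, timeout \\ 1_000) do
    case :gen_tcp.connect(String.to_charlist(host), port, [:binary, active: false], timeout) do
      {:ok, socket} ->
        # Something accepted the connection; close it again straight away.
        :gen_tcp.close(socket)
        :open

      {:error, reason} ->
        # For example :econnrefused when nothing is listening on that port.
        {:closed, reason}
    end
  end
end
```

Assuming the release lives at /app/bin/my_app as in the logs above, running /app/bin/my_app remote from the ssh console should attach an IEx session to the running node, where the module can be pasted and PortProbe.check("localhost", 4000) called.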

I had a typo in a config file and wasn’t listening on port 4000 after all… Thanks for helping me figure it out Kurt!

Now the app is running and I was able to confirm it through the ssh console. However, I cannot access it via its public URL: I get “This site can’t be reached”, and the logs don’t show any connection attempt.

I’m wondering if it is an issue with https. I’ll try a few different configs and open a different issue if I cannot figure it out.

Nice! You added the [[services]] block back? That public https URL will only work with services configured.

I never removed it because I found the problem before that was needed :slight_smile:

Oh! It seems like your app didn’t get IPs allocated. I’ve fixed that up, you should be getting responses now.

Thank you so much, I was wondering why it was working now after I didn’t do anything :smiley:

On to coding more stuff now!