I can’t figure out what is wrong with my Elixir deploy.
These 2 issues (first, second) seemed similar but they unfortunately didn’t help. My app does listen on port 4000 (same as the fly.toml config) and I do not believe it takes long for it to start listening on it.
The deploy succeeded once, but then it failed right away and the app apparently restarted over and over.
Here are the logs I get upon running fly deploy (I was getting the same thing when the app was continuously restarting and also when running fly vm status <id>):
Preparing kernel init
Configuring firecracker
Starting virtual machine
Starting init (commit: 50ffe20)...
Preparing to run: `/app/entrypoint.sh /app/bin/my_app eval MyApp.Release.migrate` as root
2021/10/07 23:47:35 listening on [fdaa:0:357c:a7b:2203:4241:e3ff:2]:22 (DNS: [fdaa::3]:53)
Reaped child process with pid: 563 and signal: SIGUSR1, core dumped? false
23:47:39.497 [info] Migrations already up
Reaped child process with pid: 565 and signal: SIGUSR1, core dumped? false
Reaped child process with pid: 612 and signal: SIGUSR1, core dumped? false
23:47:41.677 [info] Migrations already up
Main child exited normally with code: 0
Reaped child process with pid: 614 and signal: SIGUSR1, core dumped? false
Starting clean up.
...
The final log lines are:
[error] Health check status changed 'warning' => 'critical'
***v8 failed - Failed due to unhealthy allocations - not rolling back to stable job version 8 as current job has same specification and deploying as v9
In case it helps: my app has big files in the priv folder (about 100MB in total), which I load in a GenServer’s handle_continue, meaning the app first starts listening on port 4000 and then loads the data.
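For readers who haven’t used this pattern, here is a minimal sketch of loading data in handle_continue. The module name MyApp.DataLoader, the :path option and the size/1 helper are hypothetical and are not the actual code from this app. The point is only that init/1 returns right away, so the supervisor can go on to start the remaining children while the file is read afterwards.

```elixir
defmodule MyApp.DataLoader do
  # Hypothetical module: illustrates deferring a slow load to handle_continue/2.
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: Keyword.get(opts, :name, __MODULE__))
  end

  # Returns the size in bytes of the loaded data.
  def size(server \\ __MODULE__), do: GenServer.call(server, :size)

  @impl true
  def init(opts) do
    # Return immediately so the supervisor is not blocked by the load;
    # the actual work is scheduled through the :continue instruction.
    path = Keyword.fetch!(opts, :path)
    {:ok, %{path: path, data: nil}, {:continue, :load}}
  end

  @impl true
  def handle_continue(:load, state) do
    # Runs right after init/1 and before any other message is handled.
    data = File.read!(state.path)
    {:noreply, %{state | data: data}}
  end

  @impl true
  def handle_call(:size, _from, state) do
    {:reply, byte_size(state.data), state}
  end
end
```

Calls such as size/1 simply wait in the mailbox until the load has finished, because handle_continue/2 runs before any queued message.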
Will you paste the output of that command here? The top section shows an event log. What you’ll see in there is either an exit with a code or a healthcheck failure.
I picked one of the failing instances and ran fly vm status 083bf3e1:
Instance
ID = 083bf3e1
Task =
Version = 8
Region = cdg
Desired = stop
Status = complete
Health Checks = 1 total, 1 critical
Restarts = 2
Created = 21m38s ago
Recent Events
TIMESTAMP TYPE MESSAGE
2021-10-07T23:47:50Z Received Task received by client
2021-10-07T23:47:50Z Task Setup Building Task Directory
2021-10-07T23:48:01Z Started Task started by client
2021-10-07T23:49:46Z Restart Signaled healthcheck: check "a61773ab9e61f7afdefca4f759fca6f9" unhealthy
2021-10-07T23:49:57Z Terminated Exit Code: 0
2021-10-07T23:49:57Z Restarting Task restarting in 1.165105052s
2021-10-07T23:50:04Z Started Task started by client
2021-10-07T23:51:51Z Restart Signaled healthcheck: check "a61773ab9e61f7afdefca4f759fca6f9" unhealthy
2021-10-07T23:52:00Z Terminated Exit Code: 0
2021-10-07T23:52:00Z Restarting Task restarting in 1.030397041s
2021-10-07T23:52:08Z Started Task started by client
2021-10-07T23:52:50Z Alloc Unhealthy Task not running for min_healthy_time of 10s by deadline
2021-10-07T23:52:51Z Killing Sent interrupt. Waiting 5s before force killing
2021-10-07T23:53:14Z Terminated Exit Code: 0
2021-10-07T23:53:14Z Killed Task successfully killed
Checks
ID SERVICE STATE OUTPUT
a61773ab9e61f7afdefca4f759fca6f9 tcp-4000 critical dial tcp 172.19.3.18:4000: connect: connection refused
So that’s saying it starts, and then 1m45s later the healthcheck hasn’t passed. The checks output is showing that it can’t connect to port 4000.
This could mean a number of things: either it’s not actually listening on port 4000, or it’s not listening on the right set of IP addresses.
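As an illustration of the second case, here is roughly what binding to all addresses looks like, assuming this is a Phoenix app. The thread doesn’t say which web server is in use, so the endpoint name MyAppWeb.Endpoint, the :my_app application name and the choice of config/runtime.exs are assumptions. A server bound only to 127.0.0.1 would refuse a check that dials the instance address, like the 172.19.3.18:4000 in the output above.

```elixir
# config/runtime.exs (assumed location; adjust to wherever the endpoint is configured)
import Config

config :my_app, MyAppWeb.Endpoint,
  http: [
    # The all-zero IPv6 address binds every interface rather than loopback only.
    # On Linux this normally accepts IPv4 connections too.
    ip: {0, 0, 0, 0, 0, 0, 0, 0},
    # Must match the internal port declared in fly.toml.
    port: 4000
  ]
```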
One way you can troubleshoot this is to remove the [[services]] block entirely, deploy the app, and then fly ssh console to it and see what you can connect to. Removing the services block will make it inaccessible from outside, but it will let it run so you can prod at it.
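Once you’re on the instance, a quick way to see what accepts connections is a small TCP probe. This is just a sketch: the PortCheck module is made up for illustration, and it only uses :gen_tcp from the Erlang standard library.

```elixir
defmodule PortCheck do
  # Hypothetical helper: tries to open a TCP connection and reports the result.
  def check(host, port, timeout \\ 1_000) do
    case :gen_tcp.connect(to_charlist(host), port, [:binary, active: false], timeout) do
      {:ok, socket} ->
        # Something is listening; close the probe connection right away.
        :gen_tcp.close(socket)
        :ok

      {:error, reason} ->
        # For example :econnrefused when nothing is listening on that address.
        {:error, reason}
    end
  end
end
```

Pasted into an IEx session on the instance, PortCheck.check("127.0.0.1", 4000) and the same call against the instance’s private address show which interfaces are accepting connections.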
I had a typo in a config file and wasn’t listening on port 4000 after all… Thanks for helping me figure it out, Kurt!
Now the app is running, and I was able to confirm that through the ssh console. However, I cannot access it via https://my-app.fly.dev: the browser says “This site can’t be reached”, and the logs don’t show any connection attempt.
I’m wondering if it is an issue with https. I’ll try a few different configs and open a different issue if I cannot figure it out.