I have a fairly straightforward Phoenix app that I am trying to deploy. The problem is likely simple but I need some rubber duck help. Here is the failure:
==> Failure #1
Instance
ID = <redacted> Task = Version = 24
Region = sea
Desired = run Status = running Health Checks = 1 total, 1 critical
Restarts = 2 Created = 4m35s ago
Recent Events
TIMESTAMP TYPE MESSAGE 2021-09-26T14:47:32Z Received Task received by client 2021-09-26T14:47:32Z Task Setup Building Task Directory
2021-09-26T14:47:35Z Started Task started by client 2021-09-26T14:49:22Z Restart Signaled healthcheck: check "f557ee1d070fbd1796f8c0685877b786" unhealthy 2021-09-26T14:49:31Z Terminated Exit Code: 0
2021-09-26T14:49:31Z Restarting Task restarting in 1.007520679s
2021-09-26T14:49:38Z Started Task started by client 2021-09-26T14:51:24Z Restart Signaled healthcheck: check "f557ee1d070fbd1796f8c0685877b786" unhealthy 2021-09-26T14:51:34Z Terminated Exit Code: 0
2021-09-26T14:51:34Z Restarting Task restarting in 1.165986767s 2021-09-26T14:51:40Z Started Task started by client
Recent Logs
2021-09-26T14:49:29Z [info] Health check status changed 'critical' => 'passing'
2021-09-26T14:49:36Z [info] Starting instance
2021-09-26T14:49:36Z [info] Configuring virtual machine
2021-09-26T14:49:36Z [info] Pulling container image
2021-09-26T14:49:37Z [info] Unpacking image
2021-09-26T14:49:37Z [info] Preparing kernel init
2021-09-26T14:49:37Z [info] Configuring firecracker
2021-09-26T14:49:37Z [info] Starting virtual machine
2021-09-26T14:49:38Z [info] Starting init (commit: 50ffe20)...
2021-09-26T14:49:38Z [info] Preparing to run: `bin/shrimp start` as nobody
2021-09-26T14:49:38Z [info] 2021/09/26 14:49:38 listening on [<redacted>]:22 (DNS: [fdaa::3]:53)
2021-09-26T14:49:39Z [info] Reaped child process with pid: 547, exit code: 0
2021-09-26T14:49:41Z [info] Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false
2021-09-26T14:51:24Z [info] Shutting down virtual machine
2021-09-26T14:51:24Z [info] Sending signal SIGTERM to main child process w/ PID 510
2021-09-26T14:51:25Z [info] Reaped child process with pid: 570 and signal: SIGUSR1, core dumped? false
2021-09-26T14:51:30Z [info] Health check status changed 'critical' => 'passing'
2021-09-26T14:51:39Z [info] Starting instance
2021-09-26T14:51:39Z [info] Configuring virtual machine
2021-09-26T14:51:39Z [info] Pulling container image
2021-09-26T14:51:39Z [info] Unpacking image
2021-09-26T14:51:39Z [info] Preparing kernel init
2021-09-26T14:51:40Z [info] Configuring firecracker
2021-09-26T14:51:40Z [info] Starting virtual machine
2021-09-26T14:51:40Z [info] Starting init (commit: 50ffe20)...
2021-09-26T14:51:40Z [info] Preparing to run: `bin/shrimp start` as nobody
2021-09-26T14:51:40Z [info] 2021/09/26 14:51:40 listening on [<redacted>]:22 (DNS: [fdaa::3]:53)
2021-09-26T14:51:41Z [info] Reaped child process with pid: 547, exit code: 0
2021-09-26T14:51:43Z [info] Reaped child process with pid: 568 and signal: SIGUSR1, core dumped? false
2021-09-26T14:51:55Z [error] Health check status changed 'warning' => 'critical'
In reviewing the Discourse topics it seems that this is most frequently caused by mismatched PORT
settings between the Dockerfile
, fly.toml
, and the app’s config/runtime.exs
, but all of those are in agreement to use PORT=4001
.
Here is the fly.toml
:
# fly.toml file generated for <redacted> on 2021-09-24T00:56:04Z
app = "<redacted>"
kill_signal = "SIGTERM"
kill_timeout = 5
processes = []
[deploy]
# release_command = "/app/bin/shrimp eval Shrimp.Release.migrate"
[env]
[experimental]
allowed_public_ports = []
auto_rollback = true
[[services]]
http_checks = []
internal_port = 4001
processes = ["app"]
protocol = "tcp"
script_checks = []
[services.concurrency]
hard_limit = 25
soft_limit = 20
type = "connections"
[[services.ports]]
handlers = ["http"]
port = 80
[[services.ports]]
handlers = ["tls", "http"]
port = 443
[[services.tcp_checks]]
grace_period = "30s"
interval = "15s"
restart_limit = 6
timeout = "2s"
I commented out the release_command
to see if I could just get the app to start but it failed in either scenario (commented, or uncommented).
I also ran LOG_LEVEL=debug fly deploy
but I won’t include it here as it was very overwhelming. I’m happy to grep
through it for something specific.
This app does use an :ets
table but even leaving that GenServer
out of the Supervisor
chain doesn’t seem to prevent the SIGUSR1
signal from ending the child processes. Additionally, the app isn’t sending any exception traces to Sentry.
I had a prior deploy of another app that had similar issues because the BEAM memory size was bumping up against the 256GB limit, so I scaled the memory on this vm to 2048 but that did not resolve the issue.
Lastly, I know it is Sunday still in most of the world so please do not address this until normal work hours, this is just the time I have to post. Thanks in advance for any help. Happy to provide any other info.