Hello there, I have a Phoenix application that I have been deploying on Fly.io for quite a while now. However, my app experiences about 15-30 seconds of downtime on every deploy. I added a health check and explicitly set my deploy strategy to “canary” (it was previously “rolling” with a single machine), but that seems to have had little or no effect.
Here are some of the logs. I am also seeing some odd “machine is in a non-startable state” errors.
2024-03-17T20:37:35Z app[4d89d054c0e208] lax [info] INFO Sending signal SIGTERM to main child process w/ PID 314
2024-03-17T20:37:35Z app[4d89d054c0e208] lax [info] WARN Reaped child process with pid: 375 and signal: SIGUSR1, core dumped? false
2024-03-17T20:37:35Z app[4d89d054c0e208] lax [info] WARN Reaped child process with pid: 376 and signal: SIGUSR1, core dumped? false
2024-03-17T20:37:36Z app[4d89d054c0e208] lax [info] INFO Main child exited normally with code: 0
2024-03-17T20:37:36Z app[4d89d054c0e208] lax [info] INFO Starting clean up.
2024-03-17T20:37:36Z app[4d89d054c0e208] lax [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-17T20:37:36Z app[4d89d054c0e208] lax [info]2024/03/17 20:37:36 listening on [fdaa:4:b1eb:a7b:235:e8d8:3a23:2]:22 (DNS: [fdaa::3]:53)
2024-03-17T20:37:36Z app[6e82479eb20048] lax [info]20:37:36.980 request_id=F72nu3HD5UUbI8kAAAGy [info] GET /chats/aee63e7c-0cce-42af-ae33-821c8e7fa5eb
2024-03-17T20:37:36Z app[6e82479eb20048] lax [info]20:37:36.995 request_id=F72nu3HD5UUbI8kAAAGy [info] Sent 200 in 15ms
2024-03-17T20:37:37Z app[4d89d054c0e208] lax [info][ 9.416599] reboot: Restarting system
2024-03-17T20:37:37Z runner[6e82479eb20048] lax [info]Pulling container image registry.fly.io/axflow-studio:deployment-01HS723QKYHC4F83JH6FKE7Q4D
2024-03-17T20:37:38Z runner[6e82479eb20048] lax [info]Successfully prepared image registry.fly.io/axflow-studio:deployment-01HS723QKYHC4F83JH6FKE7Q4D (660.369963ms)
2024-03-17T20:37:39Z runner[6e82479eb20048] lax [info]Configuring firecracker
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info] INFO Sending signal SIGTERM to main child process w/ PID 314
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.215 [notice] SIGTERM received - shutting down
2024-03-17T20:37:39Z proxy[4d89d054c0e208] lax [error]timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2474.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2468.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2476.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2475.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2473.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2469.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2472.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:39Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:39Z proxy[4d89d054c0e208] lax [error]machine is in a non-startable state: destroyed
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info] WARN Reaped child process with pid: 375 and signal: SIGUSR1, core dumped? false
2024-03-17T20:37:40Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:40Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:40Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:40Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:40Z app[6e82479eb20048] lax [info] INFO Main child exited normally with code: 0
2024-03-17T20:37:40Z app[6e82479eb20048] lax [info] INFO Starting clean up.
2024-03-17T20:37:40Z app[6e82479eb20048] lax [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-17T20:37:40Z app[6e82479eb20048] lax [info]2024/03/17 20:37:40 listening on [fdaa:4:b1eb:a7b:26a:ea64:240c:2]:22 (DNS: [fdaa::3]:53)
2024-03-17T20:37:41Z app[6e82479eb20048] lax [info][ 187.829617] reboot: Restarting system
2024-03-17T20:37:42Z app[6e82479eb20048] lax [info][ 0.223092] PCI: Fatal: No config space access function found
2024-03-17T20:37:42Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:42Z app[6e82479eb20048] lax [info] INFO Starting init (commit: 913ad9c)...
2024-03-17T20:37:43Z app[6e82479eb20048] lax [info] INFO Preparing to run: `/app/bin/server` as nobody
2024-03-17T20:37:43Z app[6e82479eb20048] lax [info] INFO [fly api proxy] listening at /.fly/api
2024-03-17T20:37:43Z app[6e82479eb20048] lax [info]2024/03/17 20:37:43 listening on [fdaa:4:b1eb:a7b:26a:ea64:240c:2]:22 (DNS: [fdaa::3]:53)
2024-03-17T20:37:43Z runner[6e82479eb20048] lax [info]Machine created and started in 5.441s
2024-03-17T20:37:44Z proxy[6e82479eb20048] lax [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2024-03-17T20:37:46Z app[6e82479eb20048] lax [info]20:37:46.470 [info] Running StudioWeb.Endpoint with Bandit 1.2.2 at :::8080 (http)
Is this downtime expected? I must be missing something, since I would assume it’s critical to keep at least one instance alive while a deployment is happening.
Here is my fly.toml:
app = "my-app"
primary_region = "lax"
kill_signal = "SIGTERM"
kill_timeout = 180

[build]

[deploy]
  strategy = "canary"
  release_command = "/app/bin/migrate"

[env]
  PHX_HOST = "studio.axflow.dev"
  PORT = "8080"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = false
  auto_start_machines = false
  min_machines_running = 1
  processes = ["app"]

  [http_service.concurrency]
    type = "connections"
    hard_limit = 1000
    soft_limit = 1000

  [[http_service.checks]]
    grace_period = "10s"
    interval = "30s"
    method = "GET"
    timeout = "5s"
    path = "/healthz"

[[vm]]
  cpu_kind = "shared"
  cpus = 2
  memory_mb = 4096
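In case the handler itself matters: /healthz is just a minimal liveness endpoint that returns 200 with no dependencies on the database or the router. A sketch of the kind of plug I mean (module name is illustrative, not necessarily what's in my codebase):

```elixir
# Minimal liveness plug for /healthz (illustrative module name).
# Mounted in the endpoint ahead of the router, so the proxy's health
# check never depends on sessions, the DB, or application routing.
defmodule StudioWeb.HealthCheck do
  import Plug.Conn

  def init(opts), do: opts

  # Short-circuit health-check requests with a bare 200.
  def call(%Plug.Conn{request_path: "/healthz"} = conn, _opts) do
    conn
    |> send_resp(200, "ok")
    |> halt()
  end

  # Everything else falls through to the rest of the pipeline.
  def call(conn, _opts), do: conn
end
```

In endpoint.ex this would sit as `plug StudioWeb.HealthCheck` above `plug StudioWeb.Router`.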
Any ideas for how I can get my app to a place where deploying does not incur any downtime?