Application downtime while deploying

Hello there, I have a Phoenix application that I have been deploying on Fly.io for quite a while now. However, my app always experiences about 15-30 seconds of downtime on each deploy. I added a health check and explicitly set my deploy to “canary” (since it was rolling with one machine before) and it seemed to have little or no effect.

Here are some of the logs. I am also seeing some odd errors about machines being in a non-startable state.

2024-03-17T20:37:35Z app[4d89d054c0e208] lax [info] INFO Sending signal SIGTERM to main child process w/ PID 314
2024-03-17T20:37:35Z app[4d89d054c0e208] lax [info] WARN Reaped child process with pid: 375 and signal: SIGUSR1, core dumped? false
2024-03-17T20:37:35Z app[4d89d054c0e208] lax [info] WARN Reaped child process with pid: 376 and signal: SIGUSR1, core dumped? false
2024-03-17T20:37:36Z app[4d89d054c0e208] lax [info] INFO Main child exited normally with code: 0
2024-03-17T20:37:36Z app[4d89d054c0e208] lax [info] INFO Starting clean up.
2024-03-17T20:37:36Z app[4d89d054c0e208] lax [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-17T20:37:36Z app[4d89d054c0e208] lax [info]2024/03/17 20:37:36 listening on [fdaa:4:b1eb:a7b:235:e8d8:3a23:2]:22 (DNS: [fdaa::3]:53)
2024-03-17T20:37:36Z app[6e82479eb20048] lax [info]20:37:36.980 request_id=F72nu3HD5UUbI8kAAAGy [info] GET /chats/aee63e7c-0cce-42af-ae33-821c8e7fa5eb
2024-03-17T20:37:36Z app[6e82479eb20048] lax [info]20:37:36.995 request_id=F72nu3HD5UUbI8kAAAGy [info] Sent 200 in 15ms
2024-03-17T20:37:37Z app[4d89d054c0e208] lax [info][    9.416599] reboot: Restarting system
2024-03-17T20:37:37Z runner[6e82479eb20048] lax [info]Pulling container image registry.fly.io/axflow-studio:deployment-01HS723QKYHC4F83JH6FKE7Q4D
2024-03-17T20:37:38Z runner[6e82479eb20048] lax [info]Successfully prepared image registry.fly.io/axflow-studio:deployment-01HS723QKYHC4F83JH6FKE7Q4D (660.369963ms)
2024-03-17T20:37:39Z runner[6e82479eb20048] lax [info]Configuring firecracker
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info] INFO Sending signal SIGTERM to main child process w/ PID 314
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.215 [notice] SIGTERM received - shutting down
2024-03-17T20:37:39Z proxy[4d89d054c0e208] lax [error]timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2474.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2468.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2476.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2475.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2473.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2469.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info]20:37:39.241 [info] Postgrex.Protocol (#PID<0.2472.0>) missed message: {:EXIT, #PID<0.2821.0>, :shutdown}
2024-03-17T20:37:39Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:39Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:39Z proxy[4d89d054c0e208] lax [error]machine is in a non-startable state: destroyed
2024-03-17T20:37:39Z app[6e82479eb20048] lax [info] WARN Reaped child process with pid: 375 and signal: SIGUSR1, core dumped? false
2024-03-17T20:37:40Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:40Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:40Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:40Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:40Z app[6e82479eb20048] lax [info] INFO Main child exited normally with code: 0
2024-03-17T20:37:40Z app[6e82479eb20048] lax [info] INFO Starting clean up.
2024-03-17T20:37:40Z app[6e82479eb20048] lax [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-17T20:37:40Z app[6e82479eb20048] lax [info]2024/03/17 20:37:40 listening on [fdaa:4:b1eb:a7b:26a:ea64:240c:2]:22 (DNS: [fdaa::3]:53)
2024-03-17T20:37:41Z app[6e82479eb20048] lax [info][  187.829617] reboot: Restarting system
2024-03-17T20:37:42Z app[6e82479eb20048] lax [info][    0.223092] PCI: Fatal: No config space access function found
2024-03-17T20:37:42Z proxy[6e82479eb20048] lax [error]machine is in a non-startable state: created
2024-03-17T20:37:42Z app[6e82479eb20048] lax [info] INFO Starting init (commit: 913ad9c)...
2024-03-17T20:37:43Z app[6e82479eb20048] lax [info] INFO Preparing to run: `/app/bin/server` as nobody
2024-03-17T20:37:43Z app[6e82479eb20048] lax [info] INFO [fly api proxy] listening at /.fly/api
2024-03-17T20:37:43Z app[6e82479eb20048] lax [info]2024/03/17 20:37:43 listening on [fdaa:4:b1eb:a7b:26a:ea64:240c:2]:22 (DNS: [fdaa::3]:53)
2024-03-17T20:37:43Z runner[6e82479eb20048] lax [info]Machine created and started in 5.441s
2024-03-17T20:37:44Z proxy[6e82479eb20048] lax [error]instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2024-03-17T20:37:46Z app[6e82479eb20048] lax [info]20:37:46.470 [info] Running StudioWeb.Endpoint with Bandit 1.2.2 at :::8080 (http)

Is downtime expected? I must be missing something, as I would assume it’s critical to keep one instance alive while a deployment is happening.

Here is my fly.toml

app = "my-app"
primary_region = "lax"
kill_signal = "SIGTERM"
kill_timeout = 180

[build]

[deploy]
strategy = "canary"
release_command = "/app/bin/migrate"

[env]
PHX_HOST = "studio.axflow.dev"
PORT = "8080"

[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = false
auto_start_machines = false
min_machines_running = 1
processes = ["app"]

[http_service.concurrency]
type = "connections"
hard_limit = 1000
soft_limit = 1000

[[http_service.checks]]
grace_period = "10s"
interval = "30s"
method = "GET"
timeout = "5s"
path = "/healthz"

[[vm]]
cpu_kind = "shared"
cpus = 2
memory_mb = 4096

Any ideas for how I can get my app to a place where deploying does not incur any downtime?

You can try making your checks more aggressive:

    interval = "5s"
    timeout = "2s"
    grace_period = "1s"

And you may try running fly deploy --strategy bluegreen

For a typical Phoenix app with --strategy immediate (don’t even try rolling, kill everything and start everything) I only see 7s of reconnect time. For bluegreen this drops to 2s. My guess is the 30s interval is hurting you here. If your app has a longer startup time you’ll want a higher grace period and interval, but a standard Phoenix supervision tree is up and ready to serve traffic within a few hundred ms.
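
If you want bluegreen without passing the flag on every deploy, you should also be able to set it in the same [deploy] section you already have in fly.toml. A sketch, reusing your existing release_command:

    [deploy]
    strategy = "bluegreen"
    release_command = "/app/bin/migrate"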

Hey Chris, thanks for your response!

Agree about minimizing the grace_period and those other values. That did help.

While pragmatically I think this is fine for us (especially for now), I’m wondering if there’s a way to guarantee no downtime with Fly infrastructure (in the happy path, no errors, at least)? Added context here is that our deployment currently has only one machine. I assume if I bumped it to multiple machines, there’s a way to roll one at a time. But what about even with a single machine? Is it possible with Fly to bring up the new instance, route traffic to that, then take the other one offline? Or will it only ever be able to have one instance and thus some amount of downtime (even if only a second)?

Thank you,
Ben

You’d need to be running more than one machine for a rolling deploy to have something for the proxy to proxy to. Bluegreen should get you what you describe, since Fly will start N fresh machines, wait for them to be healthy, then route over to the new set.

Also, does your app do anything on startup that prevents it from quickly booting the endpoint and accepting HTTP traffic? Like I mentioned, a standard Phoenix app with a DB should be up and running right away, within a few hundred ms. Are you blocking on start doing work? The proxy error logs indicate we can’t reach it within the time we expect to, but we really need more information about your app and whether this is something you still see across deploys. Thanks!

does your app do anything on startup that prevents quickly booting

No, so I dropped those settings (which I think I found on some fly docs or this forum somewhere) down to much smaller values, e.g., grace_period = "2s". I also switched to bluegreen. This seems to have largely solved all my problems. So, thank you!
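
For anyone landing here later, the check block I ended up with looks roughly like this (the interval and timeout values are just the ones Chris suggested above; only the grace_period = "2s" part is specific to my setup):

    [[http_service.checks]]
    grace_period = "2s"
    interval = "5s"
    method = "GET"
    timeout = "2s"
    path = "/healthz"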

The one possible caveat here is that I am running Oban workers in my app. These jobs can take up to a couple of minutes, so with graceful shutdown in place the app could take up to a few minutes to shut down. BUT, my endpoint shuts down before Oban, so I’m worried that my app could be unavailable for a few minutes while a background worker is finishing. This makes sense I suppose, but ideally there’d be an option (even when normally running only a single machine) for a Fly machine to boot and become available while my other one is shutting down, leading to zero downtime.
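
For context, the ordering I’m describing comes from a fairly standard setup: the endpoint is the last child in the supervision tree (so it’s the first to stop on SIGTERM), and Oban waits on running jobs up to its shutdown grace period. Roughly like this sketch (module names and queue settings are illustrative, not my exact code):

    # config/runtime.exs (sketch)
    import Config

    config :studio, Oban,
      repo: Studio.Repo,
      queues: [default: 10],
      # Give in-flight jobs up to 2 minutes to finish on shutdown.
      # Keep this below kill_timeout in fly.toml (180s in my case).
      shutdown_grace_period: :timer.minutes(2)

    # lib/studio/application.ex (sketch)
    children = [
      Studio.Repo,
      {Oban, Application.fetch_env!(:studio, Oban)},
      # Started last, so it is the first child stopped on shutdown,
      # which is why HTTP goes away while Oban is still draining.
      StudioWeb.Endpoint
    ]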

You’d need to be running more than one machine for a rolling deploy to have something for the proxy to proxy to

I think this was the confusion. While I know I have one machine running, I was assuming deploys to be the exception to this rule, i.e., spin up another machine and start moving traffic over to it before shutting the existing one down.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.