Rolling update always causes downtime (proxy errors in log)

I cannot get a rolling deployment working that does not cause downtime for my app for at least 50-10 seconds. Using bluegreen I don’t get any downtime, but rolling always causes downtime (which it shouldn’t according to my understanding). I am also running 2 instances, so a max_unavailable of 1 should work IMO.

Here’s my fly.toml

app = 'foobar'
primary_region = 'ams'

[build]

[deploy]
  strategy = "rolling"
  max_unavailable = 1

[env]
  PORT = '3000'

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 1
  processes = ['app']

[[http_service.checks]]
  grace_period = "20s"
  interval = "30s"
  method = "GET"
  timeout = "5s"
  path = "/healthcheck"

[[vm]]
  size = 'shared-cpu-1x'
  memory = '512mb'
  cpu_kind = 'shared'
  cpus = 1

The end of the Dockerfile:

EXPOSE 3000
CMD [ "npm", "run", "start" ]

npm run start in turns calls NODE_ENV=production NODE_OPTIONS='--import ./instrumentation.server.mjs' PORT=3000 HOST=0.0.0.0 node ./server.js", where server.jsruns an express server, listening on the specified HOST and PORT.

During deploy I am getting the following logs from fly:

2025-03-06T17:16:26.379 runner[d8dd650b3d3e48] ams [info] Pulling container image registry.fly.io/dealday:deployment-01JNP75MSKBPTQ4V9V305DSYW4
2025-03-06T17:16:26.966 runner[78190eb9753198] ams [info] Pulling container image registry.fly.io/dealday:deployment-01JNP75MSKBPTQ4V9V305DSYW4
2025-03-06T17:16:27.317 proxy[d8dd650b3d3e48] ams [error] [PM08] machine is in a non-startable state: created
2025-03-06T17:16:27.351 proxy[78190eb9753198] ams [error] [PM08] machine is in a non-startable state: created
2025-03-06T17:16:27.505 proxy[d8dd650b3d3e48] ams [error] [PM08] machine is in a non-startable state: created
2025-03-06T17:16:27.633 proxy[78190eb9753198] ams [error] [PM08] machine is in a non-startable state: created
2025-03-06T17:16:28.394 proxy[d8dd650b3d3e48] ams [error] [PM08] machine is in a non-startable state: created
2025-03-06T17:16:32.465 runner[d8dd650b3d3e48] ams [info] Successfully prepared image registry.fly.io/dealday:deployment-01JNP75MSKBPTQ4V9V305DSYW4 (6.086634931s)
2025-03-06T17:16:32.922 runner[d8dd650b3d3e48] ams [info] Configuring firecracker
2025-03-06T17:16:32.956 app[d8dd650b3d3e48] ams [info] INFO Sending signal SIGINT to main child process w/ PID 627
2025-03-06T17:16:34.022 runner[78190eb9753198] ams [info] Successfully prepared image registry.fly.io/dealday:deployment-01JNP75MSKBPTQ4V9V305DSYW4 (7.055153559s)
2025-03-06T17:16:34.137 app[d8dd650b3d3e48] ams [info] GET /healthcheck 200 - 7 ms - 172.19.15.153
2025-03-06T17:16:34.991 runner[78190eb9753198] ams [info] Configuring firecracker
2025-03-06T17:16:38.067 app[d8dd650b3d3e48] ams [info] INFO Sending signal SIGTERM to main child process w/ PID 627
2025-03-06T17:16:38.789 app[d8dd650b3d3e48] ams [info] INFO Main child exited with signal (with signal 'SIGTERM', core dumped? false)
2025-03-06T17:16:38.806 app[d8dd650b3d3e48] ams [info] INFO Starting clean up.
2025-03-06T17:16:38.849 app[d8dd650b3d3e48] ams [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2025-03-06T17:16:38.849 app[d8dd650b3d3e48] ams [info] [ 390.572070] reboot: Restarting system
2025-03-06T17:16:39.667 health[d8dd650b3d3e48] ams [error] Health check on port 3000 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2025-03-06T17:16:42.395 proxy[d8dd650b3d3e48] fra [error] [PR04] could not find a good candidate within 20 attempts at load balancing
2025-03-06T17:16:44.669 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:44.670 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:16:54Z"
2025-03-06T17:16:46.679 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:46.680 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:16:58Z"
2025-03-06T17:16:48.688 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:48.690 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:16:58Z"
2025-03-06T17:16:49.297 app[d8dd650b3d3e48] ams [info] 2025-03-06T17:16:49.297283953 [01JNP7AQVNB7ZBTBWC0ZHP7VDC:main] Running Firecracker v1.7.0
2025-03-06T17:16:50.220 app[d8dd650b3d3e48] ams [info] INFO Starting init (commit: d15e62a13)...
2025-03-06T17:16:50.358 app[d8dd650b3d3e48] ams [info] INFO Preparing to run: `docker-entrypoint.sh npm run start` as root
2025-03-06T17:16:50.370 app[d8dd650b3d3e48] ams [info] INFO [fly api proxy] listening at /.fly/api
2025-03-06T17:16:50.394 runner[d8dd650b3d3e48] ams [info] Machine created and started in 24.023s
2025-03-06T17:16:50.696 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:50.698 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:02Z"
2025-03-06T17:16:50.766 app[d8dd650b3d3e48] ams [info] 2025/03/06 17:16:50 INFO SSH listening listen_address=[fdaa:9:e08:a7b:3b:e7b:f638:2]:22
2025-03-06T17:16:51.462 app[d8dd650b3d3e48] ams [info] > dealday@0.0.2 start
2025-03-06T17:16:51.462 app[d8dd650b3d3e48] ams [info] > NODE_ENV=production NODE_OPTIONS='--import ./instrumentation.server.mjs' PORT=3000 HOST=0.0.0.0 node ./server.js
2025-03-06T17:16:51.708 proxy[d8dd650b3d3e48] ams [error] [PC01] instance refused connection. is your app listening on 0.0.0.0:3000? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2025-03-06T17:16:53.716 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:53.717 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:02Z"
2025-03-06T17:16:55.726 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:55.728 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:06Z"
2025-03-06T17:16:57.738 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:57.739 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:06Z"
2025-03-06T17:16:58.285 app[d8dd650b3d3e48] ams [info] App listening on 0.0.0.0:3000
2025-03-06T17:16:58.743 proxy[78190eb9753198] fra [error] [PR04] could not find a good candidate within 20 attempts at load balancing
2025-03-06T17:16:59.089 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:59.090 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:11Z"
2025-03-06T17:16:59.163 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:59.164 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:11Z"
2025-03-06T17:16:59.456 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:59.457 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:11Z"
2025-03-06T17:16:59.877 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:59.878 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:16:59.945 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:16:59.946 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:17:00.182 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:00.183 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:11Z"
2025-03-06T17:17:00.455 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:00.455 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:17:01.006 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:01.007 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:17:02.361 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:02.362 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:15Z"
2025-03-06T17:17:02.929 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:02.931 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:15Z"
2025-03-06T17:17:04.371 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:04.373 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:15Z"
2025-03-06T17:17:04.939 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:04.941 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:15Z"
2025-03-06T17:17:06.382 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:06.384 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:15Z"
2025-03-06T17:17:06.913 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:06.915 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:19Z"
2025-03-06T17:17:08.392 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:08.393 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:19Z"
2025-03-06T17:17:08.923 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:08.925 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:19Z"
2025-03-06T17:17:09.128 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:09.129 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:19Z"
2025-03-06T17:17:09.228 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:09.228 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:17:09.520 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:09.520 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:17:10.268 app[d8dd650b3d3e48] ams [info] GET /healthcheck 200 - 39 ms - 172.19.15.153
2025-03-06T17:17:10.470 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:10.471 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "machine ID 78190eb9753198 lease currently held by 925d2b55-0e0d-5e44-9e28-20931b8dc5a4@tokens.fly.io, expires at 2025-03-06T17:17:19Z"
2025-03-06T17:17:10.585 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:10.585 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:17:10.652 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:10.652 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:17:10.808 health[d8dd650b3d3e48] ams [info] Health check on port 3000 is now passing.
2025-03-06T17:17:10.899 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:10.899 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:17:10.934 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:10.934 proxy[78190eb9753198] ams [error] [PM01] machines API returned an error: "rate limit exceeded"
2025-03-06T17:17:12.005 proxy[78190eb9753198] ams [info] Starting machine
2025-03-06T17:17:12.365 app[78190eb9753198] ams [info] 2025-03-06T17:17:12.365683536 [01JNP7AQY260F3KVNXAAN7X89W:main] Running Firecracker v1.7.0
2025-03-06T17:17:13.257 app[78190eb9753198] ams [info] INFO Starting init (commit: d15e62a13)...
2025-03-06T17:17:13.394 app[78190eb9753198] ams [info] INFO Preparing to run: `docker-entrypoint.sh npm run start` as root
2025-03-06T17:17:13.409 app[78190eb9753198] ams [info] INFO [fly api proxy] listening at /.fly/api
2025-03-06T17:17:13.549 runner[78190eb9753198] ams [info] Machine started in 1.455s
2025-03-06T17:17:13.593 proxy[78190eb9753198] ams [info] machine started in 1.587528983s
2025-03-06T17:17:13.699 app[78190eb9753198] ams [info] 2025/03/06 17:17:13 INFO SSH listening listen_address=[fdaa:9:e08:a7b:c988:b344:5ff3:2]:22
2025-03-06T17:17:14.527 app[78190eb9753198] ams [info] > dealday@0.0.2 start
2025-03-06T17:17:14.527 app[78190eb9753198] ams [info] > NODE_ENV=production NODE_OPTIONS='--import ./instrumentation.server.mjs' PORT=3000 HOST=0.0.0.0 node ./server.js
2025-03-06T17:17:18.918 proxy[78190eb9753198] ams [info] waiting for machine to be reachable on 0.0.0.0:3000 (waited 5.324855392s so far)
2025-03-06T17:17:19.064 proxy[78190eb9753198] ams [info] waiting for machine to be reachable on 0.0.0.0:3000 (waited 5.471371569s so far)
2025-03-06T17:17:19.094 proxy[78190eb9753198] ams [info] waiting for machine to be reachable on 0.0.0.0:3000 (waited 5.50110571s so far)
2025-03-06T17:17:20.400 app[78190eb9753198] ams [info] App listening on 0.0.0.0:3000
2025-03-06T17:17:21.028 proxy[78190eb9753198] ams [info] machine became reachable in 7.435439141s
2025-03-06T17:17:21.078 app[78190eb9753198] ams [info] GET /healthcheck 200 - 38 ms - 35.180.47.6
2025-03-06T17:17:21.099 proxy[78190eb9753198] ams [info] machine became reachable in 7.506403021s

It is unclear to me 1. how I can have any influence over what fly proxy is doing and the errors it is throwing 2. why those errors would result in even the old deployment no longer working and the app returning a 502.

Any help in figuring out this (supposed) proxy issue would be much appreciated, as going back to a bluegreen deployment is not an option for me.

Also I don’t understand why it’s giving me instance refused connection. is your app listening on 0.0.0.0:3000? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on), when the app is clearly listening on that port, just not instantaneously. Is there a way to configure a grace period for this initial check?

2025-03-06T17:16:39.667 health[d8dd650b3d3e48] ams [error] Health check on port 3000 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.

Do you have an endpoint at /healthcheck that produces a 2xx response?

Yes, there are later messages indicating a valid healthcheck:
2025-03-06T17:17:10.808 health[d8dd650b3d3e48] ams [info] Health check on port 3000 is now passing.

But as you can see by the order of messages, the first healthcheck is attempted before the init is even issued, which makes no sense and I’ve always just attributed to a fluke in fly’s system.

Also even if that were the case to me it doesn’t explain why fly proxy would return a 502, since the old machine is still up and running.

could not find a good candidate within 20 attempts at load balancing is the message that troubles me. Again how can that be the case with my setup?

My app is configured for two instances. Due to auto-scaling only one is started right now. It seems as if fly might be attempting to replace the running one instead of stopped one

In this forum, some folks recently mentioned that the health-checks not being respected, and thus their multi-machine deployments were having some downtime. However a cursory search just now does not let me find those messages; maybe you’ll have more luck.

I do agree this should work, and I’ll be interested in your progress.

One thing you could try: when your app is in a healthy multi-machine state, create a new machine containing a new software release, but don’t start it. Then cordon it, start it, wait for the health-check to go green by monitoring it yourself, and then un-cordoning it. I think it would merge into the load-balanced app, and then you could stop one of the original machines containing the old software.

This is a bit of a faff, but I think it would work, if you run out of more straightforward options.

Surely enough, when I force both machines to stay up by bumping the min_machines_running to 2 in the http_service config, a deploy does not cause any downtime, even with those same healthcheck errors. So I think while not ideal that’s actually not related. It seems to me that this is a genuine bug with fly where during a rolling update, it kills running machines instead of stopped ones.

2 Likes

Fly support has confirmed that the behaviour I’m seeing, while unfortunate, is correct. You can’t have a rolling deployment with a min number of machines of 1 and not have downtime. Something that should definitely be noted somewhere.

1 Like

Good to note. You can add a script to scale your app to 2 before your deployment, then reset it back to 1 after success.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.