Requests to app hang while waiting for old instances to shut down

Foreword: I have reproduced this across multiple apps (i.e. totally different Fly apps with totally different codebases), so as far as I can tell it’s either an issue with my Fly configuration or with Fly itself; it doesn’t appear to be an issue with my apps. I doubt it’s as simple as this, but it’s as if the Fly load balancer logic is “…do not send requests while there are instances shutting down…”.

During the deployment process, my new instances come online and pass their health checks after a few seconds, then the old instances start to shut down… and while the old instances are shutting down, all requests to the app hang, as if the proxy believes there are no instances available. Once the old instances disappear, the app starts responding to requests again.

The service I’m using for this example currently has a minimum instance count of 2 (max: 10) and a bluegreen deploy strategy, but this also occurs with the default deploy strategy and a single instance.

I wrote a script that logs fly status every 5 seconds, then makes a request to the service and logs the result. I ran the script while running a deploy and observed the following, based on the output of fly status (a simplified sketch of the script is included after the list):

  1. As the deploy is starting, the 2 existing instances have a status of running and a DesiredStatus of run
  2. As the deploy is finishing, an additional 2 instances (of the new version) are brought online. Once their health checks pass (status is passing), I can see there are now 4 instances (2 old, 2 new), each with a status of running and a DesiredStatus of run; the service continues to respond to HTTP requests for the next ~10 seconds
  3. The 2 old-version instances begin to shut down: their status is running and their DesiredStatus is stop… and now all HTTP requests to the service hang; no requests are being received by any of my instances (per the “Monitoring” page for the app)
  4. After about a minute, once the old instances are gone, any pending requests receive a response
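
For reference, here’s a simplified sketch of that polling script (not the exact script I used; it assumes Node 18+ for the built-in fetch, and APP_URL is a placeholder for the app’s public health URL):

    // poll.ts: log fly status and probe the app every 5 seconds (simplified sketch)
    import { execSync } from "node:child_process";

    const APP_URL = "https://<my-app>.fly.dev/__health"; // placeholder

    async function poll(): Promise<void> {
      // Log the current instance list as reported by flyctl.
      try {
        const status = execSync("fly status", { encoding: "utf8" });
        console.log(new Date().toISOString() + "\n" + status);
      } catch (err) {
        console.error("fly status failed:", err);
      }

      // Probe the app and log whether the request completed or hung.
      const started = Date.now();
      try {
        const res = await fetch(APP_URL, { signal: AbortSignal.timeout(10_000) });
        console.log(`request -> ${res.status} in ${Date.now() - started}ms`);
      } catch (err) {
        console.log(`request failed or timed out after ${Date.now() - started}ms`, err);
      }
    }

    setInterval(poll, 5_000);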

Essentially, as soon as “Starting clean up” appears in the logs, requests start hanging:

2022-12-08T17:48:36.058 runner[4fce3290] lhr [info] Shutting down virtual machine
2022-12-08T17:48:36.151 app[4fce3290] lhr [info] Sending signal SIGINT to main child process w/ PID 520
2022-12-08T17:48:36.158 runner[5ae1ff91] lhr [info] Shutting down virtual machine
2022-12-08T17:48:36.255 app[5ae1ff91] lhr [info] Sending signal SIGINT to main child process w/ PID 520
2022-12-08T17:48:36.424 app[4fce3290] lhr [info] Starting clean up.
2022-12-08T17:48:37.227 app[5ae1ff91] lhr [info] Starting clean up.
2022-12-08T17:49:25.856 app[2c56cf19] lhr [info] (my app output)

Note the 48 seconds between the last 2 lines (“Starting clean up” and “(my app output)”):

2022-12-08T17:48:37.227 app[5ae1ff91] lhr [info] Starting clean up.
2022-12-08T17:49:25.856 app[2c56cf19] lhr [info] (my app output)

There’s nothing that jumps out at me when looking at the output of fly status, and the output is the same regardless of whether the requests are failing or succeeding. I did notice that “CreatedAt” is wrong and Successful is false despite the Status being “successful”, but maybe those are just deprecated fields or something.

    "DeploymentStatus": {
        "ID": "ba9e7f76-093b-2503-01a6-0899515834b1",
        "Status": "successful",
        "Description": "Deployment completed successfully",
        "InProgress": false,
        "Successful": false,
        "CreatedAt": "0001-01-01T00:00:00Z",
        "Allocations": null,
        "Version": 49,
        "DesiredCount": 2,
        "PlacedCount": 2,
        "HealthyCount": 2,
        "UnhealthyCount": 0
    },

The app only takes a couple of seconds to start up (as confirmed locally) and the Fly health checks seem to be working (they take a few seconds to show as passing, indicating that they’re failing while the app is starting up). I’ve tried both TCP health checks (i.e. checking that the port responds) and HTTP health checks (a dedicated health endpoint in the app).
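
For completeness, the HTTP health endpoint itself is trivial. A trimmed-down sketch of the relevant shape (not the actual app code; plain node:http, Node 18+):

    // A minimal server with a /__health endpoint, matching internal_port = 8080
    import { createServer } from "node:http";

    const server = createServer((req, res) => {
      if (req.url === "/__health") {
        // Respond immediately; no downstream dependencies are checked here.
        res.writeHead(200, { "content-type": "text/plain" });
        res.end("ok");
        return;
      }
      // ...normal app routing...
      res.writeHead(200);
      res.end("hello");
    });

    server.listen(8080);

    // On SIGINT (the configured kill_signal), stop accepting connections and exit.
    process.on("SIGINT", () => {
      server.close(() => process.exit(0));
    });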

Any ideas what the problem is? I’ve deployed this app ~50 times and it happens every time; I’m only now investigating it properly before launch. I’ve also tried restarting the app and observed the same behaviour, likewise with adding new secrets. Every time, there’s a window of between 30 and 90 seconds where Fly appears to hold requests despite there being available instances of the new version.

App config:

kill_signal = "SIGINT"
kill_timeout = 5  # seconds old instances get after the kill_signal before being force-killed
processes = []

[env]
# ...

[experimental]
allowed_public_ports = []
auto_rollback = true
cmd = ["dist/node/index.js"]

[deploy]
strategy = "bluegreen"

[[services]]
internal_port = 8080
processes = ["app"]
protocol = "tcp"
script_checks = []

[[services.ports]]
force_https = true
handlers = ["http"]
port = 80

[[services.ports]]
handlers = ["tls", "http"]
port = 443

[services.concurrency]
hard_limit = 100
soft_limit = 20
type = "requests"

[[services.http_checks]]
grace_period = "5s"   # wait 5s after boot before health checks count
interval = 10000      # check every 10,000 ms
method = "get"
path = "/__health"
protocol = "http"
restart_limit = 0
timeout = 2000        # fail the check after 2,000 ms
tls_skip_verify = false

Thanks,

I’m experiencing the same issue with my app (using bluegreen too). As soon as the new instances are up and running (and healthy), the old ones get stopped, and then requests just hang for quite some time (30+ seconds).

I have several Fly apps that have been deployed to prod for a few months now, and I only started seeing this issue over the past couple of days, so it must be something relatively new.
