Foreword: I have reproduced this across multiple apps (i.e. totally different Fly apps with totally different codebases), so as far as I can tell it’s an issue with either my Fly configuration or Fly itself; it doesn’t appear to be an issue with my apps. I doubt it’s really this simple, but it’s as if the Fly load balancer logic were “…do not send requests while there are instances shutting down…”.
During the deployment process, my new instances come online and after a few seconds they pass their health checks, then the old instances start to shut down… and while the old instances are shutting down, all of the requests to the app hang, as if the proxy believes there are no instances available. Once the old instances disappear, the app starts responding to requests again.
The service I’m using for this example currently has a minimum instance count of 2 (max: 10) and a bluegreen deploy strategy, but this occurs with the default deploy method and 1 instance too.
I wrote a script that logs `fly status` every 5 seconds, then makes a request to the service and logs the result. I ran the script during a deploy and observed the following based on the output of `fly status`:
- As the deploy is starting, the 2 existing instances have a status of `running` and a `DesiredStatus` of `run`.
- As the deploy is finishing, an additional 2 instances (of the new version) are brought online. Once their health checks are passing (`status` is `passing`), I can see there are now 4 instances (2 of the old version, 2 of the new), each with a status of `running` and a `DesiredStatus` of `run`; the service continues to respond to HTTP requests for the next ~10 seconds.
- The 2 old-version instances begin to shut down: their status is `running` and their `DesiredStatus` is `stop`… and now all HTTP requests to the service hang; no requests are received by any of my instances (per the “Monitoring” page for the app).
- After about a minute, once the old instances are gone, any pending requests receive a response.
Essentially, as soon as “Starting clean up” appears in the logs, requests start hanging:
2022-12-08T17:48:36.058 runner[4fce3290] lhr [info] Shutting down virtual machine
2022-12-08T17:48:36.151 app[4fce3290] lhr [info] Sending signal SIGINT to main child process w/ PID 520
2022-12-08T17:48:36.158 runner[5ae1ff91] lhr [info] Shutting down virtual machine
2022-12-08T17:48:36.255 app[5ae1ff91] lhr [info] Sending signal SIGINT to main child process w/ PID 520
2022-12-08T17:48:36.424 app[4fce3290] lhr [info] Starting clean up.
2022-12-08T17:48:37.227 app[5ae1ff91] lhr [info] Starting clean up.
2022-12-08T17:49:25.856 app[2c56cf19] lhr [info] (my app output)
Note the 48 seconds between the last 2 lines (“Starting clean up” and “(my app output)”):
2022-12-08T17:48:37.227 app[5ae1ff91] lhr [info] Starting clean up.
2022-12-08T17:49:25.856 app[2c56cf19] lhr [info] (my app output)
There’s nothing that jumps out at me when looking at the output of `fly status`, and it’s the same regardless of whether the requests are failing or succeeding. I did notice that `CreatedAt` is wrong and `Successful` is `false` despite the `Status` being `successful`, but maybe those are just deprecated fields or something.
"DeploymentStatus": {
"ID": "ba9e7f76-093b-2503-01a6-0899515834b1",
"Status": "successful",
"Description": "Deployment completed successfully",
"InProgress": false,
"Successful": false,
"CreatedAt": "0001-01-01T00:00:00Z",
"Allocations": null,
"Version": 49,
"DesiredCount": 2,
"PlacedCount": 2,
"HealthyCount": 2,
"UnhealthyCount": 0
},
The app only takes a couple of seconds to start up (confirmed locally), and the Fly health checks seem to be working (they take a few seconds to show as passing, indicating that they fail while the app is starting up). I’ve tried both TCP health checks (i.e. checking that the port responds) and HTTP health checks (a special health endpoint in the app).
Any ideas what the problem is? I’ve deployed this app ~50 times and it happens every time; I’m only now investigating it properly before launch. I’ve also tried restarting the app and observed the same behaviour, likewise with adding new secrets. Every time, there’s a window of 30 to 90 seconds where Fly appears to hold requests despite there being available instances of the new version.
App config:
kill_signal = "SIGINT"
kill_timeout = 5
processes = []
[env]
# ...
[experimental]
allowed_public_ports = []
auto_rollback = true
cmd = ["dist/node/index.js"]
[deploy]
strategy = "bluegreen"
[[services]]
internal_port = 8080
processes = ["app"]
protocol = "tcp"
script_checks = []
[[services.ports]]
force_https = true
handlers = ["http"]
port = 80
[[services.ports]]
handlers = ["tls", "http"]
port = 443
[services.concurrency]
hard_limit = 100
soft_limit = 20
type = "requests"
[[services.http_checks]]
grace_period = "5s"
interval = 10000
method = "get"
path = "/__health"
protocol = "http"
restart_limit = 0
timeout = 2000
tls_skip_verify = false
Thanks,