We no longer route network connections to instances that are failing their healthchecks.
To demonstrate, if you have a service set up with some healthchecks:
[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80
  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443
  [[services.http_checks]]
    interval = "5s"
    grace_period = "5s"
    method = "get"
    path = "/health"
    protocol = "http"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false
and those checks start failing:
$ fly checks list
Health Checks for checks-demo
NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT
----------------------------*----------*----------------*----------------------*------------------------------
servicecheck-00-http-8080 | critical | 5683ddd7c69448 | 2023-05-16T01:06:21Z | connect: connection refused
----------------------------*----------*----------------*----------------------*------------------------------
then network connections will not be routed to that instance. If you have another instance that is healthy, we will route to it instead. Otherwise, the connection will block waiting for the bad instance to become healthy.
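The rule described here can be pictured with a small toy model. This is only a sketch of the behaviour as stated in this post, not the proxy's actual implementation: the function name `pick_instance` is hypothetical, the `"passing"` status string is an assumption (only `critical` appears in the output above), and a real proxy would also weigh things like region and load.

```python
from typing import Optional


def pick_instance(statuses: dict[str, str]) -> Optional[str]:
    """Toy model: return the ID of an instance whose check is passing.

    `statuses` maps instance ID -> check status. Returns None when no
    instance is healthy, which stands in for "the connection waits".
    """
    # Sort only to make the choice deterministic in this sketch.
    healthy = sorted(i for i, s in statuses.items() if s == "passing")
    return healthy[0] if healthy else None
```

With one critical and one passing instance, the passing one is chosen. With only critical instances, the function returns `None`, which corresponds to the connection blocking until an instance recovers.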
Note that this change does not apply to top-level healthchecks defined in a [checks] section. This only works with checks defined under [[services.*_checks]]. The top-level checks are not used for routing because we don’t know which service they apply to.
Also, this only works through our proxy, i.e. connections that come in from the public internet or via .flycast domains. .internal domains will bypass any healthchecks.
Why should I care?
If you’re looking to make your app highly available and reduce potential downtime, this is for you. There are many reasons why you might want to take an instance out of rotation:
- the instance is overloaded and cannot respond to new requests quickly
- the underlying host has an issue
- the network is having problems
- the instance is still busy starting up and is not ready to serve requests
- an upstream dependency is having issues and you can’t live without it (e.g. a DB, or a third-party API)
If you can test any of those things via a healthcheck, you can guarantee that only healthy instances will be handling traffic.
(We also avoid routing to instances that are failing to respond on the internal_port or exceeding their connection limit, and have done so for a while.)
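As an illustration of testing those conditions via a healthcheck, here is a minimal sketch of a /health endpoint using only the Python standard library. It listens on port 8080 and serves /health to match the example config above. The helper names (`dependencies_ok`, `health_status`, `serve`) and the `ready` flag are hypothetical, and the sketch assumes the check treats any non-2xx response as a failure.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def dependencies_ok() -> bool:
    """Hypothetical probe: replace with a real DB ping or upstream API call."""
    return True


def health_status(ready: bool, deps_ok: bool) -> int:
    """Return 200 only when the app has started and its dependencies respond."""
    return 200 if ready and deps_ok else 503


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical flag: set to False while the app is still starting up.
    ready = True

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status = health_status(self.ready, dependencies_ok())
        body = b"ok\n" if status == 200 else b"unavailable\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve the health endpoint on the given port until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping the decision in `health_status` separate from the handler makes it easy to add further conditions (for example an overload signal) without touching the HTTP plumbing.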