Healthcheck-based Routing

We no longer route network connections to instances that are failing their healthchecks.

To demonstrate, if you have a service set up with some healthchecks:

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = "5s"
    grace_period = "5s"
    method = "get"
    path = "/health"
    protocol = "http"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false

and those checks start failing:

$ fly checks list
Health Checks for checks-demo
  NAME                      | STATUS   | MACHINE        | LAST UPDATED         | OUTPUT                       
----------------------------*----------*----------------*----------------------*------------------------------
  servicecheck-00-http-8080 | critical | 5683ddd7c69448 | 2023-05-16T01:06:21Z | connect: connection refused  
----------------------------*----------*----------------*----------------------*------------------------------

then network connections will not be routed to that instance. If you have another instance that is healthy, we will route to it instead. Otherwise, the connection will block waiting for the bad instance to become healthy.

Note that this change does not apply to top-level healthchecks defined in a [checks] section; it only works with checks defined under [[services.*_checks]]. Top-level checks are not used for routing because we don’t know which service they apply to.
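For contrast, a top-level check is a named entry under [checks] and looks roughly like this (the check name and path here are made up for illustration):

[checks]
  [checks.status]
    # Monitored like any other check, but not used for routing decisions.
    type = "http"
    port = 8080
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"
    method = "GET"
    path = "/status"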

Also, this only works for connections that go through our proxy, i.e. connections that come in from the public internet or via .flycast domains. Connections to .internal domains don’t go through the proxy, so they bypass healthcheck-based routing entirely.

Why should I care?

If you’re looking to make your app highly available, and reduce the amount of potential downtime, this is for you. There are many reasons why you might want to take an instance out of rotation:

  • the instance is overloaded and cannot respond to new requests quickly
  • the underlying host has an issue
  • the network is having problems
  • the instance is still busy starting up and is not ready to serve requests
  • an upstream dependency is having issues and you can’t live without it (e.g. a DB, or a third-party API)

If you can test any of those things via a healthcheck, you can guarantee that only healthy instances will be handling traffic.
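For instance, taking the last bullet: if your app exposes an endpoint that only returns 200 when its database connection is usable (the /health/db path is a made-up example), you could add it as a second service-level check alongside the one above, and the proxy would stop routing to an instance whose database access is broken:

  [[services.http_checks]]
    interval = "10s"
    grace_period = "10s"
    method = "get"
    path = "/health/db"   # hypothetical endpoint that pings the database
    protocol = "http"
    timeout = 2000
    tls_skip_verify = false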

(We also avoid routing to instances that have been failing to respond on their internal_port, or exceeding their connection limit, for a while.)


Is it possible to have the deployment mechanism ignore a particular healthcheck? E.g. you have a proxy http healthcheck and a non-proxy tcp healthcheck, and you want the deploy to only look at the non-proxy tcp healthcheck to determine aliveness, since the proxy http healthcheck doesn’t necessarily represent aliveness, but you still don’t want the proxy to send the VM requests until it’s ready. Also, the http service could come up and down without the non-proxy tcp healthcheck changing.
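To sketch what that split looks like in fly.toml terms (names, ports, and intervals are illustrative, and this assumes both kinds of checks can live in the same config): the proxy http healthcheck would be a [[services.http_checks]] entry like the one at the top of this post, while the non-proxy tcp healthcheck would be a top-level check, e.g.:

[checks]
  [checks.alive]
    # Basic liveness probe, independent of the proxy-facing http check.
    type = "tcp"
    port = 8080
    interval = "15s"
    timeout = "2s"
    grace_period = "5s"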


Not at the moment, but good to know your use case. Are you thinking of something like how Kubernetes has separate liveness and readiness probes?


Yes, that sounds like what I’m after.

My particular use case would be with CockroachDB: for a variety of reasons, a particular VM in a cluster may not be in a state to accept SQL queries from clients, but it’s still participating in the cluster, so it would be bad for stability if a deploy/machines system killed or restarted it.


Ah, so close. Any plans to introduce routing decisions based on top-level [checks], too? For our app, I wouldn’t mind all of a VM’s services being considered dead when the top-level [checks] fail.