Healthcheck-based Routing

We now avoid routing network connections to instances that are failing their healthchecks.

To demonstrate, if you have a service set up with some healthchecks:

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = "5s"
    grace_period = "5s"
    method = "get"
    path = "/health"
    protocol = "http"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false

and those checks start failing:

$ fly checks list
Health Checks for checks-demo
  NAME                      | STATUS   | MACHINE        | LAST UPDATED         | OUTPUT                       
----------------------------*----------*----------------*----------------------*------------------------------
  servicecheck-00-http-8080 | critical | 5683ddd7c69448 | 2023-05-16T01:06:21Z | connect: connection refused  
----------------------------*----------*----------------*----------------------*------------------------------

then network connections will not be routed to that instance. If you have another instance that is healthy, we will route to it instead. Otherwise, the connection will block waiting for the bad instance to become healthy.

Note that this change does not apply to top-level healthchecks defined in a [checks] section. This only works with checks defined under [[services.*_checks]]. The top-level checks are not used for routing because we don’t know which service they apply to.
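
For reference, a top-level check looks roughly like this (a minimal sketch; the check name "status" is arbitrary). It still shows up in fly checks list, but it has no effect on routing:

[checks]
  [checks.status]
    type = "http"
    port = 8080
    path = "/health"
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"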

Also, this only applies to connections that go through our proxy, i.e. connections that come in from the public internet or via .flycast domains. Connections made directly to .internal addresses bypass the proxy, and therefore bypass healthcheck-based routing entirely.
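
As a rough illustration, run from a Machine or WireGuard peer on the app's private network (this assumes the checks-demo app from the output above, with a private Flycast address allocated):

# The Flycast name resolves to a single private address handled by fly-proxy;
# connections to it skip instances with failing checks:
$ dig +short aaaa checks-demo.flycast

# .internal resolves every started instance directly, healthy or not, so
# connections made this way bypass healthcheck-based routing:
$ dig +short aaaa checks-demo.internal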

Why should I care?

If you’re looking to make your app highly available and reduce potential downtime, this is for you. There are many reasons why you might want to take an instance out of rotation:

  • the instance is overloaded and cannot respond to new requests quickly
  • the underlying host has an issue
  • the network is having problems
  • the instance is still busy starting up and is not ready to serve requests
  • an upstream dependency is having issues and you can’t live without it (e.g. a DB, or a third-party API)

If you can test any of those things via a healthcheck, you can guarantee that only healthy instances will be handling traffic.
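
This works with any check type under [[services.*_checks]], not just HTTP. For example, a bare-bones TCP check (a sketch; it only verifies that the internal_port accepts connections) looks like:

  [[services.tcp_checks]]
    interval = "10s"
    grace_period = "5s"
    timeout = "2s"
    restart_limit = 0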

(We also avoid routing to instances that have been failing to respond on their internal_port, or have been over their connection limit, for a while.)
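
The connection limit mentioned here comes from the service's concurrency settings; for reference, a typical block looks like this (the values are illustrative):

  [services.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20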

Is it possible to have the deployment mechanism ignore a particular healthcheck? E.g. you have a proxy HTTP healthcheck and a non-proxy TCP healthcheck, and you want the deploy to only look at the non-proxy TCP healthcheck to determine aliveness, since the proxy HTTP healthcheck doesn’t necessarily represent aliveness, but you still don’t want the proxy to send the VM requests until it’s ready. The HTTP service could also come up and down without the non-proxy TCP healthcheck changing.

Not at the moment, but good to know your use case. Are you thinking of something like how Kubernetes has separate liveness and readiness probes?

Yes, that sounds like what I’m after.

My particular use case is CockroachDB: for a variety of reasons, a particular VM in a cluster may not be in a state to accept SQL queries from clients while still participating in the cluster, so it would be bad for stability if the deploy/machines system killed or restarted it.

Ah, so close. Any plans to introduce routing decisions based on top-level [checks], too? For our app, I wouldn’t mind all of a VM’s services being considered dead when its top-level [checks] fail.

I’ve just read

which says load-balancing is cross-region (I don’t remember seeing that before). Does the same apply here?

I once saw a region become unavailable due to managed Redis connection problems (back in the apps v1 days!), and requests there just hung rather than being routed elsewhere. As I now also have a Redis connection test in my healthcheck, am I right in thinking that a recurrence like that would now see requests routed elsewhere?

I had the same question. The answer is yes:

:tada:

So it should be safe to have a single instance in a region (during deploys and downtime, requests get forwarded elsewhere).

And does http_service.http_checks work?

How does fly-proxy keep track of unhealthy instances?

Is it a blacklist at the fly-proxy level, or does the service discovery remove unhealthy instances from the pool? If the latter is the case, it would be awesome to have a way to filter healthy instances using DNS, like healthy.[something].internal.

Going to reinvent Multicast DNS / DNS-SD at this rate :wink:

I actually like the idea of exposing it since Fly already goes to the trouble of running the service discovery anyway :laughing:

Flycast addresses are awesome, but they add yet another hop/proxy that may not really be needed depending on the use case, e.g. communicating with internal services in the private network.

@Laurens it should work now :))
Update your flyctl to the latest version and redeploy your app
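
In other words, with the shorthand [http_service] section, checks go under [[http_service.checks]] and behave the same way for routing. A sketch mirroring the example at the top of this thread (double-check the exact table name against the fly.toml reference for your flyctl version):

[http_service]
  internal_port = 8080
  force_https = true

  [[http_service.checks]]
    interval = "5s"
    grace_period = "5s"
    method = "get"
    path = "/health"
    timeout = "2s"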

Tested it, thanks!

Is there any way to optimise the routing when an instance is “stopped”?

So I have a setup where there is one instance in the primary region, with another instance in another region (which is physically closer to me).

For the non-primary instance, I have it on auto-shutdown for now, so most of the time it’ll be in the “stopped” state.

Once the non-primary instance is stopped, if I then hit my app, instead of routing me to the running instance, the routing decides to route me to the non-primary (but closer to me) instance, so the end result is a jarring ~5 second wait whilst the machine wakes up.

In this case, wouldn’t it make more sense for the router to simply route to an available instance, rather than the closest?

Thanks.

Hi @fredwu

It works that way to maintain closest region routing and all the benefits that come with it (such as being able to store particular data closer to the user).

If it only went to nodes that are already running, you’d most likely end up in a situation where your node in the non-primary region is almost never woken up, as requests would keep getting sent to your primary region; at that point there’s not really any benefit to having a non-primary region.

Thanks @charsleysa, that makes sense. Though I was thinking that the initial request should go to the running instance whilst the closest instance is being woken up; once it’s up, it can then start serving subsequent requests.

As it is, the initial wake-up time is really hurting usability…

You could see if wake-up time can be optimised for your app? We run a NodeJS app, and it’s ready to go in 600ms, from what I see in the logs.

If you’re using Fly’s HTTP handler, could you set it up to route incoming requests to the primary until the woken-up machine is ready?

Thanks for the tips @ignoramous, I’ll look into it! :pray:
