Healthcheck-based Routing

We now avoid routing network connections to instances that are failing their healthchecks.

To demonstrate, if you have a service set up with some healthchecks:

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.http_checks]]
    interval = "5s"
    grace_period = "5s"
    method = "get"
    path = "/health"
    protocol = "http"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false

and those checks start failing:

$ fly checks list
Health Checks for checks-demo
  NAME                      | STATUS   | MACHINE        | LAST UPDATED         | OUTPUT                       
----------------------------*----------*----------------*----------------------*------------------------------
  servicecheck-00-http-8080 | critical | 5683ddd7c69448 | 2023-05-16T01:06:21Z | connect: connection refused  
----------------------------*----------*----------------*----------------------*------------------------------

then network connections will not be routed to that instance. If you have another instance that is healthy, we will route to it instead. Otherwise, the connection will block waiting for the bad instance to become healthy.

Note that this change does not apply to top-level healthchecks defined in a [checks] section. This only works with checks defined under [[services.*_checks]]. The top-level checks are not used for routing because we don’t know which service they apply to.
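
For reference, a top-level check looks roughly like this (a minimal sketch; the check name "status" is arbitrary). It still shows up in fly checks list, but it has no effect on routing:

[checks]
  [checks.status]
    type = "http"
    port = 8080
    path = "/health"
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"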

Also, this only applies to connections that go through our proxy, i.e. connections that come in from the public internet or via .flycast domains. Connections made directly to .internal addresses bypass the proxy, and therefore bypass healthcheck-based routing entirely.
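
As a rough illustration, run from a Machine or WireGuard peer on the app's private network (this assumes the checks-demo app from the output above, with a private Flycast address allocated):

# The Flycast name resolves to a single private address handled by fly-proxy;
# connections to it skip instances with failing checks:
$ dig +short aaaa checks-demo.flycast

# .internal resolves every started instance directly, healthy or not, so
# connections made this way bypass healthcheck-based routing:
$ dig +short aaaa checks-demo.internal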

Why should I care?

If you’re looking to make your app highly available and reduce potential downtime, this is for you. There are many reasons why you might want to take an instance out of rotation:

  • the instance is overloaded and cannot respond to new requests quickly
  • the underlying host has an issue
  • the network is having problems
  • the instance is still busy starting up and is not ready to serve requests
  • an upstream dependency is having issues and you can’t live without it (e.g. a DB, or a third-party API)

If you can test any of those things via a healthcheck, you can guarantee that only healthy instances will be handling traffic.
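
This works with any check type under [[services.*_checks]], not just HTTP. For example, a bare-bones TCP check (a sketch; it only verifies that the internal_port accepts connections) looks like:

  [[services.tcp_checks]]
    interval = "10s"
    grace_period = "5s"
    timeout = "2s"
    restart_limit = 0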

(We also avoid routing to instances that have been failing to respond on their internal_port, or have been over their connection limit, for a while.)
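
The connection limit mentioned here comes from the service's concurrency settings; for reference, a typical block looks like this (the values are illustrative):

  [services.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20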

Is it possible to have the deployment mechanism ignore a particular healthcheck? E.g. you have a proxy HTTP healthcheck and a non-proxy TCP healthcheck, and you want the deploy to only look at the non-proxy TCP healthcheck to determine aliveness, since the proxy HTTP healthcheck doesn’t necessarily represent aliveness, but you still don’t want the proxy to send the VM requests until it’s ready. The HTTP service could also come up and down without the non-proxy TCP healthcheck changing.

Not at the moment, but good to know your use case. Are you thinking of something like how Kubernetes has separate liveness and readiness probes?

Yes, that sounds like what I’m after.

My particular use case is CockroachDB: for a variety of reasons, a particular VM in a cluster may not be in a state to accept SQL queries from clients while still participating in the cluster, so it would be bad for stability if the deploy/machines system killed or restarted it.

Ah, so close. Any plans to introduce routing decisions based on top-level [checks], too? For our app, I wouldn’t mind all of a VM’s services being considered dead when its top-level [checks] fail.

I’ve just read

which says load-balancing is cross-region (I don’t remember seeing that before). Does the same apply here?

I once saw a region become unavailable due to managed Redis connection problems (back in the apps v1 days!), and requests there just hung rather than being routed elsewhere. As I now also have a Redis connection test in my healthcheck, am I right in thinking that a recurrence like that would now see requests routed elsewhere?

I had the same question. The answer is yes:

:tada:

So it should be safe to have a single instance in a region (during deploys and downtime, requests get forwarded elsewhere).

And does http_service.http_checks work?

How does fly-proxy keep track of unhealthy instances?

Is it a blacklist at the fly-proxy level, or does the service discovery remove unhealthy instances from the pool? If the latter is the case, it would be awesome to have a way to filter healthy instances using DNS, like healthy.[something].internal.

Going to reinvent Multicast DNS / DNS-SD at this rate :wink:

I actually like the idea of exposing it since Fly already goes to the trouble of running the service discovery anyway :laughing:

Flycast addresses are awesome, but they add yet another hop/proxy that may not really be needed depending on the use case, e.g. communicating with internal services in the private network.

@Laurens it should work now :))
Update your flyctl to the latest version and redeploy your app
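
In other words, with the shorthand [http_service] section, checks go under [[http_service.checks]] and behave the same way for routing. A sketch mirroring the example at the top of this thread (double-check the exact table name against the fly.toml reference for your flyctl version):

[http_service]
  internal_port = 8080
  force_https = true

  [[http_service.checks]]
    interval = "5s"
    grace_period = "5s"
    method = "get"
    path = "/health"
    timeout = "2s"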

Tested it, thanks!

Is there any way to optimise the routing when an instance is “stopped”?

So I have a setup where there is one instance in the primary region, with another instance in another region (which is physically closer to me).

For the non-primary instance, I have it on auto-shutdown for now, so most of the time it’ll be in the “stopped” state.

Once the non-primary instance is stopped, if I then hit my app, instead of routing me to the running instance, the routing decides to route me to the non-primary (but closer to me) instance, so the end result is a jarring ~5 second wait whilst the machine wakes up.

In this case, wouldn’t it make more sense for the router to simply route to an available instance, rather than the closest?

Thanks.

Hi @fredwu

It works that way to maintain closest region routing and all the benefits that come with it (such as being able to store particular data closer to the user).

If it only went to nodes that are already running, you’d most likely end up in a situation where your node in the non-primary region is almost never woken up, as requests would keep getting sent to your primary region; at that point there’s not really any benefit to having a non-primary region.

Thanks @charsleysa, that makes sense. Though I was thinking that the initial request should go to the running instance whilst the closest instance is being woken up; once it’s up, it can then start serving subsequent requests.

As it is, the initial wake-up time is really hurting usability…

You could see if wake-up time can be optimised for your app? We run a NodeJS app, and it’s ready to go in 600ms, from what I see in the logs.

If you’re using Fly’s HTTP handler, could you set it up to route incoming requests to the primary until the woken-up machine is ready?

Thanks for the tips @ignoramous, I’ll look into it! :pray:
