HTTP Health checks failing, but not restarting app

Hi!

I am not sure I have understood HTTP health checks correctly, but I have this kind of configuration:

  [[services.http_checks]]
    interval = 10000
    method = "get"
    path = "/healthcheck"
    protocol = "http"
    timeout = 5000
    tls_skip_verify = false

I have no TCP checks defined.

The instance status is currently running, and health checks show 1 total, 1 critical. I assumed Fly would restart my app if all(?) health checks fail, but no restart seems to happen.

flyctl checks list:

NAME                             STATUS   ALLOCATION REGION TYPE LAST UPDATED OUTPUT
5c800e7c9d8a343831f802ff4147a8ff critical c5fe99a5   lhr    HTTP 6m6s ago     HTTP GET
                                                                              http://172.19.2.2:3000/healthcheck:
                                                                              503 Service Unavailable Output:
                                                                              {"error":"internal error"}
5c800e7c9d8a343831f802ff4147a8ff critical 97162895   fra    HTTP 13s ago      HTTP GET
                                                                              http://172.19.1.130:3000/healthcheck:
                                                                              503 Service Unavailable Output:
                                                                              {"error":"internal error"}

So the health check is failing on both instances.

What I want is for Fly to restart the instance in this situation. Is that possible somehow?

I think I might understand it now:

  • When an instance has launched successfully and the HTTP check then fails, it will restart
  • If an instance starts and the check never passes, it keeps on running with no restarts

As our app needs a restart in specific situations, I have now implemented the restart inside the app as a workaround.

Try adding restart_limit = 6 to your check. This will make the service restart after 6 consecutive failures.
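For illustration, here is a sketch of how that might look in the check block from the question. Every value except restart_limit is copied from the original post, and putting the option inside that same block is an assumption about this config, not a tested setup:

```toml
  [[services.http_checks]]
    interval = 10000
    method = "get"
    path = "/healthcheck"
    protocol = "http"
    timeout = 5000
    tls_skip_verify = false
    # Added line: restart after 6 consecutive failed checks (value taken from the suggestion above)
    restart_limit = 6
```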

You didn’t really misunderstand:

  1. On deploy, checks have to pass to allow the deploy to continue. If a check fails, we restart a couple of times to make sure the error wasn’t transient.
  2. The restart_limit option controls restart after a VM is successfully deployed.

Restarting in the app is usually better, so your workaround might be worth keeping. The problem we had with restarting on checks is that when a backend resource fails, all the VMs might fail health checks at the same time and then all get restarted together.

Oh, thanks. As an extra request, could you add that to the docs? It currently seems to be missing from there.

Still, I need to think about whether to add that or keep this workaround. It seems to work now, so better not to fix it 🙂