I am baffled by the http_service.checks functionality in fly.toml. I have several apps running with no issues when I don’t define any http_service.checks. When I do define a check, even with excessively long timeouts, the check invariably fails, with no indication of what went wrong. The logs show a bunch of unreadable garbage, e.g.:
This will hit the /check/healthy endpoint in the application every minute. If it gets a success code (2xx), it considers the Machine healthy and keeps routing traffic to it. Otherwise, it logs the failure and stops sending traffic to that Machine, possibly routing it to other available Machines in the same application. In that sense it’s somewhat similar to a Kubernetes readiness probe.
“timeout” indicates how long to wait for a successful response before the check is considered failed.
“grace_period” indicates how long to wait after the application has started before performing the first health check. This gives your application time to finish starting up and become ready to serve requests.
There are more settings you can tweak; they are explained here.
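Putting those settings together, a checks section might look something like this (the endpoint, port, and timings below are illustrative, not taken from the app in question):

```toml
[http_service]
  internal_port = 8080
  force_https = true

  [[http_service.checks]]
    interval = "60s"        # run the check every minute
    timeout = "5s"          # fail if no response arrives within 5 seconds
    grace_period = "10s"    # wait 10s after boot before the first check
    method = "GET"
    path = "/check/healthy" # must return a 2xx status to pass
```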
What is the use case of http service check?
The http_service.checks section is simply a way to check the health of your app: it tells you (and the Fly proxy) whether your app is in good shape.
How are they supposed to work?
They work by making a request to an endpoint that you specify in the fly.toml config, just like the example and illustration given by @roadmr.
What are they used for?
Well, many things, but one of the major use cases is to aid your deployment strategy, since Fly provides four (4) strategies for deploying your app, as explained here: https://fly.io/docs/apps/deploy/#deployment-strategy
Health checks matter most for the bluegreen deployment strategy, which boots a new Machine alongside each currently running Machine and migrates traffic to the new Machines only once all of them pass their health checks.
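If you want that behavior, the strategy can be set in fly.toml’s [deploy] section (or passed as fly deploy --strategy bluegreen); this is a sketch, not the thread’s actual config:

```toml
[deploy]
  # bluegreen requires the app to define health checks;
  # traffic only moves to the new Machines once all of them pass
  strategy = "bluegreen"
```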
This is a working config file that has the health check included. Notice the formatting:
This is just an nginx server acting as a reverse proxy, so it starts up fast and there’s not much to go wrong. And nothing does go wrong if I deploy without the health check and then use curl to make the request described above:
No, but removing those two lines does turn the weird garbage in the logs into meaningful text, so that’s an improvement…
The nginx server is caching, and the cache is kept on a mounted volume. Looking at the logs, there seems to be a permissions problem reading the volume—but this only happens when the health check is enabled:
2023-10-04T00:04:22.369 app[3d8d9222f96328] iad [info] 172.19.7.57 - - [04/Oct/2023:00:04:22 +0000] "GET /bags/ HTTP/1.1" 404 536 "-" "Consul Health Check" "-"
2023-10-04T00:04:23.176 health[3d8d9222f96328] iad [error] Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2023-10-04T00:05:01.494 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [crit] 341#341: opendir() "/mnt/cache/lost+found" failed (13: Permission denied)
2023-10-04T00:05:01.500 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [notice] 341#341: http file cache: /mnt/cache 62.895M, bsize: 4096
2023-10-04T00:05:01.519 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [notice] 317#317: signal 17 (SIGCHLD) received from 341
2023-10-04T00:05:01.519 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [notice] 317#317: cache loader process 341 exited with code 0
2023-10-04T00:05:01.519 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [notice] 317#317: signal 29 (SIGIO) received
The significant difference in behavior between “with health check” and “without health check” is why I am trying to learn more about the details of when and how the health checks are run. Maybe they are run before volumes are properly mounted?