Working example of http_service.checks?

I am baffled by the http_service.checks functionality in fly.toml. I have several apps running with no issues when I don’t define any http_service.checks. When I do define a check, even with excessively long timeouts, the check invariably fails, with no indication of what went wrong. The logs show a bunch of unreadable garbage, e.g.:

2023-10-03T12:38:10.580 app[3d8d9222f96328] iad [info] 172.19.7.57 - - [03/Oct/2023:12:38:10 +0000] "\x16\x03\x01\x01\x08\x01\x00\x01\x04\x03\x03\xF3\xE9d0\xFD\xFF\x0C\xABiO\xA2YpM\xDC\xBA\x84\xA7\xC2\xCB\xEA\xA5\xD9\xB8\xF2w\x7F\x81\xC2\x00\x16V \xA6|" 400 157 "-" "-" "-"
2023-10-03T12:38:40.582 app[3d8d9222f96328] iad [info] 172.19.7.57 - - [03/Oct/2023:12:38:40 +0000] "\x16\x03\x01\x01\x08\x01\x00\x01\x04\x03\x03\xBD\xA5\xD3\xECz\xAD,\x9D3^EDvQ\x83\xB1rt\x9F\x9F\x88\xDE\x8E\xF9\xAF1aSb\x89\xFFu \x0E\xF1^\x072\x80\xB2Bqk\x8E\xA7z\x12}\x82N\x11\xCB@\xD9\xED\xC0\xEC\xF9L\xDF8\x12x\xE1\x12\x00&\xC0+\xC0/\xC0,\xC00\xCC\xA9\xCC\xA8\xC0\x09\xC0\x13\xC0" 400 157 "-" "-" "-"
2023-10-03T12:39:10.584 app[3d8d9222f96328] iad [info] 172.19.7.57 - - [03/Oct/2023:12:39:10 +0000] "\x16\x03\x01\x01\x08\x01\x00\x01\x04\x03\x03\xD0\xDC\xF4\x951\xF0\x95e\x8A\x90\xC0V\x1Ewe`Z\xFD\xD5\xDC{\xECc\x10\xEF:q1]\xEE\x17^ \xCA\xC0\xA4\xB1\xDA\xB3\x0CI\xAC\xB2\x85\x08\xA6I\xB1\xCE\xE0\xE6[\xA5p\xC5\xBD,\xF8\xD6\xFD\x1A?\xA9\x06*\x00&\xC0+\xC0/\xC0,\xC00\xCC\xA9\xCC\xA8\xC0\x09\xC0\x13\xC0" 400 157 "-" "-" "-"
2023-10-03T12:39:40.585 app[3d8d9222f96328] iad [info] 172.19.7.57 - - [03/Oct/2023:12:39:40 +0000] "\x16\x03\x01\x01\x08\x01\x00\x01\x04\x03\x03\xFB\xCDz\xA8\x11\x9A\xCD\x02JV\xB8\xEF\xE4A\xF9\xE9`H\xD4\xCE\x0B}k^T\x971\xD9\xB3\x14\xD6\xA2 \x98+\xB1=\x07j{" 400 157 "-" "-" "-"
2023-10-03T12:40:10.586 app[3d8d9222f96328] iad [info] 172.19.7.57 - - [03/Oct/2023:12:40:10 +0000] "\x16\x03\x01\x01\x08\x01\x00\x01\x04\x03\x03j\xA2\x84KY@\xE7k\x02&M}\xA5E\xBC\xCB\x1D/\xA3!.P\x9E\xAD\x22\x05\x98\xEA\x0E\x8Fv, @\x01%$\x03\x9C\xCBsr\xDB\xF8\xA4l\xDDg\x06\xEF\x84\xD4\xFD\x06\xC8t\x95\xE7\x09{\x98\x9ERZ\x90\x00&\xC0+\xC0/\xC0,\xC00\xCC\xA9\xCC\xA8\xC0\x09\xC0\x13\xC0" 400 157 "-" "-" "-"
2023-10-03T12:40:40.588 app[3d8d9222f96328] iad [info] 172.19.7.57 - - [03/Oct/2023:12:40:40 +0000] "\x16\x03\x01\x01\x08\x01\x00\x01\x04\x03\x03\xAB\x10\xAE\x9Bl\x8AX:\xBB\x8E\xF2)\xE0\x84\x02\xC1\x80\xE4\xB7\xE7\xC5\x08\xD0\xA8PsF\x1E\xAC\xD3t\x9F \x87\xB4E\x1F\x83\xE1\x95\xE8/\x1Ch2\x0B_v\xA0\xB1\xF1\xE3\xBB7o\xE5\xEE\xF7[\x9F\xAE\xAF|6\xC2\x00&\xC0+\xC0/\xC0,\xC00\xCC\xA9\xCC\xA8\xC0\x09\xC0\x13\xC0" 400 157 "-" "-" "-"

Removing the health check and re-deploying fixes everything.

It leaves me feeling that

  1. I do not understand how Fly health checks are supposed to work, and
  2. I do not understand what Fly health checks are supposed to be used for.

I am looking for a replacement for the Rackspace monitoring alerts that I used when I hosted my apps there. Those worked great!

Hi!

Can you share how you defined the http_service.checks in your fly.toml if the below doesn’t help you get it working?

Here’s a working example:

  [[http_service.checks]]
  grace_period = "10s"
  interval = "1m0s"
  method = "GET"
  path = "/check/healthy"
  timeout = "5s"

This check hits the /check/healthy endpoint in the application every minute. If it gets a success code (2xx), the machine is considered healthy and traffic keeps being routed to it. Otherwise, the failure is logged and traffic stops being sent to this machine, possibly being routed to other available machines in the same application. In that sense it’s somewhat similar to a Kubernetes readiness probe.

“timeout” indicates how long to wait to get a successful response.

“grace_period” indicates how long to wait after the application has just started before starting to perform health checks. This allows your application’s startup to complete and for it to be ready to serve requests.

There are more settings you can tweak; they are explained here.
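The pass/fail rule described above can be sketched in a few lines of Python. This is only a simplified model of the behaviour described in this post, not Fly's actual implementation; the names CheckConfig, check_passes and in_grace_period are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckConfig:
    # Values mirror the example check above, expressed in seconds.
    grace_period: float = 10.0  # wait this long after start before checking
    interval: float = 60.0      # time between checks
    timeout: float = 5.0        # how long to wait for a response


def check_passes(status: Optional[int], elapsed: float, cfg: CheckConfig) -> bool:
    """Return True only for a 2xx response received within the timeout.

    `status` is None when no response arrived at all.
    """
    if status is None or elapsed > cfg.timeout:
        return False
    return 200 <= status < 300


def in_grace_period(seconds_since_start: float, cfg: CheckConfig) -> bool:
    """Return True while the app is still inside its startup grace period."""
    return seconds_since_start < cfg.grace_period


cfg = CheckConfig()
print(check_passes(200, 0.2, cfg))  # True: 2xx within the timeout
print(check_passes(404, 0.2, cfg))  # False: not a 2xx status
print(check_passes(200, 7.5, cfg))  # False: slower than the 5s timeout
```

Note that under this rule a 3xx, 4xx or 5xx response fails the check even when the app answers quickly, so the path has to return a 2xx directly.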


What is the use case of http service check?
An http_service check is simply a way to check the health of your app. It lets you know whether your app is in good shape.

How are they supposed to work?
A check works by making a request to an endpoint that you specify in the fly.toml config, just like the example and explanation given by @roadmr.

What are they used for?
Many things, but one of the major use cases is to support your deployment strategy. Fly provides four deployment strategies, as explained here: https://fly.io/docs/apps/deploy/#deployment-strategy
Health checks matter most for the bluegreen deployment strategy, which boots a new Machine alongside each currently running Machine and migrates traffic to the new Machines only once all of them pass health checks.

This is a working config file that includes a health check.
Notice the formatting:

app = "xxxxxx-name"
primary_region = "atl"

[deploy]
strategy = "bluegreen"  # Change this to your preferred strategy

[http_service]
  internal_port = 443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ["app"]
  [[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  timeout = "5s"
  path = "/path-to-route-that-returns-200-and-an-OK"

I hope this helps a bit

Here’s how I had the check defined:

[[http_service.checks]]
  grace_period = "10s"
  interval = "30s"
  method = "GET"
  timeout = "5s"
  path = "/bags/"
  protocol = "https"
  tls_server_name = "data.staging.perio.do"

This is just an nginx server acting as a reverse proxy, so it starts up fast and there’s not much to go wrong. And nothing does go wrong if I deploy without the health check and then use curl to make the request described above:

$ curl -i https://data.staging.perio.do/bags/
HTTP/2 200
server: Fly/e440b950 (2023-09-20)
date: Tue, 03 Oct 2023 23:16:57 GMT
content-type: application/json
content-length: 3
x-cache-status: MISS
x-periodo-server-version: 1.0.0-5-ga495be9
access-control-allow-origin: *
access-control-allow-headers: If-Modified-Since, Authorization, Content-Type
access-control-expose-headers: Last-Modified, Location, Link, X-Total-Count, X-PeriodO-Server-Version
access-control-allow-methods: GET, POST, PATCH, HEAD, OPTIONS
via: 1.0 fly.io, 2 fly.io
fly-request-id: 01HBVX90RPY1NSB9P1RRAN0PNG-iad

[]

Plus, those weird log messages only appear when the health check is enabled.

All indications are that I am operating on incorrect assumptions about the environment within which health checks run—but I don’t know how to correct them.

Does it work if you remove these two lines?

protocol = "https"
tls_server_name = "data.staging.perio.do"

No, but removing those two lines does turn the weird garbage in the logs into meaningful text, so that’s an improvement…
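One plausible reading of the earlier unreadable log entries: each logged request starts with the bytes \x16\x03\x01, which is how a TLS handshake record begins. That would suggest the check was opening a TLS connection to a port where nginx was serving plain HTTP, and nginx answered the unparseable request with a 400. This is an inference from the log bytes, not something confirmed in this thread, but it is consistent with the entries becoming readable once protocol = "https" was removed. A minimal Python sketch of the byte test follows; the function names are hypothetical.

```python
def looks_like_tls_handshake(data: bytes) -> bool:
    """Return True if data starts like a TLS handshake record.

    A TLS record begins with a content-type byte (0x16 means handshake)
    followed by a two-byte protocol version whose major byte is 0x03.
    """
    return len(data) >= 3 and data[0] == 0x16 and data[1] == 0x03


def looks_like_client_hello(data: bytes) -> bool:
    """Return True if the handshake message inside the record is a ClientHello.

    After the 5-byte record header, the first handshake byte is the
    message type; 0x01 means ClientHello.
    """
    return looks_like_tls_handshake(data) and len(data) >= 6 and data[5] == 0x01


# Leading bytes copied from the "request" field of the log lines above.
logged = b"\x16\x03\x01\x01\x08\x01\x00\x01\x04\x03\x03"

print(looks_like_tls_handshake(logged))                  # True
print(looks_like_client_hello(logged))                   # True
print(looks_like_tls_handshake(b"GET /bags/ HTTP/1.1"))  # False
```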

The nginx server is caching, and the cache is kept on a mounted volume. Looking at the logs, there seems to be a permissions problem reading the volume—but this only happens when the health check is enabled:

2023-10-04T00:04:22.369 app[3d8d9222f96328] iad [info] 172.19.7.57 - - [04/Oct/2023:00:04:22 +0000] "GET /bags/ HTTP/1.1" 404 536 "-" "Consul Health Check" "-"
2023-10-04T00:04:23.176 health[3d8d9222f96328] iad [error] Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2023-10-04T00:05:01.494 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [crit] 341#341: opendir() "/mnt/cache/lost+found" failed (13: Permission denied)
2023-10-04T00:05:01.500 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [notice] 341#341: http file cache: /mnt/cache 62.895M, bsize: 4096
2023-10-04T00:05:01.519 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [notice] 317#317: signal 17 (SIGCHLD) received from 341
2023-10-04T00:05:01.519 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [notice] 317#317: cache loader process 341 exited with code 0
2023-10-04T00:05:01.519 app[3d8d9222f96328] iad [info] 2023/10/04 00:05:01 [notice] 317#317: signal 29 (SIGIO) received

Health checks need a 2xx response code to pass. The first line of logs says:

2023-10-04T00:04:22.369 app[3d8d9222f96328] iad [info] 172.19.7.57 - - [04/Oct/2023:00:04:22 +0000] "GET /bags/ HTTP/1.1" 404 536 "-" "Consul Health Check" "-"

It looks like the /bags/ endpoint is returning a 404. Can you confirm whether /bags/ is supposed to be a valid URL?
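To separate health check requests from ordinary traffic in the app logs, one option is to parse each access-log entry and filter on the "Consul Health Check" user agent seen above. The sketch below assumes the log layout shown in this thread (address, timestamp, quoted request, status, size, referer, user agent); the regular expression and function names are illustrative, not part of any Fly tooling.

```python
import re
from typing import Optional

# Matches the nginx access-log portion of a Fly log line as it appears above.
ACCESS_RE = re.compile(
    r'(?P<addr>\S+) - - \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line: str) -> Optional[dict]:
    """Extract request, status and user agent, or None if the line is not an access entry."""
    m = ACCESS_RE.search(line)
    if m is None:
        return None
    return {
        "request": m["request"],
        "status": int(m["status"]),
        "agent": m["agent"],
    }


def is_failed_health_check(line: str) -> bool:
    """Return True for a health check request that did not get a 2xx status."""
    entry = parse_access_line(line)
    if entry is None or entry["agent"] != "Consul Health Check":
        return False
    return not (200 <= entry["status"] < 300)


line = (
    '2023-10-04T00:04:22.369 app[3d8d9222f96328] iad [info] 172.19.7.57 - - '
    '[04/Oct/2023:00:04:22 +0000] "GET /bags/ HTTP/1.1" 404 536 "-" '
    '"Consul Health Check" "-"'
)
print(parse_access_line(line))
print(is_failed_health_check(line))  # True
```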

Yes, the endpoint returns 200:

The significant difference in behavior between “with health check” and “without health check” is why I am trying to learn more about the details of when and how the health checks are run. Maybe they are run before volumes are properly mounted? :man_shrugging:
