Traffic routed to instances not passing health check

Can’t seem to get Healthcheck-based Routing to work.

I have an app with nginx and php-fpm. As php-fpm takes some time to start, I’ve set up php-fpm to respond to ping requests with ping.path = /_fpm/ping, and I proxy those requests via nginx with:

location = /_fpm/ping {
  access_log off;
  include fastcgi.conf;
  fastcgi_pass 127.0.0.1:9000;
}
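
For reference, the php-fpm side of this lives in the pool config. A minimal sketch (the pool file name is an assumption; ping.response defaults to "pong" so setting it is optional):

```ini
; www.conf (or your pool file): enable php-fpm's built-in ping endpoint,
; matching the path proxied by the nginx location above
ping.path = /_fpm/ping
; optional: body returned on success, defaults to "pong"
ping.response = pong
```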

I’ve setup a health check in fly.toml like this:

[[http_service.checks]]
  method = "GET"
  path = "/_fpm/ping"
  grace_period = "3s"
  interval = "5s"
  timeout = "2s"

Health check is working correctly:

  NAME                    | STATUS  | MACHINE        | LAST UPDATED | OUTPUT
--------------------------*---------*----------------*--------------*---------
  servicecheck-00-http-80 | passing | 6e8254d5c09d87 | 4m12s ago    | pong
--------------------------*---------*----------------*--------------*---------

I have scale to 0 enabled for this app. I expect that when a new request comes in, the machine will start and the proxy will wait for the health check to pass before sending the request to it. However, it seems the request is sent to the machine right after it starts, before the health check passes, resulting in a 502. This is visible in the logs:

2024-05-28T05:49:34.305 proxy[6e8254d5c09d87] nrt [info] Starting machine
...
2024-05-28T05:49:34.741 runner[6e8254d5c09d87] nrt [info] Machine started in 433ms
2024-05-28T05:49:34.741 proxy[6e8254d5c09d87] nrt [info] machine started in 436.819157ms
2024-05-28T05:49:34.741 app[6e8254d5c09d87] nrt [info] 2024/05/28 05:49:34 INFO SSH listening listen_address=[fdaa:3:1193:a7b:2df:8580:82d:2]:22 dns_server=[fdaa::3]:53
2024-05-28T05:49:34.776 proxy[6e8254d5c09d87] nrt [info] machine became reachable in 34.410035ms
2024-05-28T05:49:34.778 app[6e8254d5c09d87] nrt [info] 2024/05/28 05:49:34 [error] 329#329: *1 connect() failed (111: Connection refused) while connecting to upstream, client: 172.16.1.154, server: _, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "***.fly.dev"
2024-05-28T05:49:34.778 app[6e8254d5c09d87] nrt [info] 172.16.1.154 - - [28/May/2024:05:49:34 +0000] "GET / HTTP/1.1" 502 157 "-" "curl/8.5.0"
2024-05-28T05:49:34.779 app[6e8254d5c09d87] nrt [info] 2024/05/28 05:49:34 [info] 329#329: *1 client 172.16.1.154 closed keepalive connection
2024-05-28T05:49:34.894 app[6e8254d5c09d87] nrt [info] [28-May-2024 05:49:34] NOTICE: fpm is running, pid 321
2024-05-28T05:49:34.894 app[6e8254d5c09d87] nrt [info] [28-May-2024 05:49:34] NOTICE: ready to handle connections
2024-05-28T05:49:39.180 app[6e8254d5c09d87] nrt [info] 127.0.0.1 - 28/May/2024:05:49:39 +0000 "GET /_fpm/ping" 200
2024-05-28T05:49:39.590 health[6e8254d5c09d87] nrt [info] Health check on port 80 is now passing.
...
2024-05-28T05:49:45.333 app[6e8254d5c09d87] nrt [info] 127.0.0.1 - 28/May/2024:05:49:45 +0000 "GET /index.php" 302
2024-05-28T05:49:45.333 app[6e8254d5c09d87] nrt [info] 172.16.1.154 - - [28/May/2024:05:49:45 +0000] "GET / HTTP/1.1" 302 5 "-" "curl/8.5.0"

As shown in the logs, the first request was sent to the machine right after "machine became reachable" and before "Health check on port 80 is now passing", thus resulting in an immediate 502. I’ve included a normal request at the end, after the health check had passed, as a reference.
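
For context, scale to zero comes from the [http_service] section of fly.toml. A minimal sketch of what I have (the internal_port value here is an assumption):

```toml
[http_service]
  internal_port = 80          # assumed; nginx listens here
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0    # lets the app scale to zero
```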

Is this the expected behavior? Did I misunderstand how healthcheck-based routing should work?


Hey @happydenn

Yeah, this looks like a problem in Fly Proxy’s logic. The proxy checks health status on the edge (to decide where to route a request) and on the worker node (to decide whether to accept the request or not). But on the worker node, health is checked before the machine is started, so once the machine has started and the proxy is able to connect to it, it forwards the request to the machine.

I’ll see if we can improve this to respect health status for such requests.


Cool. It would be great if the request were only sent to the machine after it has started and is passing health checks.

FYI, the current behavior is observed on both public requests and private requests through the Flycast hostname.

In the meantime, do you have any suggestions for routing based on successful health checks? Or should I add such a check on the application side (e.g. only start nginx after php-fpm is ready)?

Yeah, I think that will be the best solution for now to unblock you.

Cool. I’ve added a shell script that uses fcgi-probe to make sure php-fpm is accepting requests before starting nginx. The script is pasted below in case anyone needs it in the meantime:

#!/bin/sh

# Poll php-fpm's ping endpoint until it responds successfully.
# Adjust the address and path to match your app.
until fcgi-probe -s tcp://127.0.0.1:9000/_fpm/ping
do
    echo "Waiting for php-fpm..."
    sleep 1
done

# php-fpm is ready; replace this shell with nginx as PID 1.
echo "php-fpm started, starting nginx..."
exec nginx
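
In case it helps anyone adapting this: the script has to run as the container’s start command in place of launching nginx directly. A hypothetical Dockerfile wiring (file names are placeholders):

```dockerfile
# copy the wait-then-start script into the image and run it as the start command
COPY start.sh /usr/local/bin/start.sh
RUN chmod +x /usr/local/bin/start.sh
CMD ["/usr/local/bin/start.sh"]
```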

Please do consider fixing the proxy’s routing behavior for health checks so the script won’t be needed in the future, though. Thanks!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.