Traffic (still) routed to instances not passing health check

EDIT: minimal reproduction repository and logs, further asserting this is an issue with the proxy itself: Traffic (still) routed to instances not passing health check - #7 by sheerlox

Related post from 9 months ago: Traffic routed to instances not passing health check.

During startup, my Elixir API loads and processes a significant chunk of data, which takes a little more than 2 seconds on 4x shared CPU machines.

The health check function validates this process completed, which is supposed to prevent the 5 errors at the beginning of the logs below.

I’ve added a log in my backend for when the health check passes, to make sure the issue isn’t on my end (it can be seen on the last line).
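For reference, the check itself is little more than an ETS-table guard. Conceptually it looks something like this (a simplified sketch; the module and table names are placeholders rather than my exact code):

defmodule UnillWeb.HealthCheckController do
  use UnillWeb, :controller
  require Logger

  # Placeholder table list; the real check covers every table built at startup.
  @required_tables [:inverted_index]

  def show(conn, _params) do
    case Enum.find(@required_tables, &(:ets.whereis(&1) == :undefined)) do
      nil ->
        Logger.info("Healthcheck passed")
        send_resp(conn, 200, "ok")

      table ->
        Logger.error("Healthcheck failed: #{inspect({:ets_table_not_found, table})}")
        send_resp(conn, 503, "not ready")
    end
  end
end

Here are the startup logs from one machine: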

[d8dd963b929298] Running UnillWeb.Endpoint with Bandit 1.6.7 at :::8080 (http)
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " a")
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " n")
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " x")
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " o")
[d8dd963b929298] ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib 6.2) :ets.lookup(nil, " p")
[d8dd963b929298] Loaded 17769 entities from database: - 6730 skills - 6730 translations - 4309 variations
[d8dd963b929298] Built and persisted entity map for 17769 entities.
[d8dd963b929298] Healthcheck failed: {:ets_table_not_found, :inverted_index}
[d8dd963b929298] Healthcheck failed: {:ets_table_not_found, :inverted_index}
[d8dd963b929298] Built and persisted inverted index for 5611 ngrams.
[d8dd963b929298] Healthcheck passed

I would expect the first requests to be health checks, and no request to be forwarded to the instance while they aren’t passing, as mentioned in Healthcheck-based Routing:

  • the instance is still busy starting up and is not ready to serve requests

Here’s the relevant part of my config:

[http_service]
  [[http_service.checks]]
    interval = '5s'
    timeout = '1s'
    grace_period = '10s'
    method = 'GET'
    path = '/api/health_check'

Have there been any advances on this issue since then? Or maybe something I’ve missed in the configuration?

Thanks in advance!

Some more context

This is happening during an autoscaling operation while load testing my deployment. I only use a soft_limit and no hard_limit:

  [http_service.concurrency]
    type = 'requests'
    soft_limit = 1

[[vm]]
  memory = '1024mb'
  cpu_kind = 'shared'
  cpus = 4

The soft_limit is so low because the API calls are quite CPU-intensive and I’m trying to stay within the baseline even under load, keeping some CPU balance available to handle requests while the new machines start. This makes the autoscaling kick in at about 8-10 req/sec.

Would it be possible for your healthcheck to succeed even though it still has a few seconds of processing to do? Your API could return a 503 (Service Unavailable) during that time, even though the healthcheck is good.

In general, health check failures are for unhealthy systems, and they indicate to the orchestrator that a container needs to be bounced/replaced, etc. That does not seem to apply here.

It doesn’t seem to be an issue with the health check. It is independent of the startup process itself and always returns either a 200 or a 503 within milliseconds. The screenshot below shows it works as expected:

If the health check isn’t passing on a machine, any request to the actual API routes will raise an error and hence return a 500. Only the health check can send back a 503.

I’m not sure about your last statement, however. It is indeed the case that the health check might start failing again after some time (e.g. the machine state gets corrupted or the database can’t be reached), but I don’t see why it shouldn’t also be used to indicate that a system isn’t healthy yet right after the machine starts.

I’m not an expert here in terms of how Fly works. But I wonder if a container scheduler could feasibly be set to restart a container that is found to be unhealthy, since it won’t know that the container can make itself healthy on its own (in this case).

One thing you could do, if you’re willing to do some of your own orchestration, is to create a machine and not start it, cordon it, and then use a custom health-check to let your orchestrator know when the container can be uncordoned.

Thanks for trying to help @halfer. For now, the best solution is to make starting the endpoint dependent on the startup process, which wouldn’t be too hard with Elixir but would still require a bit of refactoring.
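In practice that would mean doing the blocking load in a child that starts before the endpoint in the application’s supervision tree, so Bandit doesn’t even open the port until the data is ready. A rough sketch (module names are placeholders, not my actual tree):

defmodule Unill.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      Unill.Repo,
      # Hypothetical loader whose init/1 builds the ETS tables before
      # returning, so every child below it starts only once the data exists.
      Unill.Startup.Loader,
      UnillWeb.Endpoint
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: Unill.Supervisor)
  end
end

Since a supervisor starts its children in order and waits for each start_link to return, the endpoint (and therefore the port the proxy connects to) only comes up after the loader finishes.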

I’m simply trying to point out to the Fly team that the documented (and logical) behavior of not forwarding requests to machines until their health checks are passing is not working as intended.

I have set up an all-in-one minimal reproduction repository to demonstrate the issue: GitHub - sheerlox-repros/fly_health_check

Here are the resulting logs from running the instructions in the README:

2025-02-22T20:15:06Z proxy[48ed67dc599668] cdg [info]Starting machine
2025-02-22T20:15:06Z app[48ed67dc599668] cdg [info]2025-02-22T20:15:06.547959464 [01JMQMPCFX63KNFJYGREQX0WPR:main] Running Firecracker v1.7.0
2025-02-22T20:15:07Z app[48ed67dc599668] cdg [info] INFO Starting init (commit: 67f51b8b)...
2025-02-22T20:15:07Z app[48ed67dc599668] cdg [info] INFO Preparing to run: `/app/bin/server` as nobody
2025-02-22T20:15:07Z app[48ed67dc599668] cdg [info] INFO [fly api proxy] listening at /.fly/api
2025-02-22T20:15:07Z runner[48ed67dc599668] cdg [info]Machine started in 1.304s
2025-02-22T20:15:07Z proxy[48ed67dc599668] cdg [info]machine started in 1.310374664s
2025-02-22T20:15:08Z app[48ed67dc599668] cdg [info]2025/02/22 20:15:08 INFO SSH listening listen_address=[fdaa:3:d216:a7b:16a:f8cb:a07a:2]:22
2025-02-22T20:15:10Z app[48ed67dc599668] cdg [info]20:15:10.566 [info] Startup process started
2025-02-22T20:15:10Z app[48ed67dc599668] cdg [info]20:15:10.583 [info] Running FlyHealthCheckWeb.Endpoint with Bandit 1.6.7 at :::8080 (http)
2025-02-22T20:15:10Z app[48ed67dc599668] cdg [info]20:15:10.596 [info] Access FlyHealthCheckWeb.Endpoint at https://fly-health-check.fly.dev
2025-02-22T20:15:10Z app[48ed67dc599668] cdg [info] WARN Reaped child process with pid: 690 and signal: SIGUSR1, core dumped? false
2025-02-22T20:15:10Z app[48ed67dc599668] cdg [info]20:15:10.788 [error] Healthcheck failed: :not_ready
2025-02-22T20:15:10Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.224864138s
2025-02-22T20:15:10Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.235562454s
2025-02-22T20:15:11Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.269394045s
2025-02-22T20:15:11Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.297547363s
2025-02-22T20:15:11Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.342454458s
2025-02-22T20:15:11Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.358317458s
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.009 [error] Request before health checks are passing
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.020 [error] Request before health checks are passing
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.053 [error] Request before health checks are passing
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.060 [error] Request before health checks are passing
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.072 [error] Request before health checks are passing
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.082 [error] Request before health checks are passing
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.105 [error] Request before health checks are passing
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.112 [error] Request before health checks are passing
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.124 [error] Request before health checks are passing
2025-02-22T20:15:11Z app[48ed67dc599668] cdg [info]20:15:11.126 [error] Request before health checks are passing
2025-02-22T20:15:11Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.451439049s
2025-02-22T20:15:11Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.481055955s
2025-02-22T20:15:11Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.523875611s
2025-02-22T20:15:11Z proxy[48ed67dc599668] cdg [info]machine became reachable in 3.535894472s
2025-02-22T20:15:15Z app[48ed67dc599668] cdg [info]20:15:15.566 [info] Startup process completed
2025-02-22T20:15:15Z app[48ed67dc599668] cdg [info]20:15:15.794 [info] Healthcheck passed
2025-02-22T20:15:16Z health[48ed67dc599668] cdg [info]Health check on port 8080 is now passing.

@pavel maybe you might weigh in on that?

Hey @sheerlox

Yeah. Like I described in that post, the proxy currently doesn’t check health check status if it’s starting a machine. It forwards the request to the machine as soon as it’s able to connect to it. One of the reasons for this is to make sure we don’t wait for the whole health check interval: the machine is started in the context of a request, and we don’t want to delay that request for so long. The machine may be ready to accept a request at the beginning of the interval, but the proxy would have to wait a long time (10 seconds in your case) before forwarding the request.

This only happens when the proxy is starting a machine (autostart = true). If you start a machine manually, the proxy won’t forward any requests to it until it sees it in a started state with passing health checks.
And if the health checks start failing while the machine is running, the edge proxy will respect that and stop forwarding requests to it.

One way to fix this in your case is to start listening for requests after the data is already loaded.

You can also try suspending the machines instead of stopping them, so that the data is already available when the machine is resumed by the proxy.
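That would look something like this in fly.toml (a sketch; double-check the option names against the current config reference):

[http_service]
  auto_start_machines = true
  # 'suspend' snapshots the Machine's memory state instead of fully stopping it,
  # so the loaded data is still there when the proxy resumes the Machine.
  auto_stop_machines = 'suspend'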

Thanks for the clarification.

I don’t want to mince words here, so let me be as kind and clear as possible:

You need to write this down for us (your customers). This behavior is not obvious and it is not clear.

Most importantly: it is not documented.

This (very helpful) post you’ve shared in the community forums is the only written documentation I can find on the internet that confirms, clarifies, and explains the context for this behavior.

There are at least 7 places this could be written down, and mentioning it in even 1 of them would help countless users:

  1. Fly Proxy autostop/autostart · Fly Docs
  2. How Fly Proxy routes incoming public requests
  3. How Fly Proxy routes requests over your private network (6PN)
  4. Autoscaling · Fly Docs
  5. App Config Reference (fly.toml) > http_service.checks
  6. App Config Reference (fly.toml) > services.http_checks
  7. App Config Reference (fly.toml) > The checks section

Speaking only for myself now: the pervasive, ambiguous and overloaded use of “healthy” throughout Fly Docs for routing and Fly Proxy is setting your customers up to fail.

Here are 2 unedited quotes from the Fly Proxy docs (links 2 + 3 above), emphasis mine:

When one of your app’s users makes a request to your app on Fly.io, the request lands on our closest edge server. Fly Proxy (on the edge server) checks out the details of the request, adds the appropriate headers, redirects HTTP traffic to HTTPS, and routes the request through the WireGuard tunnel backhaul to the nearest healthy worker server that hosts your app’s Machines.

When you assign a Flycast address to your app, the traffic gets routed through Fly Proxy while remaining entirely in your private network. When App 1 makes a request to App 2 in the same private network, Fly Proxy on the worker hosting App 1’s Machines checks out the details of the request, adds the appropriate headers, and routes the request through the same WireGuard tunnel backhaul to the nearest healthy worker that hosts Machines for App 2. Fly Proxy on that server sends the request to a Machine running App 2.

These docs are misleading. In some common cases they’re wrong (from above):

the proxy currently doesn’t check health check status if it’s starting a machine. It forwards the request to the machine as soon as it’s able to connect to it.

Please get this behavior documented for your customers.



For posterity, and anyone who finds this thread, if Fly Proxy autostarts a machine to service a request:

  1. It ignores all health checks for that request
  2. It will send that request to your machine as soon as the connection is accepted
  3. It will keep sending subsequent requests to that machine until it records a failed health check

I want to add that I’ve been troubleshooting a similar issue with Fly support via email and also came across this very strange & unintuitive behavior.

I was under the impression that autoscale starts a machine, waits until health checks pass, then finally routes traffic to it. In contrast, this is what Fly told me:

The proxy doesn’t pre-emptively start a machine then route to it – it’s the action of a request being routed to a stopped machine that causes it to start.

Since the machine is already in a stopped state, it won’t have any checks registered. The proxy assumes that since it’s being started, it’ll be ready eventually. Once the start up process begins, the proxy won’t route any more requests to the machine until the checks are passing. But the requests that started the machine will wait until either the machine is healthy or the request fails (the PM05 errors in your logs). If the request fails, we’ll automatically retry the request to another machine, but the requests that actually started the machine will see some additional delay.

In your specific case, it looks like it was a combination of high traffic and the long start up time that caused those connections to fail. Roughly this is what would have happened:

  1. All machines are at or over their soft limit, or at their hard limits.
  2. A burst of requests comes in within a short period
  3. The proxy opts to start a new machine to handle them
  4. All those requests are routed to the e784014b237458 machine, which begins its boot process
  5. The e784014b237458 machine is marked unhealthy. No further requests are sent to it, but the ones that led to the start continue to wait for it to start up
  6. The requests fail and are re-routed to healthy instances
  7. e784014b237458 finishes booting, passes healthchecks, and accepts new connections