I’ve noticed some unexpected behaviour concerning health checks and proxy routing when resuming from a suspended state. Here’s what seems to happen:
1. An initial HTTP request hits the proxy, causing the machine in a suspended state to resume.
2. The machine becomes reachable.
3. A health check is reported as failing, but I don’t believe it actually ran, as there are no server access logs showing it (details below).
4. Despite the failing health check, the initial request is routed to the web server and handled successfully.
5. Subsequent requests, however, are not routed to the machine. These requests (e.g. for assets such as stylesheets) stall until the next health check passes, which can be up to 15 seconds later with my check interval (sketched below).
The outcome is a web app that takes about ten seconds to render a visible page when resuming from a suspended state, even though the initial HTTP request that triggers the resume is handled almost immediately.
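For context, this is roughly the health check configuration I believe is in play. The values below are illustrative (reconstructed from memory, not pasted from my actual fly.toml):

```toml
# Assumed fly.toml check configuration (illustrative values).
# With interval = "15s", a check that fails just after resume is not
# retried for up to 15 seconds, which matches the stall in step 5.
[http_service]
  internal_port = 80

  [[http_service.checks]]
    interval = "15s"      # source of the "up to 15 seconds" wait
    timeout  = "2s"
    method   = "GET"
    path     = "/gallery" # matches the "Consul Health Check" entries in the logs
```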
Observations
- Step 3: This looks like a bug. If the health check had actually run, it would have appeared in my access logs.
- Steps 4 and 5: This raises questions about the expected behaviour when a health check fails immediately after a machine resumes. After all, the point of resuming from a suspended (rather than stopped) state is to serve requests quickly; if every resume has to wait out a full health check cycle, that largely defeats the purpose of a fast resume mechanism.
Questions
- Why do the logs indicate that the health check executed (step 3) when there is no evidence of it actually running?
- Is it expected behaviour for the initial request (step 4) to be routed successfully even though the health check is failing?
- If the initial request routing in step 4 is intended, why are subsequent requests not routed? And if it is not intended, should the health check mechanism allow much faster retries after resuming from a suspended state? Waiting up to 15 seconds for the next health check interval significantly undermines the usefulness of suspend/resume. (A possible tuning stopgap is sketched after this list.)
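On that last point, the only workaround I can see from the app side is to shorten the check interval, which merely narrows the stall window rather than fixing the routing behaviour. A minimal sketch, assuming the standard `[[http_service.checks]]` options (values illustrative, untested):

```toml
# Possible stopgap: retry failed checks sooner after a resume by
# shortening the interval. This shrinks the worst-case stall from
# ~15s to ~5s but does not address the underlying routing issue.
[[http_service.checks]]
  interval     = "5s"
  timeout      = "2s"
  grace_period = "5s"   # assuming the grace period also applies after resume
  method       = "GET"
  path         = "/gallery"
```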
Supporting Logs
Below are the logs demonstrating the issue.
Fly.io logs
2025-01-02T22:47:51.093 app[5683049c50ee38] lhr [info] 172.19.6.217 - - [02/Jan/2025:22:47:51 +0000] "GET /gallery HTTP/1.1" 200 22002 "-" "Consul Health Check" "-"
2025-01-02T22:47:57.860 app[5683049c50ee38] lhr [info] Virtual machine has been suspended
2025-01-02T22:48:33.454 proxy[5683049c50ee38] lhr [info] Starting machine
2025-01-02T22:48:33.575 app[5683049c50ee38] lhr [info] 2025-01-02T22:48:33.575340181 [01JGMFZE6EYMHV9AWW408ARA8D:main] Running Firecracker v1.7.0
2025-01-02T22:48:33.665 runner[5683049c50ee38] lhr [info] Machine started in 203ms
2025-01-02T22:48:33.674 proxy[5683049c50ee38] lhr [info] machine became reachable in 7.308649ms
2025-01-02T22:48:33.826 health[5683049c50ee38] lhr [error] Health check on port 80 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2025-01-02T22:48:39.571 health[5683049c50ee38] lhr [info] Health check on port 80 is now passing.
Nginx access logs
172.19.6.217 - - [02/Jan/2025:22:47:51 +0000] "GET /gallery HTTP/1.1" 200 22002 "-" "Consul Health Check" "-"
172.16.6.218 - - [02/Jan/2025:22:47:56 +0000] "GET /gallery HTTP/1.1" 200 71764 "-" "curl/7.88.1" "2a00:23c7:a1b9:7201:4a15:349d:6f3c:9adc, 2a09:8280:1::3b:314d:0"
172.19.6.217 - - [02/Jan/2025:22:48:38 +0000] "GET /gallery HTTP/1.1" 200 22002 "-" "Consul Health Check" "-"
We observe that the final health check occurs at 22:47:51, before the machine is suspended. According to the Fly.io logs, after the machine resumes at 22:48:33, a health check is marked as failing. However, this failing health check does not appear in the Nginx access logs as the next logged request. Instead, the next request is from a curl command, which is the HTTP request that triggered the machine’s resume (the timestamp discrepancy for the curl request is due to the machine’s clock taking a few seconds to update after resuming, so it can be ignored). Following this, we see a passing health check recorded at 22:48:38, which aligns with the Fly.io logs. This suggests that while the Fly.io logs record a failing health check at 22:48:33, there is no evidence of that request ever reaching the server. This discrepancy leads me to believe that the failing health check was either not executed properly or skipped entirely.
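For anyone who wants to reproduce this, something like the following should do it; the hostname is a placeholder, and the machine can either be suspended manually or left to auto-suspend:

```sh
# Suspend the machine (ID taken from the logs above), then time an
# initial request (which triggers the resume) and an immediate
# follow-up request that simulates a browser fetching an asset.
fly machine suspend 5683049c50ee38

curl -s -o /dev/null -w 'initial:   %{time_total}s\n' https://<app>.fly.dev/gallery
curl -s -o /dev/null -w 'follow-up: %{time_total}s\n' https://<app>.fly.dev/gallery
```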