I’ve noticed some unexpected behaviour concerning health checks and proxy routing when resuming from a suspended state. Here’s what seems to happen:
1. An initial HTTP request hits the proxy, causing the machine in a suspended state to resume.
2. The machine becomes reachable.
3. A health check is reported as failing, but I don’t believe it actually ran, as there are no server access logs showing it (details below).
4. Despite the failing health check, the initial request is routed to the web server and handled successfully.
5. Subsequent requests, however, are not routed to the machine. These requests (e.g. for assets such as stylesheets) stall until the next health check passes, which can be up to 15 seconds later with my check interval (sketched below).
The outcome is a web app that takes about ten seconds to render a visible page when resuming from a suspended state, even though the initial HTTP request that triggers the resume is handled almost immediately.
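For context, this is roughly the health check configuration I believe is in play. The values below are illustrative (reconstructed from memory, not pasted from my actual fly.toml):

```toml
# Assumed fly.toml check configuration (illustrative values).
# With interval = "15s", a check that fails just after resume is not
# retried for up to 15 seconds, which matches the stall in step 5.
[http_service]
  internal_port = 80

  [[http_service.checks]]
    interval = "15s"      # source of the "up to 15 seconds" wait
    timeout  = "2s"
    method   = "GET"
    path     = "/gallery" # matches the "Consul Health Check" entries in the logs
```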
Observations
- Step 3: This looks like a bug. If the health check had actually run, it would have appeared in my access logs.
- Steps 4 and 5: This raises questions about the expected behaviour when a health check fails immediately after a machine resumes. After all, the point of resuming from a suspended (rather than stopped) state is to serve requests quickly; if every resume has to wait out a full health check cycle, that largely defeats the purpose of a fast resume mechanism.
Questions
- Why do the logs indicate that the health check executed (step 3) when there is no evidence of it actually running?
- Is it expected behaviour for the initial request (step 4) to be routed successfully even though the health check is failing?
- If the initial request routing in step 4 is intended, why are subsequent requests not routed? And if it is not intended, should the health check mechanism allow much faster retries after resuming from a suspended state? Waiting up to 15 seconds for the next health check interval significantly undermines the usefulness of suspend/resume. (A possible tuning stopgap is sketched after this list.)
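On that last point, the only workaround I can see from the app side is to shorten the check interval, which merely narrows the stall window rather than fixing the routing behaviour. A minimal sketch, assuming the standard `[[http_service.checks]]` options (values illustrative, untested):

```toml
# Possible stopgap: retry failed checks sooner after a resume by
# shortening the interval. This shrinks the worst-case stall from
# ~15s to ~5s but does not address the underlying routing issue.
[[http_service.checks]]
  interval     = "5s"
  timeout      = "2s"
  grace_period = "5s"   # assuming the grace period also applies after resume
  method       = "GET"
  path         = "/gallery"
```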
Supporting Logs
Below are the logs demonstrating the issue.
Fly.io logs
2025-01-02T22:47:51.093 app[5683049c50ee38] lhr [info] 172.19.6.217 - - [02/Jan/2025:22:47:51 +0000] "GET /gallery HTTP/1.1" 200 22002 "-" "Consul Health Check" "-"
2025-01-02T22:47:57.860 app[5683049c50ee38] lhr [info] Virtual machine has been suspended
2025-01-02T22:48:33.454 proxy[5683049c50ee38] lhr [info] Starting machine
2025-01-02T22:48:33.575 app[5683049c50ee38] lhr [info] 2025-01-02T22:48:33.575340181 [01JGMFZE6EYMHV9AWW408ARA8D:main] Running Firecracker v1.7.0
2025-01-02T22:48:33.665 runner[5683049c50ee38] lhr [info] Machine started in 203ms
2025-01-02T22:48:33.674 proxy[5683049c50ee38] lhr [info] machine became reachable in 7.308649ms
2025-01-02T22:48:33.826 health[5683049c50ee38] lhr [error] Health check on port 80 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2025-01-02T22:48:39.571 health[5683049c50ee38] lhr [info] Health check on port 80 is now passing.
Nginx access logs
172.19.6.217 - - [02/Jan/2025:22:47:51 +0000] "GET /gallery HTTP/1.1" 200 22002 "-" "Consul Health Check" "-"
172.16.6.218 - - [02/Jan/2025:22:47:56 +0000] "GET /gallery HTTP/1.1" 200 71764 "-" "curl/7.88.1" "2a00:23c7:a1b9:7201:4a15:349d:6f3c:9adc, 2a09:8280:1::3b:314d:0"
172.19.6.217 - - [02/Jan/2025:22:48:38 +0000] "GET /gallery HTTP/1.1" 200 22002 "-" "Consul Health Check" "-"
We observe that the final health check occurs at 22:47:51, before the machine is suspended. According to the Fly.io logs, after the machine resumes at 22:48:33, a health check is marked as failing. However, this failing health check does not appear in the Nginx access logs as the next logged request. Instead, the next request is from a curl command, which is the HTTP request that triggered the machine’s resume (the timestamp discrepancy for the curl request is due to the machine’s clock taking a few seconds to update after resuming, so it can be ignored). Following this, we see a passing health check recorded at 22:48:38, which aligns with the Fly.io logs. This suggests that while the Fly.io logs record a failing health check at 22:48:33, there is no evidence of that request ever reaching the server. This discrepancy leads me to believe that the failing health check was either not executed properly or skipped entirely.
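For anyone who wants to reproduce this, something like the following should do it; the hostname is a placeholder, and the machine can either be suspended manually or left to auto-suspend:

```sh
# Suspend the machine (ID taken from the logs above), then time an
# initial request (which triggers the resume) and an immediate
# follow-up request that simulates a browser fetching an asset.
fly machine suspend 5683049c50ee38

curl -s -o /dev/null -w 'initial:   %{time_total}s\n' https://<app>.fly.dev/gallery
curl -s -o /dev/null -w 'follow-up: %{time_total}s\n' https://<app>.fly.dev/gallery
```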