I’m running a service that shuts itself down after being idle for a few seconds: cold start times aren’t a big deal, and the workload for this application is very bursty. Most of the time I need no machines at all, then suddenly I need 10, and I want to shut them down again as soon as possible.
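For context, the shutdown logic is essentially a watchdog around the HTTP server, something like this (a simplified Go sketch of what my app does; the handler and exact timeout are placeholders):

```go
package main

import (
	"net/http"
	"os"
	"sync/atomic"
	"time"
)

func main() {
	var lastActivity atomic.Int64
	lastActivity.Store(time.Now().UnixNano())

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		lastActivity.Store(time.Now().UnixNano())
		defer func() { lastActivity.Store(time.Now().UnixNano()) }() // count long requests too
		// ... do the actual work ...
		w.WriteHeader(http.StatusOK)
	})

	// Watchdog: exit cleanly after a few idle seconds. Exit code 0 tells
	// the Fly runner not to restart the machine.
	go func() {
		const idleTimeout = 2 * time.Second // placeholder value
		for range time.Tick(500 * time.Millisecond) {
			if time.Since(time.Unix(0, lastActivity.Load())) > idleTimeout {
				os.Exit(0)
			}
		}
	}()

	http.ListenAndServe(":8080", nil)
}
```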
Now here’s the problem: once a machine shuts down, it takes Fly Proxy a few seconds to notice that it’s gone. A request arriving just after shutdown still gets routed to the dead machine, and when that happens the proxy retries 15 times, which takes roughly 40 seconds (see the PM05 errors below), before moving on to another machine. No requests are lost, but response times become terrible in this edge case.
What’s worse, the requests arrive in bursts, so a whole batch gets routed to the dead machine at once. I tried setting very aggressive health check thresholds to mitigate this, but that just fills the logs with warnings that my machine is unhealthy while the proxy keeps trying to reach it:
2024-11-27T03:31:02Z health[185e6d9f476408] fra [warn]Health check on port 8080 is in a 'warning' state. Your app may not be responding properly. Services exposed on ports [80, 443] may have intermittent failures until the health check passes.
x10
2024-11-27T03:31:03Z health[d8dd60dae201d8] fra [error]Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
x20
2024-11-27T03:33:05Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 18.189893117s so far)
x50
2024-11-27T03:33:25Z proxy[e286d70a9de728] fra [error][PM05] failed to connect to machine: gave up after 15 attempts (in 38.055196449s)
2024-11-27T03:33:25Z proxy[185e6d9f476408] fra [info]Starting machine
2024-11-27T03:33:25Z app[185e6d9f476408] fra [info]2024-11-27T03:33:25.419653603 [01JDNTR30BCWPPMEZ2NDB36WZC:main] Running Firecracker v1.7.0
2024-11-27T03:33:25Z proxy[e286d70a9de728] fra [error][PM05] failed to connect to machine: gave up after 15 attempts (in 38.202595242s)
x10
2024-11-27T03:33:25Z proxy[185e6d9f476408] fra [warn]Instance 185e6d9f476408 reached hard limit of 10 concurrent requests. This usually indicates your app is not responding fast enough for the traffic levels it is handling. Scaling resources, number of instances or increasing your hard limit might help.
2024-11-27T03:33:26Z app[185e6d9f476408] fra [info] INFO Starting init (commit: 0a21fde6)...
2024-11-27T03:33:26Z app[185e6d9f476408] fra [info] INFO Preparing to run: `/app/app` as root
2024-11-27T03:33:26Z app[185e6d9f476408] fra [info] INFO [fly api proxy] listening at /.fly/api
2024-11-27T03:33:26Z runner[185e6d9f476408] fra [info]Machine started in 978ms
2024-11-27T03:33:26Z proxy[185e6d9f476408] fra [info]machine started in 983.64346ms
The proxy does indeed retry connection attempts up to 15 times, but only if it’s the one starting the machine. If the proxy believes the machine is already running, it attempts to connect once; if that fails, the request is retried on a different machine (assuming there is one). So the retries you see in the logs are for requests that initiated a machine start.
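To make the decision explicit, here’s a toy model of the behavior just described (illustrative Go, not the actual proxy code):

```go
package main

import (
	"errors"
	"fmt"
)

// Toy model of the routing decision described above; not the real proxy code.
type machine struct {
	startedForThisRequest bool // did this request trigger the machine start?
	reachable             bool
}

func route(m machine) error {
	if m.startedForThisRequest {
		// The proxy initiated the start, so it keeps trying: up to 15
		// attempts with exponential backoff (the PM05 path in the logs).
		for attempt := 1; attempt <= 15; attempt++ {
			if m.reachable {
				return nil // forward the request
			}
			// sleep with backoff here (schedule shown further down)
		}
		return errors.New("PM05: gave up after 15 attempts")
	}
	// The machine was believed to be running already: a single attempt,
	// then the request is retried on a different machine (if one exists).
	if m.reachable {
		return nil
	}
	return errors.New("fail over to another machine")
}

func main() {
	fmt.Println(route(machine{startedForThisRequest: true}))  // slow PM05 path
	fmt.Println(route(machine{startedForThisRequest: false})) // quick fail-over
}
```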
Looking at the logs for e286d70a9de728:
Nov 27, 2024 @ 03:32:46.205448000 Starting machine <--- proxy initiated start
Nov 27, 2024 @ 03:32:47.245794000 machine started in 1.040293875s <--- machine has started (as reported by API), proxy starts connection attempts
Nov 27, 2024 @ 03:32:49.389519319 Shutting down due to inactivity <--- your app shuts down the machine
Nov 27, 2024 @ 03:32:53.317326000 waiting for machine to be reachable on 0.0.0.0:8080 (waited 6.071508375s so far) <--- proxy first reports that it still can't connect to the machine
So between 03:32:47.245794000 (when the machine was started) and 03:32:49.389519319 (when shutdown was initiated) the proxy couldn’t connect to the machine. Are you sure your app was listening on the socket and accepting connections?
The proxy retries connection attempts with exponential backoff, starting with a 5ms sleep and going up to 1s between attempts.
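In code, the schedule looks roughly like this (an illustrative sketch of the timing just described, not the proxy source):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// 15 attempts, sleeping between them: start at 5ms, double each time,
	// capped at 1s.
	delay := 5 * time.Millisecond
	var totalSleep time.Duration
	for attempt := 1; attempt <= 15; attempt++ {
		fmt.Printf("attempt %2d\n", attempt)
		if attempt < 15 {
			fmt.Printf("  sleep %v\n", delay)
			totalSleep += delay
			delay *= 2
			if delay > time.Second {
				delay = time.Second
			}
		}
	}
	// Comes out to ~7.3s of sleep in total; the ~38s totals in the PM05
	// errors above suggest each attempt also spends time in its own
	// connect timeout on top of the sleeps.
	fmt.Println("total sleep between attempts:", totalSleep)
}
```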
As for retries after the machine is shut down: yeah, we indeed don’t break out of the loop if the machine was first started by the proxy and then immediately shut down before the proxy had a chance to connect to it. That’s an interesting corner case; I’ll see if we can reliably detect this condition.
I see, I guess there’s a chance that my idle detection is so aggressive that it shuts down the machine before the first request even arrives. I’ll fix that corner case and report back @pavel
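Concretely, the plan is to arm the 2s idle countdown only once the first request has been served, with a longer grace period before that. Extending the watchdog sketch from my first post (the 30s value is what I ended up using):

```go
// Extends the earlier watchdog: the handler additionally does
// served.Store(true), and the idle countdown only starts after that.
var served atomic.Bool
start := time.Now()

go func() {
	const bootGrace = 30 * time.Second  // before the first request
	const idleTimeout = 2 * time.Second // after the last one
	for range time.Tick(500 * time.Millisecond) {
		if !served.Load() {
			if time.Since(start) > bootGrace {
				os.Exit(0) // booted, but nothing ever arrived
			}
			continue
		}
		if time.Since(time.Unix(0, lastActivity.Load())) > idleTimeout {
			os.Exit(0)
		}
	}
}()
```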
hm. I don’t think that’s the issue. 4 parallel requests are now all retrying multiple times before the machine decides to shut down, 30s after boot, because no request ever came in:
2024-11-27T18:49:00Z proxy[e286d70a9de728] fra [info]Starting machine
2024-11-27T18:49:00Z app[e286d70a9de728] fra [info]2024-11-27T18:49:00.340963571 [01JDQF7PQQCGFZ7A5DST9VCSW0:main] Running Firecracker v1.7.0
2024-11-27T18:49:01Z app[e286d70a9de728] fra [info] INFO Starting init (commit: 0a21fde6)...
2024-11-27T18:49:01Z app[e286d70a9de728] fra [info] INFO Preparing to run: `/app/app` as root
2024-11-27T18:49:01Z app[e286d70a9de728] fra [info] INFO [fly api proxy] listening at /.fly/api
2024-11-27T18:49:01Z runner[e286d70a9de728] fra [info]Machine started in 1.092s
2024-11-27T18:49:01Z proxy[e286d70a9de728] fra [info]machine started in 1.10053636s
2024-11-27T18:49:01Z app[e286d70a9de728] fra [info]2024/11/27 18:49:01 INFO SSH listening listen_address=[fdaa:a:cb3c:a7b:86:80da:b74b:2]:22 dns_server=[fdaa::3]:53
2024-11-27T18:49:07Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 6.082947809s so far)
2024-11-27T18:49:07Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 6.083940009s so far)
2024-11-27T18:49:07Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 6.087695212s so far)
2024-11-27T18:49:07Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 6.097858148s so far)
2024-11-27T18:49:13Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 12.64486736s so far)
2024-11-27T18:49:14Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 12.69776888s so far)
2024-11-27T18:49:14Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 12.741306559s so far)
2024-11-27T18:49:14Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 12.750916528s so far)
2024-11-27T18:49:19Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 18.249820246s so far)
2024-11-27T18:49:19Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 18.307806598s so far)
2024-11-27T18:49:19Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 18.349814731s so far)
2024-11-27T18:49:19Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 18.527954798s so far)
2024-11-27T18:49:25Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 24.161969828s so far)
2024-11-27T18:49:25Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 24.312558305s so far)
2024-11-27T18:49:25Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 24.353805215s so far)
2024-11-27T18:49:25Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 24.532281092s so far)
2024-11-27T18:49:31Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 30.166023113s so far)
2024-11-27T18:49:31Z app[e286d70a9de728] fra [info]Shutting down due to inactivity
2024-11-27T18:49:31Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 30.266841034s so far)
2024-11-27T18:49:31Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 30.31593229s so far)
2024-11-27T18:49:31Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 30.521809191s so far)
2024-11-27T18:49:32Z app[e286d70a9de728] fra [info] INFO Main child exited normally with code: 0
2024-11-27T18:49:32Z app[e286d70a9de728] fra [info] INFO Starting clean up.
2024-11-27T18:49:32Z app[e286d70a9de728] fra [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-11-27T18:49:32Z app[e286d70a9de728] fra [info][ 31.874080] reboot: Restarting system
2024-11-27T18:49:32Z runner[e286d70a9de728] fra [info]machine exited with exit code 0, not restarting
2024-11-27T18:49:37Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 36.091708612s so far)
2024-11-27T18:49:37Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 36.270986522s so far)
2024-11-27T18:49:37Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 36.319996756s so far)
2024-11-27T18:49:37Z proxy[e286d70a9de728] fra [info]waiting for machine to be reachable on 0.0.0.0:8080 (waited 36.525968849s so far)
2024-11-27T18:49:39Z proxy[e286d70a9de728] fra [error][PM05] failed to connect to machine: gave up after 15 attempts (in 38.093447809s)
2024-11-27T18:49:39Z proxy[e286d70a9de728] fra [error][PM05] failed to connect to machine: gave up after 15 attempts (in 38.271997327s)
2024-11-27T18:49:39Z proxy[e286d70a9de728] fra [error][PM05] failed to connect to machine: gave up after 15 attempts (in 38.321406833s)
2024-11-27T18:49:39Z proxy[e286d70a9de728] fra [error][PM05] failed to connect to machine: gave up after 15 attempts (in 38.526888061s)
2024-11-27T18:49:40Z proxy[e784967f67d938] fra [info]Starting machine
... and the new machine then handles the request just fine
I just deleted all my machines and spun up 10 new ones. Requests are handled fine in 90% of cases, but in 10% this exact pattern appears: the proxy seems unable to connect even though the app started just fine.
You know what, I think there’s a much simpler problem here. The server sometimes hangs, no longer responding to requests and no longer running the idle-cleanup loop. So some plain old bad coding might be at fault. I think we can consider this resolved, and the elevated error rate I saw before should be fixed by the longer idle timeout before the first request. So that was helpful intel.
After taking a break from this for a while, I’m still a little puzzled about what exactly is going on. It appears that resolving a deadlock in my application does not fully resolve the issue here.
I still see 4 out of 99 requests failing. Logs are attached below; I have a theory about what’s happening, but I still don’t understand why it’s happening.
I admit that my use case is unusual, and I understand I’m probably using the fly.io system and its proxying capabilities in unexpected ways. Still, I’d expect it to fail in repeatable and circumventable patterns, and instead I’m seeing some things I can’t quite explain. Here are some facts:
- I have 10 machines in my app.
- The app sometimes gets a burst of 100 requests that each take 5 seconds to serve.
- Once a machine has finished all its work and gets no new requests within 2 seconds, it shuts down with exit code 0. The runner notices this and logs that the machine won’t be restarted.
- In my fly.toml I have a soft limit of 8 and a hard limit of 10 (see the snippet after this list).
- I added a ping endpoint and a cron job to my machine’s boot script that pings the service every second (this does not affect the inactivity shutdown), and I can see the server responding when it’s supposed to. No more dropped requests since I resolved my bug.
- The first batch of 100 requests always succeeds.
- A second batch always sees ~4 failures.
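For reference, the relevant part of my fly.toml looks roughly like this (reconstructed from memory, so the exact shape may differ slightly):

```toml
[http_service]
  internal_port = 8080

  [http_service.concurrency]
    type = "requests"
    soft_limit = 8
    hard_limit = 10
```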
My theory is what I described in my first post: the proxy doesn’t fully register that the machine shut down and still routes traffic to it.
I’m now seeing a message in these new logs that’s a little worrying. It reads: “cannot retry/replay request because request body has been read past what we're willing to buffer”
I see 3 of those messages, but 4 failed requests.
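If I read that message right, the proxy can only replay a request whose body it still has buffered; once the body has been streamed past the buffer to a machine that then dies, the request can’t be retried and fails. One workaround might be to retry on my client side with a rewindable body. A minimal sketch (the URL, payload, and retry policy are placeholders, and it assumes the endpoint is idempotent):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

// Client-side retry with a rewindable body, so a failed attempt can be
// replayed even when the proxy can't do it for us.
func postWithRetry(url string, body []byte, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		// bytes.NewReader rebuilds the body from scratch on every attempt.
		resp, err := http.Post(url, "application/json", bytes.NewReader(body))
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a non-retryable client error
		}
		if err != nil {
			lastErr = err
		} else {
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
			lastErr = fmt.Errorf("server returned %s", resp.Status)
		}
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond) // simple linear backoff
	}
	return nil, lastErr
}

func main() {
	// Hypothetical endpoint, just for illustration.
	resp, err := postWithRetry("https://my-app.fly.dev/work", []byte(`{"job":1}`), 3)
	if err != nil {
		fmt.Println("all attempts failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```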
Overall, the only real answer here might be to switch hosting for this app and move to Lambda or a similar serverless provider better suited to bursty loads like this.
2024-11-29T20:27:27Z proxy[9185953a541248] fra [info]Starting machine
2024-11-29T20:27:27Z proxy[9185953a541248] fra [warn]Instance 9185953a541248 reached hard limit of 10 concurrent requests. This usually indicates your app is not responding fast enough for the traffic levels it is handling. Scaling resources, number of instances or increasing your hard limit might help.
2024-11-29T20:27:27Z app[9185953a541248] fra [info]2024-11-29T20:27:27.940044965 [01JDWSK3Y64QRFAWPRPVJCQFRT:main] Running Firecracker v1.7.0
2024-11-29T20:27:28Z app[9185953a541248] fra [info] INFO Starting init (commit: 0a21fde6)...
2024-11-29T20:27:28Z app[9185953a541248] fra [info] INFO Preparing to run: `/app/run.sh` as root
2024-11-29T20:27:28Z app[9185953a541248] fra [info] INFO [fly api proxy] listening at /.fly/api
2024-11-29T20:27:28Z app[9185953a541248] fra [info]Listening on 8080
2024-11-29T20:27:28Z runner[9185953a541248] fra [info]Machine started in 1.049s
2024-11-29T20:27:28Z proxy[9185953a541248] fra [info]machine started in 1.100734598s
2024-11-29T20:27:28Z app[9185953a541248] fra [info]OK: Server responded with 200
2024-11-29T20:27:28Z app[9185953a541248] fra [info]2024/11/29 20:27:28 INFO SSH listening listen_address=[fdaa:a:cb3c:a7b:c07e:1a6a:7b3a:2]:22 dns_server=[fdaa::3]:53
2024-11-29T20:27:29Z proxy[9185953a541248] fra [info]machine became reachable in 606.136994ms x10
2024-11-29T20:27:29Z app[9185953a541248] fra [info]OK: Server responded with 200 x2
2024-11-29T20:27:31Z app[9185953a541248] fra [info] (did work, success) x3
2024-11-29T20:27:32Z app[9185953a541248] fra [info]OK: Server responded with 200
2024-11-29T20:27:32Z app[9185953a541248] fra [info] (did work, success) x3
2024-11-29T20:27:32Z health[9185953a541248] fra [info]Health check on port 8080 is now passing.
2024-11-29T20:27:32Z app[9185953a541248] fra [info] (did work, success) x4
2024-11-29T20:27:33Z app[9185953a541248] fra [info]OK: Server responded with 200 x2
2024-11-29T20:27:35Z app[9185953a541248] fra [info]Shutting down due to inactivity
2024-11-29T20:27:35Z app[9185953a541248] fra [info]Done. Server has stopped.
2024-11-29T20:27:35Z app[9185953a541248] fra [info] INFO Main child exited normally with code: 0
2024-11-29T20:27:35Z app[9185953a541248] fra [info] INFO Starting clean up.
2024-11-29T20:27:35Z app[9185953a541248] fra [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-11-29T20:27:35Z app[9185953a541248] fra [info][ 7.801358] reboot: Restarting system
2024-11-29T20:27:36Z health[9185953a541248] fra [error]Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2024-11-29T20:27:36Z runner[9185953a541248] fra [info]machine exited with exit code 0, not restarting
2024-11-29T20:27:37Z proxy[9185953a541248] fra [warn]Instance 9185953a541248 reached hard limit of 10 concurrent requests. This usually indicates your app is not responding fast enough for the traffic levels it is handling. Scaling resources, number of instances or increasing your hard limit might help.
2024-11-29T20:27:38Z proxy[9185953a541248] fra [error][PA01] cannot retry/replay request because request body has been read past what we're willing to buffer x2
2024-11-29T20:27:39Z proxy[9185953a541248] fra [error][PC05] timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics) x10
2024-11-29T20:27:43Z proxy[9185953a541248] fra [info]Starting machine
2024-11-29T20:27:43Z proxy[9185953a541248] fra [warn]Instance 9185953a541248 reached hard limit of 10 concurrent requests. This usually indicates your app is not responding fast enough for the traffic levels it is handling. Scaling resources, number of instances or increasing your hard limit might help.
2024-11-29T20:27:43Z app[9185953a541248] fra [info]2024-11-29T20:27:43.494182982 [01JDWSK3Y64QRFAWPRPVJCQFRT:main] Running Firecracker v1.7.0
2024-11-29T20:27:44Z app[9185953a541248] fra [info] INFO Starting init (commit: 0a21fde6)...
2024-11-29T20:27:44Z app[9185953a541248] fra [info] INFO Preparing to run: `/app/run.sh` as root
2024-11-29T20:27:44Z app[9185953a541248] fra [info] INFO [fly api proxy] listening at /.fly/api
2024-11-29T20:27:44Z app[9185953a541248] fra [info]Listening on 8080
2024-11-29T20:27:44Z app[9185953a541248] fra [info]OK: Server responded with 200
2024-11-29T20:27:44Z runner[9185953a541248] fra [info]Machine started in 1.094s
2024-11-29T20:27:44Z app[9185953a541248] fra [info]2024/11/29 20:27:44 INFO SSH listening listen_address=[fdaa:a:cb3c:a7b:c07e:1a6a:7b3a:2]:22 dns_server=[fdaa::3]:53
2024-11-29T20:27:44Z proxy[9185953a541248] fra [info]machine started in 1.179096677s
2024-11-29T20:27:44Z proxy[9185953a541248] fra [info]machine became reachable in 5.733358ms x10
2024-11-29T20:27:45Z app[9185953a541248] fra [info] (did work, success)
2024-11-29T20:27:45Z app[9185953a541248] fra [info]OK: Server responded with 200 x2
2024-11-29T20:27:47Z app[9185953a541248] fra [info] (did work, success) x2
2024-11-29T20:27:47Z health[9185953a541248] fra [info]Health check on port 8080 is now passing.
2024-11-29T20:27:47Z app[9185953a541248] fra [info] (did work, success) x7
2024-11-29T20:27:47Z app[9185953a541248] fra [info]OK: Server responded with 200 x3
2024-11-29T20:27:50Z app[9185953a541248] fra [info]Shutting down due to inactivity
2024-11-29T20:27:50Z app[9185953a541248] fra [info]Done. Server has stopped.
2024-11-29T20:27:51Z app[9185953a541248] fra [info] INFO Main child exited normally with code: 0
2024-11-29T20:27:51Z app[9185953a541248] fra [info] INFO Starting clean up.
2024-11-29T20:27:51Z app[9185953a541248] fra [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-11-29T20:27:51Z app[9185953a541248] fra [info][ 7.801152] reboot: Restarting system
2024-11-29T20:27:52Z runner[9185953a541248] fra [info]machine exited with exit code 0, not restarting
2024-11-29T20:27:54Z health[9185953a541248] fra [error]Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
Resources are finite. I haven’t encountered the issue where the Fly proxy sends requests to a dead machine; it has always routed to an operational one.
The only time I encountered the issue you’ve described, it was due to a bad nginx server config I was using.
Just to understand your workload better: are you sending the second batch of requests while the machines are still running (i.e., within the 2-second inactivity timeout)? Or are you sending the first batch, waiting for all machines to stop, and then sending the second batch?