I’ve been running a load test for the last two weeks and noticed an issue with auto_stop_machines. The application is consistently receiving requests (~1000 requests per minute).
Constant restarts cause high tail latency.
Look what happens as soon as I set auto_stop_machines=false:
Checking on our end, in your configuration, the primary region is gru, not iad. min_machines_running keeps the specified number of machines running only in your primary region, not globally. So, machines in lhr and iad will be stopped if the proxy sees there’s no traffic to those machines at the time the downscaler runs. And since you only have 1 machine running in gru, it hasn’t been stopped at all.
However, the stopping and starting is an issue regardless. I’m looking into it at the moment. Am I right in saying the issue here is that machines are being stopped in the first place? That you’d expect that due to the consistent traffic, they’ll remain running?
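For anyone following along, here is a rough sketch of the region rule described above: min_machines_running only protects machines in the primary region, and idle machines elsewhere can be stopped. This is a simplified model for illustration, not the Fly Proxy's actual implementation. The Machine class, the machines_to_stop function, and the machine ids are all hypothetical, and the "excess capacity" case is left out.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    id: str            # hypothetical machine id, for illustration only
    region: str
    has_traffic: bool  # whether traffic is seen when the downscaler runs


def machines_to_stop(machines, primary_region, min_machines_running):
    """Return ids of running machines this simplified model would stop."""
    to_stop = []
    # Number of machines still running in the primary region.
    primary_running = sum(1 for m in machines if m.region == primary_region)
    for m in machines:
        if m.has_traffic:
            continue  # machines serving traffic are expected to stay up
        if m.region != primary_region:
            # min_machines_running does not protect non-primary regions.
            to_stop.append(m.id)
        elif primary_running > min_machines_running:
            # Idle primary-region machine above the floor: eligible to stop.
            to_stop.append(m.id)
            primary_running -= 1
    return to_stop


fleet = [
    Machine("gru-1", "gru", has_traffic=False),
    Machine("lhr-1", "lhr", has_traffic=False),
    Machine("iad-1", "iad", has_traffic=True),
]
print(machines_to_stop(fleet, primary_region="gru", min_machines_running=1))
```

With gru as the primary region, the idle gru machine is kept by the floor of 1, the idle lhr machine is stopped, and the iad machine stays up because it has traffic. The behaviour reported in this thread is different: machines with steady traffic were being stopped, which this model says should not happen.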
Yeah, I just checked the docs and it seems the machine should only stop if it has no traffic.
The current behaviour causes high tail latency / cold start issues as machines are restarted every minute or so.
auto_stop_machines: Whether to automatically stop an application’s machines when there’s excess capacity, per region. If there’s only one machine in a region, then the machine is stopped if it has no traffic. The Fly Proxy runs a process to automatically stop machines every few minutes. The default is true.
@containerops we’ve fixed a bug where we were stopping machines when we shouldn’t have. Can you try enabling autostop to see if you still experience high tail latency?
Chiming in here… We’re still seeing this issue as of Sep 18. Doesn’t make sense that 148ed has traffic and then shuts itself down, only to restart itself one second later.