Machines constantly scaling up and down after enabling auto_stop_machines

Unfortunately, I don’t have a lot of time right now to describe the issue, but it’s easy to reproduce. I can add more details later.

Configuration

[[services]]
  protocol = "tcp"
  internal_port = 8080
  auto_start_machines = true
  auto_stop_machines = true
  min_machines_running = 2

Formation

I deployed the app to three regions:

  • iad (primary) – 1 machine
  • lhr – 1 machine
  • gru – 1 machine

Issue

I’ve been running a load test for the last two weeks and noticed an issue with auto_stop_machines. The application receives a steady stream of requests (~1000 requests per minute).

Constant restarts cause high tail latency.

Look what happens as soon as I set auto_stop_machines=false:


Checking on our end: in your configuration, the primary region is gru, not iad. min_machines_running keeps the specified number of machines running only in your primary region, not globally. So machines in lhr and iad will be stopped if the proxy sees no traffic to those machines at the time the downscaler runs. And since you only have 1 machine running in gru, it hasn’t been stopped at all.
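For reference, here is a rough sketch of how the config might look if iad is meant to be the primary region. It assumes the top-level primary_region key in fly.toml is what determines the primary region; the services block is copied from the original post, and the comments only restate the explanation above.

```toml
# Sketch only: assumes iad is the intended primary region.
primary_region = "iad"

[[services]]
  protocol = "tcp"
  internal_port = 8080
  auto_start_machines = true
  auto_stop_machines = true
  # Per the explanation above, this applies only to machines in
  # primary_region, not globally. With a single machine in iad,
  # machines in lhr and gru can still be stopped when the proxy
  # sees no traffic to them.
  min_machines_running = 2
```

This would not address the stop-then-immediately-start behaviour itself; it only makes the primary region match what the formation describes.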

However, the stopping and starting is an issue regardless. I’m looking into it at the moment. Am I right in saying the issue here is that machines are being stopped in the first place? That you’d expect that due to the consistent traffic, they’ll remain running?


Thank you for looking into it.

Oh, I see. That’s new to me.

Yeah, this is the issue. The machines are being stopped only to be started again in the next second.

Same issue maybe?

Yeah, I just checked the docs and it seems the machine should only stop if it has no traffic.

The current behaviour causes high tail latency / cold start issues as machines are restarted every minute or so.

auto_stop_machines: Whether to automatically stop an application’s machines when there’s excess capacity, per region. If there’s only one machine in a region, then the machine is stopped if it has no traffic. The Fly Proxy runs a process to automatically stop machines every few minutes. The default is true.

Fly Launch configuration (fly.toml) · Fly Docs


@containerops we’ve fixed a bug where we were stopping machines when we shouldn’t have. Can you try enabling autostop to see if you still experience high tail latency?

Chiming in here… We’re still seeing this issue as of Sep 18. It doesn’t make sense that 148ed has traffic and then shuts itself down, only to restart one second later.

This is resulting in terrible HTTP response times: