Autoscaling is constantly stopping and starting instances even with an absurdly high soft_limit of 100k

I just can’t seem to figure out how to configure autoscaling correctly, or it isn’t working as expected.

I have an app with 6 machines running (4 in fra, 2 in arn). According to the App Concurrency metrics in Grafana, their concurrency is rarely higher than 5, which already seems oddly low to me, given that I know the app is serving thousands of requests at a time.

Whatever value I try for soft_limit, whether it’s 3, 100, 500 or even 100000, the autoscaler always ends up in the same loop: it stops a machine, CPU usage goes up, it starts a machine, CPU usage goes down, it stops a machine again. This cycle goes on and on and is very annoying. I’d have expected machines to stay stopped for longer, until the app’s load has actually increased and more machines are needed.

These unexpected behaviours, combined with docs where the information is spread across different pages and the introduction of CPU quotas, make me half want to migrate off Fly, if it weren’t for the flexibility and otherwise good developer experience of getting an app up and running. (Sorry for the rant, I’m just frustrated at this point.)

I’d highly appreciate it if someone could help me on this matter, since I feel like I’m too dumb to use Fly properly.
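
For reference, the relevant part of my fly.toml currently looks roughly like this (a sketch from memory; the limit values are just my latest attempt and the internal_port is made up):

```toml
[http_service]
  internal_port = 8080          # illustrative
  auto_stop_machines = "stop"   # let the proxy stop machines it considers excess capacity
  auto_start_machines = true
  min_machines_running = 0      # default; AFAIK this only applies to the primary region

  [http_service.concurrency]
    type = "requests"
    soft_limit = 750            # have also tried 3, 100, 500 and 100000
    hard_limit = 1000
```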

not an answer but perhaps some pointers:

  1. According to this reply you might want to tweak the query in Grafana to get a more representative graph of the concurrent requests. I experimented with it in one of my test apps and the difference is considerable. Whether it’s also useful… hard to say :smiley:

  2. soft_limit is “just a hint” for Fly Proxy; the proxy also considers other things, like which machine is nearest to the request :point_right: since you have machines in different regions, do you see anything on this front that might help?

Docs here specifically mention that the autostop/start decisions are made per region.

  3. AFAIK soft_limit is also used by the proxy as one of several inputs to decide whether an instance has excess capacity and should be shut down… so it could be a double-edged sword (e.g. having soft_limit=100,000 might nudge the proxy to shut a machine down if it’s only getting 20,000 requests and other machines are far from their hard_limit, if any)

Question: have you experimented with removing soft_limit completely to start from a clean baseline? It should default to 20, if I remember correctly.
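
If you do give that a try, the clean baseline would be the service section without any concurrency block at all, something like this (port is illustrative):

```toml
# no [http_service.concurrency] block at all, so the proxy falls back
# to its default soft_limit (20, if I remember correctly)
[http_service]
  internal_port = 8080
  auto_stop_machines = "stop"
  auto_start_machines = true
```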

Thanks for your help!

  1. It’s giving me results I’ve tried before. My app can handle around 1500 concurrent requests with the current machine specs, so I’ve already tried a hard_limit of 1000 and a soft_limit of 500/750, without success. Machines still keep stopping and starting at the next autoscale interval.
  2. Busy hours for my app are from 3pm till 12am in Europe, where the majority of my users are. The machines are placed in fra (Frankfurt, Germany) and arn (Stockholm, Sweden), which should already be the closest regions for those users.
  3. The 100k soft_limit was only an example/test to see whether autoscaling would stop machines and not start them again, since 100k concurrent requests is very far above the current load. The autoscaler still behaves unexpectedly, which kinda makes me think it’s actually broken.

> Question: have you experimented with removing soft_limit completely to start from a clean baseline? It should default to 20, if I remember correctly.

I have tried that as well; it changed nothing.

I might end up disabling the autoscaler and scaling manually via the API, since I have a worker app running anyway. Scale up at 3pm and scale down at 12am ¯\_(ツ)_/¯
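
Probably just a crontab on that worker along these lines (app name, counts and times are made up, and I’d still have to double-check the exact flyctl flags):

```
# hypothetical crontab on the worker (times in CET)
# placeholder deploy token so flyctl can authenticate non-interactively
FLY_API_TOKEN=...

# scale up before the busy window, back down after midnight;
# if flyctl prompts for confirmation, see `flyctl scale count --help` for a flag to skip it
0 15 * * * flyctl scale count 4 --region fra --app my-app
0 15 * * * flyctl scale count 2 --region arn --app my-app
0 0  * * * flyctl scale count 2 --region fra --app my-app
0 0  * * * flyctl scale count 1 --region arn --app my-app
```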

Funnily enough, in about half of the other threads I’ve found about autoscaler issues, the Fly team said that there was a bug and they fixed it.