Auto-scaling based on response time?

Hello,

Reading about how auto-scaling currently takes a while to kick in because of the time needed to gather metrics (Autoscale doesn't seem to launch new instances - #5 by kurt) … I was thinking … what would be neat IMO is if it were instead based on time spent waiting for a request to be served.

For example, if the response does not come within e.g. 2000 ms (set in config), then the instance is under too much load, and a new instance should be started.
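Very roughly, I'm imagining something like this sketch (all names and thresholds are made up for illustration, and I'm using a p95 over a recent window rather than a single request, which is just one way to do it):

```go
package autoscale

import (
	"sort"
	"time"
)

// responseTimeout is the hypothetical per-app config value described above:
// if requests take longer than this to be served, the instance is
// considered overloaded.
const responseTimeout = 2000 * time.Millisecond

// shouldScaleUp looks at the response times observed over the last window
// and asks for a new instance when the slower requests (p95 here, as one
// possible choice) exceed the configured threshold.
func shouldScaleUp(recent []time.Duration) bool {
	if len(recent) == 0 {
		return false
	}
	sorted := append([]time.Duration(nil), recent...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	p95 := sorted[len(sorted)*95/100]
	return p95 > responseTimeout
}
```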

That is perhaps easier for a user than using concurrent connections, since personally I’ve found it hard to pick good values. If the hard limit is set too low, requests that could have been handled are dropped (until auto-scaling happens ~60 s later). But set it too high and the instance runs out of resources, so requests also fail.

Using response time would (maybe) help with those issues. Perhaps it would cause others. Just a random thought for the to-do list :slight_smile:

It’s an interesting idea, but it has a dangerous scenario: if the long response times are caused not by the application instance itself but by, e.g., an overloaded database, adding more instances will only make it worse.


We have some ideas for how to implement this. In theory, there are heuristics you could use to test the effect of adding VMs. It should be possible to try adding an extra VM if response times get slow or queues back up and then see if it helped.

It will likely be a while before we get to this, though. Once we can responsively launch new VMs at request time, I think we’ll learn a lot about how to actually make better scaling choices. This stuff is fun to tune but we’d rather you all don’t have to think about it.
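One way to read the “try adding a VM and see if it helped” heuristic is a simple probe-and-rollback loop, something like the sketch below. This is just my interpretation, not how Fly actually implements anything; the interfaces, timings, and the 10% improvement cutoff are invented:

```go
package autoscale

import "time"

// Metrics is a stand-in for whatever latency signal the scheduler
// can already observe for an app in a region.
type Metrics interface {
	P95Latency() time.Duration
}

// Scaler is a stand-in for whatever actually adds and removes VMs.
type Scaler interface {
	AddVM() error
	RemoveVM() error
}

// probeScaleUp tries adding one VM when latency is bad, waits for the
// change to take effect, and rolls it back if latency didn't improve
// meaningfully (e.g. because the bottleneck is really the database).
func probeScaleUp(m Metrics, s Scaler, slow time.Duration) error {
	before := m.P95Latency()
	if before < slow {
		return nil // nothing to do
	}
	if err := s.AddVM(); err != nil {
		return err
	}
	time.Sleep(30 * time.Second) // let traffic rebalance; value is arbitrary
	after := m.P95Latency()
	if after > before*9/10 {
		// Less than ~10% improvement: the extra VM isn't helping,
		// so undo it rather than pile more load onto a shared bottleneck.
		return s.RemoveVM()
	}
	return nil
}
```

That kind of check would also cover the overloaded-database case mentioned earlier, since the extra VM gets removed when it doesn’t move the latency.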


I suppose the ideal system would first establish a strong correlation between CPU/RAM/connection counts and response time before using one or more of them as the trigger.
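For example (my own sketch, with invented thresholds), you could compute the Pearson correlation between each candidate metric and response time over the same trailing window, and only let well-correlated metrics drive scaling:

```go
package autoscale

import "math"

// pearson returns the correlation coefficient between two equally sized
// samples, e.g. per-interval CPU usage vs. response time over one window.
func pearson(x, y []float64) float64 {
	n := float64(len(x))
	var sx, sy, sxx, syy, sxy float64
	for i := range x {
		sx += x[i]
		sy += y[i]
		sxx += x[i] * x[i]
		syy += y[i] * y[i]
		sxy += x[i] * y[i]
	}
	den := math.Sqrt(n*sxx-sx*sx) * math.Sqrt(n*syy-sy*sy)
	if den == 0 {
		return 0
	}
	return (n*sxy - sx*sy) / den
}

// usableTrigger reports whether a metric tracks response time closely
// enough to be trusted as a scaling trigger (0.8 is an arbitrary cutoff).
func usableTrigger(metric, responseTimes []float64) bool {
	return pearson(metric, responseTimes) > 0.8
}
```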


Thanks, good points from you all. It’s hard to get right for all cases.

For me, I think the ideal would be to consider each region independently, summing max disk usage, max memory, total CPU used, total ingress, and total egress across all instances in the region over the past 5 s, and then have a high watermark and a low watermark for each metric.

If the high watermark is crossed, scale up by 4x; if the low watermark is crossed, scale down by removing ceil(25%) of the instances each step. I.e. 1 → 4 → 3 → 2 → 1 (rough sketch below).

This considers load balancing across the provisioned instances a separate problem.
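Something like this, to make the watermark idea above concrete (metric names, percentages, and watermark values are all placeholders of my own, not anything that exists today):

```go
package autoscale

import "math"

// RegionMetrics aggregates the signals above across all instances in one
// region over the last 5 s window, each normalised to 0..1 of capacity.
type RegionMetrics struct {
	MaxDiskPct, MaxMemPct, CPUPct, IngressPct, EgressPct float64
}

const (
	highWatermark = 0.80 // any metric above this: scale up
	lowWatermark  = 0.30 // all metrics below this: scale down
)

// nextInstanceCount applies the 4x-up / remove-ceil-25%-down rule,
// giving the 1 -> 4 -> 3 -> 2 -> 1 sequence from the post.
func nextInstanceCount(current int, m RegionMetrics) int {
	metrics := []float64{m.MaxDiskPct, m.MaxMemPct, m.CPUPct, m.IngressPct, m.EgressPct}
	anyHigh, allLow := false, true
	for _, v := range metrics {
		if v > highWatermark {
			anyHigh = true
		}
		if v > lowWatermark {
			allLow = false
		}
	}
	switch {
	case anyHigh:
		return current * 4
	case allLow && current > 1:
		return current - int(math.Ceil(float64(current)*0.25))
	default:
		return current
	}
}
```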

Fancy bonus points for adapting the provisioning profile of each app so it can trade off CPU vs memory vs egress: “you appear egress-limited, so you get lots of smaller nodes instead of lots of medium-sized ones”, or vice versa, “we’ll swap out these instances for a single bigger one”.