We have that limit to prevent crypto mining, mostly. But in general it’s better to run bigger VMs than more of them. This is because:
Our autoscaling is not designed for single-request concurrency.
Most apps are slow enough to boot that it’s better to run VMs that can each handle at least 10-15 concurrent requests.
For reference, one dedicated-cpu-1x instance is guaranteed about 20x the compute of a shared-cpu-1x. So if you’re running single-request concurrency on a shared-cpu-1x, you’re better off running 20 concurrent requests on a dedicated-cpu-1x.
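For context, those limits are the soft_limit / hard_limit settings in the [services.concurrency] section of fly.toml. A rough sketch of what the 20-requests-on-dedicated-cpu-1x suggestion would look like (values are illustrative, not a recommendation for any specific app):

```toml
# fly.toml (excerpt) -- illustrative values only
[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    type = "requests"   # count HTTP requests rather than TCP connections
    soft_limit = 20     # past this, the proxy prefers sending traffic elsewhere
    hard_limit = 25     # past this, requests are queued or routed to another VM
```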
I’m running with autoscaling because every request has to spawn a Chromium process, and the process can’t be shared between requests. This is a very particular setup, but the thing is the approach is working well for me!
When you say autoscaling is not designed for single-request concurrency, I suspect it’s because a newly started VM isn’t immediately ready to serve, since it needs to pull the image, pass health checks, etc.
Taking those design limitations into account, what would be a good minimum concurrency value?
For example: do you think it makes sense to set the soft limit to 1 and the hard limit to 2 to avoid VM cold starts?
As far as I know, fly.io does not keep warm instances around during scaling. Once it kills an instance, it’s gone and doesn’t get reused.
AWS Lambda’s reuse of VMs was more of an implementation detail than a specified feature, which is why they released the provisioned concurrency feature, so you can have a minimum number of warm VMs.
If you only want to process 1 request concurrently per VM, then you will definitely run into scaling issues. If you can modify your VM to run multiple Chromium processes, the autoscaling will work much better. A soft limit of 5 and a hard limit of 10 is probably the minimum for autoscaling to work well.
This is how I’d approach it too: a queue of requests that are dispatched to one or more VMs, each with a pool of Chromium processes. We’ll support scaling on custom metrics before too long; scaling Chromium VMs by queue depth sounds rad.
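Very roughly, something like this sketch, assuming a Node/TypeScript app driving Chromium through Puppeteer (your stack may differ; the names here are just for illustration):

```ts
import puppeteer, { Browser } from "puppeteer";

// A fixed pool of Chromium instances fed by a simple in-memory queue, so a
// single slow page doesn't force one-request-per-VM.
type Job = {
  url: string;
  resolve: (html: string) => void;
  reject: (err: Error) => void;
};

const queue: Job[] = [];
const idle: Browser[] = [];

// Launch the pool once, at boot.
export async function initPool(size: number): Promise<void> {
  for (let i = 0; i < size; i++) {
    idle.push(await puppeteer.launch());
  }
}

// Public entry point: queue the request and resolve when a browser frees up.
export function render(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    queue.push({ url, resolve, reject });
    void pump();
  });
}

async function pump(): Promise<void> {
  if (queue.length === 0 || idle.length === 0) return; // nothing queued, or pool busy
  const browser = idle.pop()!;
  const job = queue.shift()!;
  const page = await browser.newPage();
  try {
    await page.goto(job.url, { waitUntil: "networkidle2", timeout: 30_000 });
    job.resolve(await page.content());
  } catch (err) {
    job.reject(err as Error);
  } finally {
    await page.close();
    idle.push(browser); // return the browser to the pool
    void pump();        // pick up the next queued request, if any
  }
}
```

A pool of around 5 pairs naturally with the soft limit of 5 / hard limit of 10 suggested above.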
This is probably very specific to running a background process like Chromium, where the CPU/memory the process consumes is unpredictable and doesn’t scale linearly.
Let’s say, for example, you want to keep as many Chromium processes as there are vCPUs available.
Normally a Chromium process takes a target URL as input, and that can change its behavior widely (some URLs are fast, others have a lot of scripts, etc.).
That kind of corner case can affect the performance of in-flight requests, so to minimize performance issues it’s preferable to use a horizontal scaling scheme rather than a vertical one.
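To make that concrete, roughly what I have in mind per VM (just a sketch; the function name, timeout, and use of Puppeteer are assumptions for illustration):

```ts
import os from "node:os";
import puppeteer from "puppeteer";

// One Chromium per vCPU, e.g. as the size passed to the pool sketched above.
const POOL_SIZE = Math.max(1, os.cpus().length);

// Bound how long any single URL can hold a browser, since per-URL cost varies
// widely: fast pages finish early, script-heavy ones get cut off.
async function renderWithTimeout(url: string, timeoutMs = 15_000): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "load", timeout: timeoutMs });
    return await page.content();
  } finally {
    await browser.close();
  }
}
```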
What I want to understand (and this is probably a question for @kurt) is: what happens to incoming requests when all current VM instances reach the hard limit?
If incoming requests need to wait until a new VM spawns, that’s my big problem with this infrastructure scheme.
That’s roughly correct. When VMs are at their hard limit and new connections show up, we queue them for a while, waiting for a VM to become available. If that doesn’t happen in a reasonable amount of time, we serve a 502.
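If you want to smooth that over on the caller’s side while a new VM boots, a small retry helps. A sketch, assuming a Node 18+ (or browser) client with global fetch; the attempt count and delays are made up:

```ts
// Retry on 502, which in this setup means "every VM was at its hard limit and
// none freed up in time". A short backoff gives a freshly scaled VM time to boot.
async function fetchWithRetry(url: string, attempts = 3): Promise<Response> {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch(url);
    if (res.status !== 502 || i === attempts - 1) return res;
    await new Promise((r) => setTimeout(r, 1_000 * (i + 1))); // back off, then retry
  }
  throw new Error("unreachable");
}
```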