I’m deploying an application to serve as a token auth / routing platform, and in testing the autoscaler adds a node mostly as expected.
What I am curious about is: by what metric does the autoscaler actually scale? Is it a threshold on CPU usage, HTTP error-response rates, something else?
Can the thresholds be tuned?
The application I am working on can be relatively bursty depending on external conditions, and I would prefer it to scale up sooner than it currently does so I can reduce the number of 502 responses. In testing, I start getting 502 Bad Gateway responses before the next node finishes provisioning, and these seem to begin at around 100% CPU usage.
Ideally, more nodes would spin up sooner rather than later. I don’t mind the extra cost of keeping a spare node or two around to avoid customer complaints, and the cost gets passed through anyway.
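Since I haven’t found a way to make the autoscaler act before CPU saturates, a rough workaround is to size the fleet for peak traffic at a reduced target utilization, so bursts land on spare capacity. A minimal sketch of that math; every number here is an assumption from my own testing, not anything the platform exposes:

```python
import math

# Assumed figures (from my own load tests; yours will differ):
PER_NODE_RPS = 440  # sustained req/s one 1-vCPU node handled before 502s
HEADROOM = 0.6      # target utilization: plan as if each node caps at 60%

def nodes_needed(peak_rps: float) -> int:
    """Nodes required to absorb peak_rps while keeping each node
    at or below the headroom target, so bursts have spare capacity."""
    return max(1, math.ceil(peak_rps / (PER_NODE_RPS * HEADROOM)))

print(nodes_needed(440))   # 2 -- one node could handle it flat out, but
                           # with 60% headroom you want a second one warm
print(nodes_needed(1200))  # 5
```

The headroom factor is the knob: lower it and you pre-provision more aggressively, which is exactly the "extra node or two" trade-off above.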
For others in the future:
The app is PHP-FPM based, with 3-6 workers. With 1 vCPU (Dedicated), I seem to exhaust the CPU before reaching 12 concurrent connections, at a sustained peak of about 440 requests per second per vCPU. Autoscaling values seem to scale linearly with vCPU, so these numbers (your app will obviously differ) may be a good starting point for you.
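For reference, the pool settings I’m describing look roughly like this. This is a sketch, not a recommendation: the `www` pool name and path are the common defaults, and the worker count is just what I landed on in my tests.

```ini
; /etc/php-fpm.d/www.conf  (path varies by distro/image)
[www]
pm = static          ; fixed worker count; avoids fork latency during bursts
pm.max_children = 6  ; 3-6 workers per vCPU worked best for me

; with pm = dynamic instead, you would also tune:
; pm.start_servers, pm.min_spare_servers, pm.max_spare_servers
```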
More workers are possible, and do work, but I tend to exhaust the CPU before the workers saturate well, which slows down the TTLB (time to last byte) for clients.
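The figures above hang together via Little’s law (concurrency ≈ throughput × mean time in system). A quick sanity check using my measured numbers:

```python
# Little's law: concurrency = throughput (req/s) * mean latency (s).
rps = 440          # sustained req/s per vCPU (measured)
concurrency = 12   # concurrent connections where the CPU tops out (measured)

mean_latency_s = concurrency / rps
print(f"{mean_latency_s * 1000:.1f} ms")  # 27.3 ms implied mean service time
```

If adding workers raises concurrency and latency without raising throughput, that is the TTLB slowdown I saw: the CPU, not the worker count, is the bottleneck, so extra workers just queue requests on-box.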