Auto Scaling - at what threshold does it scale up?

I’m deploying an application to serve as a token auth / routing platform, and in testing the auto scaler seems to add a node, mostly as expected.

What I am curious about is: By what metric does the auto-scaler actually scale? Is it on a threshold of CPU usage, HTTP response code error rates, etc?

Can the thresholds be tuned?

The application I am working on can be relatively bursty, depending on external conditions, and I would prefer it to scale up sooner than it currently does so that I can reduce the number of 502 responses. In testing, I start getting gateway-unavailable responses before the next node starts to provision, and these errors seem to begin at about 100% CPU usage.

Ideally, more nodes would spin up sooner rather than later. I don’t mind the extra cost of having an extra node or two around to avoid customer complaints, and the cost gets passed through anyway.

From my understanding, the scaling and re-routing are based on the soft_limit:

  [services.concurrency]
    hard_limit = 160
    soft_limit = 100
    type = "connections"

I believe there are two types, requests and connections, but I can’t seem to find them in the documentation.
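If anyone wants to try it, the requests variant would presumably look like this — note the type value is an assumption on my part, since it isn’t documented:

  [services.concurrency]
    hard_limit = 160
    soft_limit = 100
    type = "requests"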

Interesting, I’ll give that a shot and follow up here.

Following up here, I’ve got concurrency set pretty low to get it to auto scale.

  [services.concurrency]
    hard_limit = 10
    soft_limit = 6

For others in the future:
The app is PHP-FPM based, with 3–6 workers. With 1 dedicated vCPU, I seem to exhaust CPU before reaching 12 concurrent connections, with a sustained peak of about 440 requests per second per vCPU. The auto-scaling values seem to be linear with vCPU count, so these may be a good starting point for you (your app will obviously differ).
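For concreteness, the pool sizing I’m describing looks roughly like this in the FPM pool config (www.conf) — the values are illustrative, not prescriptive:

  ; www.conf — pool sizing sketch (illustrative values)
  ; A small static pool keeps workers from over-subscribing 1 vCPU.
  pm = static
  pm.max_children = 6

With pm = static, FPM keeps a fixed number of workers running, which makes CPU saturation more predictable while you tune the concurrency limits.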

More workers are possible, and do work, but I tend to exhaust the CPU before the workers saturate well, and that slows down the TTLB (time to last byte) for clients.

Maybe try type = requests?

It’s not documented, but hopefully it works better for PHP.

Setting type = "requests" didn’t seem to do anything noticeable.

However, tuning the number of workers in the FPM pool did seem to make the scaling behaviour more apparent, which was a good finding overall.

I wonder what difference there is between connections and requests… They seem analogous (from a reverse-proxy point of view).

Requests = HTTP requests?

Connections = concurrent TCP connections?

That’s how I interpreted them :sweat_smile: