Auto Scaling - The threshold for when to scale up

I’m deploying an application to serve as a token auth / routing platform, and in testing the auto scaler seems to add a node, mostly as expected.

What I am curious about is: by what metric does the auto scaler actually scale? Is it a threshold on CPU usage, HTTP error response rates, or something else?

Can the thresholds be tuned?

The application I am working on can be relatively bursty, based on external conditions, and I would prefer it to scale up sooner than it currently does so that I can reduce the number of 502 responses. In testing, I start to get these gateway errors before the next node starts to provision, and they seem to start at about 100% CPU usage.

Ideally, more nodes would spin up sooner rather than later. I don’t mind the cost of having an extra node or two around to avoid customer complaints, and the cost gets passed through anyway.

From my understanding, scaling and re-routing are based on the soft limit:

    hard_limit = 160
    soft_limit = 100
    type = "connections"

I believe there are two types, requests and connections, but I can’t seem to find this in the documentation.
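
One way to picture the two limits is as bands of load on a single instance. The sketch below is only an illustration of that reading: the function name and the band labels are made up, and the behavior attached to the hard limit is an assumption, since this thread only says that scaling and re-routing key off the soft limit.

```python
def classify_load(current: int, soft_limit: int, hard_limit: int) -> str:
    """Classify one instance's current load against the two limits.

    Hypothetical helper: the labels are illustrative, not platform terms.
    """
    if soft_limit > hard_limit:
        raise ValueError("soft_limit should not exceed hard_limit")
    if current >= hard_limit:
        # Assumption: at the hard limit the instance takes no more traffic.
        return "at_hard_limit"
    if current >= soft_limit:
        # Per this thread: the point where scaling and re-routing start.
        return "over_soft_limit"
    return "below_soft_limit"


# Using the values from the snippet above (soft 100, hard 160).
for load in (50, 100, 160):
    print(load, classify_load(load, soft_limit=100, hard_limit=160))
```

If this reading is right, the gap between the two values is the headroom an instance has while a new one is still provisioning.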

Interesting, I’ll give that a shot and follow up here.

Following up here, I’ve got concurrency set pretty low to get it to auto scale.

    hard_limit = 10
    soft_limit = 6

For others in the future:
The app is PHP-FPM based, with 3-6 workers. With 1 vCPU (Dedicated), I seem to exhaust the CPU before getting to 12 concurrent connections, with a sustained peak of about 440 requests per second per vCPU. The auto scaling values seem to scale linearly with vCPU count, so these may be a good starting point for you, though the right numbers will obviously depend on your app.

Running more workers is possible and does work, but I tend to exhaust the CPU before the workers are fully saturated, and this slows down the time to last byte (TTLB) for clients.
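
As a rough sanity check on the numbers above, Little's law (average concurrency = throughput x average time in the system) gives an idea of the per-request time they imply, and the linear-with-vCPU observation can be turned into a small helper. Both function names are hypothetical, and the per-vCPU defaults are simply the 6 and 10 from this post, not recommended values.

```python
def average_latency_ms(concurrency: float, requests_per_second: float) -> float:
    """Little's law rearranged: latency = concurrency / throughput."""
    return concurrency / requests_per_second * 1000.0


def scaled_limits(vcpus: int, soft_per_vcpu: int = 6, hard_per_vcpu: int = 10) -> dict:
    """Scale the per-vCPU limits linearly, as observed in this thread."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return {
        "soft_limit": soft_per_vcpu * vcpus,
        "hard_limit": hard_per_vcpu * vcpus,
    }


# 12 concurrent connections at about 440 requests per second (an upper bound,
# since the CPU is exhausted before reaching 12).
print(round(average_latency_ms(12, 440), 1), "ms per request, at most")

# Starting points for a 2 vCPU instance, assuming the linear relationship holds.
print(scaled_limits(2))
```

If your own requests take much longer than that, the same throughput needs proportionally more concurrency, so the limits would need to be higher for a similar CPU load.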

Maybe try type = requests?

It’s not documented, but hopefully it works better for PHP.

Setting type = requests didn’t seem to do anything noticeable.

However, tuning the number of workers in the FPM pool did seem to make the scaling more apparent, which was a good finding overall.

I wonder what difference there is between connections and requests… They seem analogous (from a reverse-proxy point of view).

Requests = HTTPS requests?

Connections = concurrent TCP connections?

That’s how I interpreted them :sweat_smile:
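
Under that interpretation the two counts can differ quite a bit when clients use keep-alive, because an open but idle TCP connection still counts as a connection while carrying no in-flight request. The sketch below only illustrates that general HTTP behavior with made-up timings; how the platform actually counts either type is not confirmed anywhere in this thread.

```python
def peak_concurrency(intervals):
    """Return the maximum number of overlapping (start, end) intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # something opens
        events.append((end, -1))    # something closes
    # At equal timestamps, process closes (-1) before opens (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical timeline in seconds: three keep-alive connections stay open
# for 10 s each, but each only carries brief requests.
connections = [(0, 10), (0, 10), (0, 10)]
requests = [(0, 1), (5, 6), (2, 3), (7, 8)]

print("peak concurrent connections:", peak_concurrency(connections))
print("peak in-flight requests:", peak_concurrency(requests))
```

With timings like these, a connections-based limit would see three times the load that a requests-based one does, which may be why the choice of type matters for some apps and not others.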