autoscale not scaling

Running a load test for 5 minutes does not trigger autoscaling, even with multiple warnings from the proxy:

2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
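For context, the warnings above correspond to the `hard_limit = 25` in the concurrency config below. A minimal sketch of how a per-instance hard/soft limit behaves (illustrative model only, not Fly's actual proxy code):

```python
# Minimal model of a per-instance connection limit (illustrative only).
# Once an instance holds `hard_limit` concurrent connections, further
# connections are refused -- which is when the proxy logs the
# "reached connections hard limit" warning and autoscaling should kick in.

class Instance:
    def __init__(self, hard_limit=25, soft_limit=20):
        self.hard_limit = hard_limit
        self.soft_limit = soft_limit
        self.active = 0

    def try_connect(self):
        """Accept a connection unless the hard limit is reached."""
        if self.active >= self.hard_limit:
            return False  # refused: "reached connections hard limit"
        self.active += 1
        return True

    def over_soft_limit(self):
        """Above the soft limit the proxy prefers other instances."""
        return self.active >= self.soft_limit

inst = Instance()
results = [inst.try_connect() for _ in range(30)]
print(results.count(True))     # 25 -- connections beyond 25 are refused
print(inst.over_soft_limit())  # True
```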


Going through the previous threads about autoscale not scaling, which state:

  1. Autoscale doesn’t work with multi-process. Autoscale doesn't work - #3 by kurt

  2. Even then, autoscale doesn’t work when you have a processes section. How to disable count scaling, to enable autoscaling? - #2 by amithm7

  3. Autoscale doesn’t work when you have max-per-region How to disable count scaling, to enable autoscaling? - #4 by amithm7

  4. Autoscale could take several minutes to scale (though this one is pretty old) Autoscale doesn't seem to work with hard_limit = 1 and soft_limit = 1 - #2 by kurt

My configuration doesn’t seem to have any of these issues. Anyone have any ideas?

$ fly autoscale show
     Scale Mode: Balanced
      Min Count: 2
      Max Count: 7
$ fly scale show
VM Resources for testmysql
        VM Size: shared-cpu-1x
      VM Memory: 512 MB
          Count: 2
 Max Per Region: Not set
$ cat fly.toml 
app = "testmysql"
kill_signal = "SIGINT"
kill_timeout = 5

[deploy]
  release_command = "/app/bin/migrate"

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

Just to ensure I wasn’t crazy, I created yet another app to duplicate the behaviour. I ran the following commands and did only a single deploy. Same behaviour.

    fly scale memory 512
    fly regions add ams
    fly regions add mia
    fly autoscale balanced min=2 max=7
    fly deploy -i registry.fly.io/blabla:0.1

Thanks for reporting this, we’re looking into it and will keep you posted!

Thanks. I even took out the env and experimental blocks and re-ran, but still same behaviour.

I think I found an internal bug possibly causing this issue, and I’ve updated your instance definitions with a tentative fix. Could you do another load test on the running instances (without re-deploying) and let me know if it works this time?

Re-ran test, still no scaling on testmysql.

It is scaling correctly on the 2nd one I deployed. It’s up to 5 instances now. Takes about 2 minutes to detect, I think; I wasn’t really timing it.

The 2nd app is scaling correctly. It takes about 60-90 seconds to come up; scale-down takes about 10 minutes. Is that the expected behaviour?

testmysql is borked in some strange fashion. The logs show that scale-up deployments are being attempted at a very quick pace, but no instances actually get created.

From the 2nd app:

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED   
55ac6af4        app     11      mia     run     running 1 total, 1 passing      0               3m21s ago
700bd4ca        app     11      mia     run     running 1 total, 1 passing      0               5m31s ago
eda38388        app     11      sea     run     running 1 total, 1 passing      0               7m14s ago
fc4eb3a9        app     11      sea     run     running 1 total, 1 passing      0               8m31s ago
a323536a        app     11      ams     run     running 1 total, 1 passing      0               1h48m ago
04d693f9        app     11      ams     run     running 1 total, 1 passing      0               1h48m ago

Thanks for the additional testing! I’ll work on rolling out the fix more permanently.

The autoscaling service is designed to scale up quickly and scale down slowly, with a scale-down lag of 10 minutes, so yes, that’s the expected behavior.
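That asymmetry can be sketched as a simple hysteresis rule (a hypothetical illustration of the behavior described above, not Fly's actual autoscaler; the function name and thresholds are assumptions):

```python
# Sketch of scale-up-fast / scale-down-slow hysteresis (illustrative only).
# Scale-up happens as soon as per-VM load exceeds the soft limit; scale-down
# only after load has stayed low for a full 10-minute lag window.

SCALE_DOWN_LAG = 10 * 60  # seconds

def decide(count, load_per_vm, soft_limit, low_since, now,
           min_count=2, max_count=7):
    """Return (new_count, new_low_since)."""
    if load_per_vm > soft_limit and count < max_count:
        return count + 1, None           # scale up immediately
    if load_per_vm < soft_limit * 0.5:
        if low_since is None:
            return count, now            # start the low-load timer
        if now - low_since >= SCALE_DOWN_LAG and count > min_count:
            return count - 1, now        # scale down after the lag
        return count, low_since
    return count, None                   # load normal: reset timer

# Load spikes: scale up right away.
count, low = decide(2, 24, 20, None, 0)
print(count)  # 3
# Load drops: nothing happens until 10 minutes have passed.
count, low = decide(3, 2, 20, None, 100)
print(count)  # 3
count, low = decide(3, 2, 20, 100, 100 + SCALE_DOWN_LAG)
print(count)  # 2
```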

I think the remaining problem with testmysql may be a separate bug in our deployment system: it looks like you had launched two apps both referencing the same container image, deleted one of them, and now the remaining app (testmysql) can’t locate its image in the registry. We’ll look into this issue as well, but for now re-deploying the app should get you unstuck.

OK, please let me know when the fix is out.

I’m still porting, so I’ll leave these up for you to test/debug with if necessary for the next few days.

Also, another potential bug.

  1. Was testing out machines - created an app called “hora”.
  2. The docker deployment was borked, so instead of deleting the machine, I deleted the entire app.
  3. Created a new app called “hora”.
  4. Deployed successfully as an app.
  5. Never successfully connected to it, as the proxy was having issues with SSL or perhaps a stale DNS entry.

The fix has been deployed, so autoscaling should be working for any future deployments now :rocket:

FYI, there’s a race condition with autoscaling that may be tricky to reproduce.

Essentially, the TCP connection health check can fail because traffic gets routed to the app before the TCP check has passed. The VM then hits the hard connection limit and never becomes healthy, and autoscaling stops scaling at that point.
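A rough model of that failure mode (hypothetical sketch of the sequence described above, not actual proxy code):

```python
# Sketch of the race described above (illustrative only): if the proxy
# routes traffic to a new VM before its TCP health check first passes,
# the VM can fill to the hard limit, the check then never succeeds, and
# the autoscaler stops adding instances.

class VM:
    def __init__(self, hard_limit=25):
        self.hard_limit = hard_limit
        self.active = 0
        self.healthy = False

    def route(self, n):
        """Proxy sends n connections; accepts only up to the hard limit."""
        self.active = min(self.active + n, self.hard_limit)

    def tcp_check(self):
        """Health check can't get a connection once the VM is saturated."""
        if self.active >= self.hard_limit:
            return False
        self.healthy = True
        return True

vm = VM()
vm.route(25)           # traffic arrives before the first check runs
print(vm.tcp_check())  # False -- the check finds the VM saturated
print(vm.healthy)      # False -- VM never becomes healthy, scaling stalls
```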
