autoscale not scaling

Doing a load test for 5 minutes does not trigger autoscaling even with multiple warnings from the proxy:

2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
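For anyone trying to reproduce: a 5-minute test like this can be generated with any HTTP benchmarking tool, e.g. hey (illustrative invocation, not necessarily the exact tool used here):

    # ~5 minutes of sustained load with 100 concurrent workers, well above
    # the hard_limit of 25 connections per instance
    hey -z 5m -c 100 https://testmysql.fly.dev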


Going through the previous threads about autoscale not scaling, which state:

  1. Autoscale doesn’t work with multi-process (Autoscale doesn't work - #3 by kurt).

  2. Even then, autoscale doesn’t work when you have a processes section (How to disable count scaling, to enable autoscaling? - #2 by amithm7).

  3. Autoscale doesn’t work when you have max-per-region (How to disable count scaling, to enable autoscaling? - #4 by amithm7).

  4. Autoscale could take several minutes to scale, though this one is pretty old (Autoscale doesn't seem to work with hard_limit = 1 and soft_limit = 1 - #2 by kurt).
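For concreteness, the caveats in 1-3 correspond to configuration like the following (illustrative sketch; the process names are made up):

    # Items 1-2: a [processes] section / multi-process app in fly.toml
    [processes]
      web = "/app/bin/server"
      worker = "/app/bin/worker"

    # Item 3: a max-per-region cap, set via flyctl rather than fly.toml:
    #   fly scale count 2 --max-per-region=1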

My configuration doesn’t seem to have any of these issues. Anyone have any ideas?

$ fly autoscale show
     Scale Mode: Balanced
      Min Count: 2
      Max Count: 7
$ fly scale show
VM Resources for testmysql
        VM Size: shared-cpu-1x
      VM Memory: 512 MB
          Count: 2
 Max Per Region: Not set
$ cat fly.toml 
app = "testmysql"
kill_signal = "SIGINT"
kill_timeout = 5

[deploy]
  release_command = "/app/bin/migrate"

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

Just to ensure I wasn’t crazy, I created yet another app to duplicate the behaviour. Ran the following commands and only did a single deploy. Same behaviour.

    fly scale memory 512
    fly regions add ams
    fly regions add mia
    fly autoscale balanced min=2 max=7
    fly deploy -i registry.fly.io/blabla:0.1

Thanks for reporting this, we’re looking into it and will keep you posted!

Thanks. I even took out the env and experimental blocks and re-ran, but still same behaviour.

I think I found an internal bug possibly causing this issue, and I’ve updated your instance definitions with a tentative fix. Could you do another load test on the running instances (without re-deploying) and let me know if it works this time?

Re-ran test, still no scaling on testmysql.

It is scaling correctly on the 2nd one I deployed; it’s up to 5 instances now. Takes about 2 minutes to detect, I think, though I wasn’t really timing it.

The 2nd app is scaling correctly. Takes about 60-90 seconds to come up. Scale-down takes about 10 minutes. Is that the expected behaviour?

testmysql is borked in some strange fashion. The logs show that deployments are being attempted at a very quick pace to scale up, but no instances actually get created.

From the 2nd app:

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED   
55ac6af4        app     11      mia     run     running 1 total, 1 passing      0               3m21s ago
700bd4ca        app     11      mia     run     running 1 total, 1 passing      0               5m31s ago
eda38388        app     11      sea     run     running 1 total, 1 passing      0               7m14s ago
fc4eb3a9        app     11      sea     run     running 1 total, 1 passing      0               8m31s ago
a323536a        app     11      ams     run     running 1 total, 1 passing      0               1h48m ago
04d693f9        app     11      ams     run     running 1 total, 1 passing      0               1h48m ago

Thanks for the additional testing! I’ll work on rolling out the fix more permanently.

The autoscaling service is designed to scale up quickly and scale down slowly with a scale-down lag of 10 minutes, so yes that’s the expected behavior.

I think the remaining problem with testmysql may be a separate bug in our deployment system: it looks like you had launched two apps both referencing the same container image, deleted one of them, and now the remaining app (testmysql) can’t locate its image in the registry. We’ll look into this issue as well, but for now re-deploying the app should get you unstuck.

OK, please let me know when the fix is out.

I’m still porting, so I’ll leave these up for you to test/debug with if necessary for the next few days.

Also, another potential bug.

  1. Was testing out Machines - created an app called “hora”.
  2. The Docker deployment was borked, so instead of deleting the machine, I deleted the entire app.
  3. Created a new app called “hora”.
  4. Deployed successfully as an app.
  5. Never successfully connected to it, as the proxy was having issues with SSL or perhaps a stale DNS entry (see the diagnostic sketch below).
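In case it’s useful: a stale DNS entry vs. a TLS problem can usually be told apart with standard tooling (illustrative commands; assumes the default .fly.dev hostname):

    # What does the app hostname currently resolve to?
    dig +short hora.fly.dev

    # Which certificate does the proxy actually present for that hostname?
    openssl s_client -connect hora.fly.dev:443 -servername hora.fly.dev </dev/null 2>/dev/null \
      | openssl x509 -noout -subject -dates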

The fix has been deployed, so autoscaling should be working for any future deployments now :rocket:


FYI, there’s a race condition with autoscaling that may be tricky to reproduce.

Essentially, the TCP health check can fail because traffic seems to get routed to the app before the check passes. The VM then hits the hard connection limit and never becomes healthy, and autoscaling stops scaling at that point.
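A possible mitigation, though I haven’t confirmed it: give the TCP check a longer grace period so a VM that is briefly saturated at boot isn’t marked unhealthy before it has ever passed a check. A sketch based on the fly.toml above; the 30s value is an assumption, not a verified fix:

    [[services.tcp_checks]]
      # assumption: raising grace_period from "1s" gives the new VM time to
      # pass its first check before failures start counting against it
      grace_period = "30s"
      interval = "15s"
      restart_limit = 0
      timeout = "2s"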


Hello there
I have the same issue and have no idea how to trigger the autoscale.
We set the hard limit to 50 and the soft limit to 25,
and set autoscale balanced min=2 max=6.

When we load test with 200 requests, only about 80% of them succeed; the rest are dropped, and autoscaling is not triggered.

Any ideas about this?

Even though autoscaling isn’t without problems, verify whether your limits are set against (TCP) connections or (HTTP) requests (docs); that’s a critical distinction, since one TCP connection may pipeline several HTTP requests.
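For example, switching the counting unit is a one-line change in fly.toml (a sketch using the 50/25 limits mentioned above):

    [services.concurrency]
      # With type = "connections", one keep-alive connection pipelining many
      # HTTP requests counts as a single unit; with type = "requests", each
      # request counts against the limits individually.
      type = "requests"
      hard_limit = 50
      soft_limit = 25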

Try the new Machines platform, which gives more control over how instances are spun up and down in response to demand. A big (documented) limitation right now is that the Machines platform doesn’t deal with the placement of 2 or more machines in the same region the way it should (also see).

We have been trying both.
When using connections, we got warnings like sin [warn]Instance reached connections hard limit of 50,
and when we use requests, we get no warning but requests are not accepted.


BTW, our app is Laravel with nginx, PHP, and a Supervisor service. Does that affect autoscaling?
I saw this thread, which gives us some clues.


Could you quickly walk me through how you did the load test? Any resources/articles I could learn from? I would like to learn:

  • How many instances would I need? Which type (dedicated CPU or shared CPU), and how much RAM?
  • How to properly set the soft limit and hard limit?