autoscale not scaling

Running a load test for 5 minutes does not trigger autoscaling, even with multiple warnings from the proxy:

2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[05525023] sea [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
2022-07-25T18:32:43Z proxy[eb2ee7d6] ams [warn]Instance reached connections hard limit of 25
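For context, the warnings above correspond to the `hard_limit = 25` in the concurrency config below. A minimal sketch of how a per-instance hard/soft limit behaves (illustrative model only, not Fly's actual proxy code):

```python
# Minimal model of a per-instance connection limit (illustrative only).
# Once an instance holds `hard_limit` concurrent connections, further
# connections are refused -- which is when the proxy logs the
# "reached connections hard limit" warning and autoscaling should kick in.

class Instance:
    def __init__(self, hard_limit=25, soft_limit=20):
        self.hard_limit = hard_limit
        self.soft_limit = soft_limit
        self.active = 0

    def try_connect(self):
        """Accept a connection unless the hard limit is reached."""
        if self.active >= self.hard_limit:
            return False  # refused: "reached connections hard limit"
        self.active += 1
        return True

    def over_soft_limit(self):
        """Above the soft limit the proxy prefers other instances."""
        return self.active >= self.soft_limit

inst = Instance()
results = [inst.try_connect() for _ in range(30)]
print(results.count(True))     # 25 -- connections beyond 25 are refused
print(inst.over_soft_limit())  # True
```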


Going through the previous threads about autoscale not scaling, which state:

  1. Autoscale doesn’t work with multi-process. Autoscale doesn't work - #3 by kurt

  2. Even then, autoscale doesn’t work when you have a processes section. How to disable count scaling, to enable autoscaling? - #2 by amithm7

  3. Autoscale doesn’t work when you have max-per-region How to disable count scaling, to enable autoscaling? - #4 by amithm7

  4. Autoscale could take several minutes to scale (though this one is pretty old) Autoscale doesn't seem to work with hard_limit = 1 and soft_limit = 1 - #2 by kurt

My configuration doesn’t seem to have any of these issues. Anyone have any ideas?

$ fly autoscale show
     Scale Mode: Balanced
      Min Count: 2
      Max Count: 7
$ fly scale show
VM Resources for testmysql
        VM Size: shared-cpu-1x
      VM Memory: 512 MB
          Count: 2
 Max Per Region: Not set
$ cat fly.toml 
app = "testmysql"
kill_signal = "SIGINT"
kill_timeout = 5

[deploy]
  release_command = "/app/bin/migrate"

[env]

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 4000
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

Just to ensure I wasn’t crazy, I created yet another app to duplicate the behaviour. I ran the following commands and did only a single deploy. Same behaviour.

    fly scale memory 512
    fly regions add ams
    fly regions add mia
    fly autoscale balanced min=2 max=7
    fly deploy -i registry.fly.io/blabla:0.1

Thanks for reporting this, we’re looking into it and will keep you posted!

Thanks. I even took out the env and experimental blocks and re-ran, but still same behaviour.

I think I found an internal bug possibly causing this issue, and I’ve updated your instance definitions with a tentative fix. Could you do another load test on the running instances (without re-deploying) and let me know if it works this time?

Re-ran test, still no scaling on testmysql.

It is scaling correctly on the 2nd one I deployed. It’s up to 5 instances now. Takes about 2 minutes to detect, I think; I wasn’t really timing it.

The 2nd app is scaling correctly. It takes about 60-90 seconds to come up; scale-down takes about 10 minutes. Is that the expected behaviour?

testmysql is borked in some strange fashion. The logs show that scale-up deployments are being attempted at a very quick pace, but no instances actually get created.

From the 2nd app:

Instances
ID              PROCESS VERSION REGION  DESIRED STATUS  HEALTH CHECKS           RESTARTS        CREATED   
55ac6af4        app     11      mia     run     running 1 total, 1 passing      0               3m21s ago
700bd4ca        app     11      mia     run     running 1 total, 1 passing      0               5m31s ago
eda38388        app     11      sea     run     running 1 total, 1 passing      0               7m14s ago
fc4eb3a9        app     11      sea     run     running 1 total, 1 passing      0               8m31s ago
a323536a        app     11      ams     run     running 1 total, 1 passing      0               1h48m ago
04d693f9        app     11      ams     run     running 1 total, 1 passing      0               1h48m ago

Thanks for the additional testing! I’ll work on rolling out the fix more permanently.

The autoscaling service is designed to scale up quickly and scale down slowly, with a scale-down lag of 10 minutes, so yes, that’s the expected behavior.
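That asymmetry can be sketched as a simple hysteresis rule (a hypothetical illustration of the behavior described above, not Fly's actual autoscaler; the function name and thresholds are assumptions):

```python
# Sketch of scale-up-fast / scale-down-slow hysteresis (illustrative only).
# Scale-up happens as soon as per-VM load exceeds the soft limit; scale-down
# only after load has stayed low for a full 10-minute lag window.

SCALE_DOWN_LAG = 10 * 60  # seconds

def decide(count, load_per_vm, soft_limit, low_since, now,
           min_count=2, max_count=7):
    """Return (new_count, new_low_since)."""
    if load_per_vm > soft_limit and count < max_count:
        return count + 1, None           # scale up immediately
    if load_per_vm < soft_limit * 0.5:
        if low_since is None:
            return count, now            # start the low-load timer
        if now - low_since >= SCALE_DOWN_LAG and count > min_count:
            return count - 1, now        # scale down after the lag
        return count, low_since
    return count, None                   # load normal: reset timer

# Load spikes: scale up right away.
count, low = decide(2, 24, 20, None, 0)
print(count)  # 3
# Load drops: nothing happens until 10 minutes have passed.
count, low = decide(3, 2, 20, None, 100)
print(count)  # 3
count, low = decide(3, 2, 20, 100, 100 + SCALE_DOWN_LAG)
print(count)  # 2
```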

I think the remaining problem with testmysql may be a separate bug in our deployment system: it looks like you had launched two apps both referencing the same container image, deleted one of them, and now the remaining app (testmysql) can’t locate its image in the registry. We’ll look into this issue as well, but for now re-deploying the app should get you unstuck.

OK, please let me know when the fix is out.

I’m still porting, so I’ll leave these up for you to test/debug with if necessary for the next few days.

Also, another potential bug.

  1. Was testing out machines - created an app called “hora”.
  2. The docker deployment was borked, so instead of deleting the machine, I deleted the entire app.
  3. Created a new app called “hora”.
  4. Deployed successfully as an app.
  5. Never successfully connected to it, as the proxy was having issues with SSL or perhaps a stale DNS entry.

The fix has been deployed, so autoscaling should be working for any future deployments now :rocket:

FYI, there’s a race condition with autoscaling that may be tricky to reproduce.

Essentially, the TCP connection health check can fail because traffic gets routed to the app before the TCP check has passed. The VM then hits the hard connection limit and never becomes healthy, and autoscaling stops scaling at that point.
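A rough model of that failure mode (hypothetical sketch of the sequence described above, not actual proxy code):

```python
# Sketch of the race described above (illustrative only): if the proxy
# routes traffic to a new VM before its TCP health check first passes,
# the VM can fill to the hard limit, the check then never succeeds, and
# the autoscaler stops adding instances.

class VM:
    def __init__(self, hard_limit=25):
        self.hard_limit = hard_limit
        self.active = 0
        self.healthy = False

    def route(self, n):
        """Proxy sends n connections; accepts only up to the hard limit."""
        self.active = min(self.active + n, self.hard_limit)

    def tcp_check(self):
        """Health check can't get a connection once the VM is saturated."""
        if self.active >= self.hard_limit:
            return False
        self.healthy = True
        return True

vm = VM()
vm.route(25)           # traffic arrives before the first check runs
print(vm.tcp_check())  # False -- the check finds the VM saturated
print(vm.healthy)      # False -- VM never becomes healthy, scaling stalls
```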
