Concurrency hard limit not respected?

Hi there, everyone! We’re experiencing some weird behaviour around load-balancing/concurrency that I’m hoping somebody has some insight into.

We noticed that our Rails app was occasionally throttled for exceeding its CPU balance, causing long queue/response times. But we run several machines, and when this happens, most traffic seems to be hitting a single machine, and only that machine struggles under the weight.

To mitigate, we configured autoscaling via starting/stopping machines, with explicit soft and hard limits:

[http_service]
  processes = ["web"] # this service only applies to the web process
  http_checks = []
  internal_port = 8080
  protocol = "tcp"
  script_checks = []
  force_https = true
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 5
  soft_limit = 10
  hard_limit = 15

Yet this hasn’t mitigated the issue at all; we still often see a single machine experiencing high concurrency while the other machines sit idle:

Furthermore, as can be seen above, the machine is never flagged as having hit the hard limit.

I’ve theorized for a while that these spikes may all be requests coming from a single client, since some load balancers prefer to route all of a given client’s requests to the same machine. But I was under the impression that the hard limit should prevent even this case: if concurrency hits the hard limit, the load balancer should stop routing any more traffic to that machine.
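To make my mental model concrete, here’s a toy sketch (purely illustrative; names and logic are my own assumptions, not Fly Proxy’s actual algorithm) of how I’d expect a least-loaded balancer with soft and hard limits to route:

```python
# Toy model of routing with soft and hard concurrency limits.
# Illustrative only -- NOT how Fly Proxy actually works.

SOFT_LIMIT = 10
HARD_LIMIT = 15

def pick_machine(loads):
    """Pick a machine index for the next request.

    Machines at or above HARD_LIMIT are excluded entirely; among the
    rest, prefer machines below SOFT_LIMIT, falling back to any machine
    with remaining capacity. Returns None if every machine is saturated.
    """
    eligible = [i for i, load in enumerate(loads) if load < HARD_LIMIT]
    if not eligible:
        return None  # all machines at the hard limit: queue or reject
    under_soft = [i for i in eligible if loads[i] < SOFT_LIMIT]
    pool = under_soft or eligible
    return min(pool, key=lambda i: loads[i])  # least-loaded wins

# One hot machine, four idle ones: the next request should go elsewhere.
loads = [14, 0, 0, 0, 0]
print(pick_machine(loads))  # → 1
```

Under this model, the hot machine would stop receiving traffic well before saturation, which is exactly what we’re not seeing.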

It’s clear that I’m misunderstanding something. Can anybody explain why our hard limit doesn’t seem to be respected, and perhaps suggest how we might mitigate our problem?

Hi… This might just have been a copy-and-paste glitch in the above, but limits actually need to go under a special sub-section:

[http_service.concurrency]  # ←
  soft_limit = 10
  hard_limit = 15

(I wish flyctl itself would warn about these.)
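For completeness, the whole section would look roughly like this with the limits moved (keeping your other settings as they were; the `type` line is my assumption of the default for `http_service`):

```toml
[http_service]
  processes = ["web"]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 5

  [http_service.concurrency]
    type = "requests"  # default for http_service, if I remember right
    soft_limit = 10
    hard_limit = 15
```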

Effectively, you had an ∞ hard limit before…

Hope this helps!

:person_facepalming: Well, I do feel foolish! I was setting this up from the autostart/stop docs, which I don’t think make it very obvious where these should go; I should have dug deeper into the specifics! I’ll update this and see if it solves the issue.

I am still fairly curious about the load-balancing behaviour and how we end up in this situation in the first place, though. I find it interesting that we still trigger the default soft limit (shown in the graph above) and yet so much traffic gets routed to the one machine. Is my suspicion that it’s all traffic coming from a single client the likely culprit?


Hm… That wouldn’t be my own guess; as far as I know, the Fly Proxy has no “stickiness” mechanism like that.

If you still see a big imbalance within a single region after changing fly.toml, try posting the multi-instance graphs (i.e., all Machines on the same chart) along with a detailed list of which regions you are using and how many machines you have in each, and things of that sort.
