Finding good [services.concurrency] settings without bringing down prod ;-)

Hi,

first off, I really really appreciate you guys answering questions in the forum diligently!

In fly.toml, is my understanding of the [services.concurrency] section correct?

  • A concurrent request means 1 ongoing TCP request, which usually takes a couple seconds max.
    25 concurrent requests, for example, mean 25 requests being handled at the same time (and probably sent roughly at the same time)
    • How is this handled with WebSockets (and Long-Polling fallbacks)?

And in terms of concurrent requests:

  • When autoscaling is enabled, a new VM instance is brought up when the hard_limit is reached (and very likely already when the soft_limit is reached)

  • When autoscaling is disabled, requests are being queued when the hard_limit is reached and served when below. The soft_limit is unused.

All that said, how do I determine good limits without bringing down the app in production – or having users camp out overnight, in line for their request to be served? :camping:

If my understanding is correct, a…

  • hard_limit too high could risk crashing our application due to large amounts of requests (going OOM), especially with sudden bursts of traffic

  • hard_limit too low could either spin up way too many VMs or stall requests for a long time

Thanks in advance!

Hi @merlin,

  • Your understanding of services.concurrency is mostly correct. Worth mentioning the type option (which it looks like you already discovered): it defaults to connections, but can be set to requests for the http handler. The app’s concurrency (reported by the fly_app_concurrency metric and the ‘VM Service Concurrency’ graph on the Metrics tab of the Dashboard) counts either connections or requests, depending on this setting.
    • WebSocket connections (including long-lived, idle connections) are all included in the concurrency calculation (as are long-polling requests). This does make it less convenient to use autoscaling with applications that hold tons of long-lived, idle connections consuming few resources. (We’ve considered making the query used by autoscaling configurable in the future; let us know if this would be helpful for your use-case.)
  • Beyond not routing requests to instances at the hard_limit, the load-balancer also prefers instances under the soft_limit if any are available. So even when autoscaling is disabled, the soft_limit still acts as a ‘hint’ to help load-balance traffic more evenly.
  • As a general recommendation for tuning concurrency, I would start by setting a conservatively high hard_limit, mostly as a failsafe to prevent OOM from spikes in traffic, and then use the soft_limit as a tighter bound for optimal scaling and load-balancing decisions. You can then tweak the soft_limit over time to best fit your workload without worrying too much about a too-low hard_limit blocking requests and bringing down the app.
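
As a rough sketch of that recommendation, a fly.toml fragment might look something like the following. The numbers are placeholders picked only for illustration, not recommended values; the right ones depend on how much memory each request or connection costs your app. The fragment assumes it sits under your existing services section.

```toml
# fly.toml (fragment). All numeric values below are hypothetical placeholders.
[services.concurrency]
  # Defaults to "connections"; "requests" can be used with the http handler.
  type = "requests"
  # Conservatively high failsafe: instances at this limit receive no new traffic.
  hard_limit = 200
  # Tighter bound: the load balancer prefers instances below this, and it is
  # the value to adjust over time as you learn your workload.
  soft_limit = 50
```

One way to tune from here is to watch the fly_app_concurrency metric under real traffic and move the soft_limit toward the level where your instances still respond comfortably.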

Thank you, this is really helpful!!! Much appreciated