I’m seeing what appears to be incorrect routing behaviour in Fly Proxy. The proxy is consistently sending all traffic to one machine, letting it exceed soft_limit, and even starting additional machines — while a perfectly healthy, idle machine sits available and receives almost no traffic.
App setup:
8 machines in cdg, all healthy
type = "requests", soft_limit = 20, hard_limit = 35
Cloudflare in front (orange cloud), HTTP/2 to origin disabled
React Router v7 SSR app (Node/Express)
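For reference, the concurrency section of fly.toml looks roughly like this (other keys omitted; shown under [http_service], though the same keys also work under [[services]]):

```toml
# Concurrency settings as described above
[http_service.concurrency]
  type = "requests"   # count concurrent HTTP requests, not TCP connections
  soft_limit = 20     # proxy should start preferring other machines past this
  hard_limit = 35     # proxy should never route more than this to one machine
```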
What the proxy should do
With two active machines, both healthy, and soft_limit = 20:
Route traffic to machine A until it reaches 20 concurrent requests
Once A is at or above soft_limit, prefer machine B which is near zero
Only start a third machine if both A and B are saturated
This is the documented behaviour. It is not what is happening.
What is actually happening
Machine 865134ae44d5e8 receives almost all traffic. Machine 865135be470278 sits near zero — even while d5e8 is above soft_limit = 20 and climbing toward hard_limit = 35.
When d5e8 briefly hits hard_limit, the proxy starts additional machines instead of routing to 470278, which is already running and idle. Those additional machines then receive little to no traffic.
This plays out in sustained windows of 10–20 minutes, not seconds.
Two Grafana screenshots of fly_app_concurrency by machine are attached — the pattern is consistent across the entire observation window, and is not correlated with any deployment or restart event.
What I’ve already ruled out
type = "requests" — Double checked that I’m not using “connections” here
Cloudflare Session Affinity — not subscribed to that product
fly-force-instance-id cookie — not set anywhere
fly-replay header — not used
HTTP/2 connection multiplexing from Cloudflare — disabled H2 to origin, no change in behaviour
soft_limit too high — even when d5e8 is visibly above 20 concurrent requests in the Grafana panel, 470278 is not receiving traffic
Machine health — both machines pass health checks, fly status shows all machines healthy
The core issue
The proxy is not routing to the idle machine when the active machine exceeds soft_limit. The documented behaviour says it should. It doesn’t.
Has anyone seen this? Is there a known condition under which Fly Proxy ignores soft_limit and keeps routing to an already-saturated machine instead of an idle healthy one?
Hi! I have been playing a little bit with the hard_limit and had it set to 30 for a while, but saw no improvement.
Can you spare a couple of minutes to explain what you have in mind? I'm not seeing how lowering hard_limit would help with this.
I mean, I understand that the proxy should start aggressively redirecting traffic to the idle machine as soon as the loaded one hits its soft_limit.
I know what you mean. Fly Proxy is not very smart when it comes to handling load and distributing it across multiple machines. Sometimes in my apps, one machine is throttling while the other one is up and healthy, yet the traffic is NOT balanced between them and keeps hitting the one that is struggling.
They suggested setting a hard_limit, but if there is more than one machine running, "I don't care" about hard limits: traffic should be balanced evenly anyway (in my opinion, we are paying for both machines, not just the one that is busy).
My workaround has been to run multiple processes to handle different workloads and place an Nginx reverse proxy in front to distribute the load, which adds complexity to the infrastructure.
FYI: I also reported this issue a few weeks back, but I don’t think anything has changed since my previous threads. Still, it’s “good” to know that someone else is experiencing the same problem.
I hope the Fly.io team adds this to their backlog.
In both of these cases (yours and the OP's), fly-proxy is, in fact, distributing traffic evenly, but the machines are processing requests at different rates. This is usually fine unless the hard_limit is higher than what your machine can actually handle, which is why lowering it is usually the suggestion here. In general, you should also avoid running CPU/IO-heavy tasks whose cost varies wildly between requests inside the request-processing path, and instead move them to a dedicated task queue (e.g. Sidekiq for Ruby, Celery for Python).
Because in this case the hard_limit is higher than what a machine can actually handle. fly-proxy is always allowed to send traffic to a machine until its hard_limit is hit or its health checks start to fail. In fact, in this case we were sending fewer new requests to the instance that appeared to be more loaded. The proxy redirecting new requests elsewhere after soft_limit wouldn't help if we were already sending fewer requests to the loaded machine.
I do agree that supporting some kind of metrics input sourced from inside machines for load-balancing would be useful here, but just looking at CPU usage isn’t a reliable strategy for load balancing on its own.
I’m afraid I’m not following…
Even if machines are serving requests at different rates, fly-proxy is supposed to route the requests based on concurrency.
With type = "requests" set in fly.toml, the expected behaviour would be for it to balance requests based on request concurrency. It shouldn't matter at what rate they are being processed. It should always send requests to the instance with fewer concurrent requests. At the very least, it should start sending requests to the "idle" machine as soon as the soft_limit is reached on the "loaded" machine.
That being said, all the requests served by my app are approximately equal and resolve in a couple hundred milliseconds, and nevertheless one machine gets nearly all the traffic.
The only way request processing time would matter would be if fly-proxy simply round-robined requests across healthy machines, regardless of their soft_limit or concurrency metrics…
That is correct. When balancing between nodes, proxy sorts all instances based on whether they are currently: below soft limit, above soft limit but below hard limit, or above hard limit. It always prefers instances sorted higher in the list (i.e. those below soft limits), and then round-robins between instances within each class except for those above hard limit.
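To make the classification concrete, here is the concurrency block from the original post annotated with those three classes (the comments just paraphrase the behaviour described above; they are not copied from the docs):

```toml
[http_service.concurrency]
  type = "requests"
  soft_limit = 20   # below 20 concurrent requests: preferred class, round-robined with its peers
  hard_limit = 35   # from 20 up to 34: lower-priority class, normally only used when no machine is below soft_limit
                    # at 35 and above: no new requests are routed here at all
```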
There are some important nuances here though. Firstly, propagation of load information is not instant: when load spikes up and down the soft_limit very frequently, there is no guarantee that every proxy sees the latest information and stops sending requests to the machine above soft limit. This is why the limit is called soft, because we accept requests even if a machine appears to be above that limit; hard_limit is where we will definitely not route any more requests there, regardless of state propagation delays. This limit is checked on the local server running your machine, so the state is always consistent.
Second, below soft_limit, the proxy does not sort instances based on the exact load number, and just round-robins between all instances below soft_limit. Checking new requests versus concurrent requests for the two main machines involved in your screenshot:
So: both started out receiving roughly equal amounts of new requests, but one of them seems to be processing them more slowly, causing its concurrency to be higher. Because the concurrency is extremely spiky, it's likely that some other proxy nodes do not get a view of the instance being above soft_limit if their updates happen to come in while the concurrency is momentarily below soft_limit. We did eventually start to send fewer requests to the more loaded instance, but that didn't seem to resolve the higher concurrency.
So it comes back to the two questions / suggestions:
One of the machines seemed to be handling requests more slowly, causing them to pile up; I am not exactly sure why. This could be an app-side or infra-side issue, but neither machine was being CPU-throttled at the time.
Setting a lower hard_limit would help if the one machine really cannot handle this load. This exact problem is why the distinction between soft and hard limits exists: a machine should remain able to process requests all the way up to its hard_limit, so if it falls over before reaching it, the limit is set too high. (A sketch of what a lowered limit could look like follows this list.)
Metric-based balancing would help if crossing soft_limit caused the machines to actually saturate CPU/RAM/…, which is the case sometimes but maybe not for your app. But of course that does not really sidestep propagation delays.
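For example, if a single machine comfortably handles somewhere around 20 concurrent requests, something along these lines would stop the proxy from piling more than that onto it (the numbers are purely illustrative, not a recommendation for your app):

```toml
[http_service.concurrency]
  type = "requests"
  soft_limit = 15   # prefer other machines (and consider starting new ones) past this
  hard_limit = 20   # never route more than this many concurrent requests to one machine
```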
It seems that fly-proxy is waking up ALL the available machines at moments when all active machines are below the soft_limit, and I'm again at a loss trying to understand why.
Please see the attached Grafana screenshot, where you can see that max concurrency is 12 and yet the proxy decides to start all available machines (soft_limit = 15, hard_limit = 25).
I think this is still due to the fact that the request concurrency is very spiky. The graph you’re showing above is probably not showing all the spikes as it seems to average over at least a few seconds due to how it is set up. If you change it to use a shorter interval, those spikes should show up.
When a request spike comes in, it is possible that one machine isn't starting fast enough to handle all the request build-up, so the proxy ends up deciding to start more. There is a cooldown period before we do this to prevent thundering herd effects, but that duration isn't super long because there are cases where we genuinely need to start up as many machines as possible to handle spikes. Using auto_stop_machines = "suspend" may help with this as it would accelerate the machine startup process.
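In fly.toml that would look something like this (auto_start_machines and min_machines_running are shown for context; the floor value is illustrative):

```toml
[http_service]
  auto_start_machines = true
  auto_stop_machines = "suspend"   # suspended machines resume much faster than stopped ones
  min_machines_running = 1         # illustrative floor; keep whatever makes sense for the app
```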
I do think that in your case it might be worth working out why some requests / machines seem to process slower, and whether that apparent imbalance before you tuned down hard_limit was a real problem or not. If your app genuinely needs a lot of resources to process each request, it might also be worth identifying which steps are heavy and moving them to some kind of a job queue.
I had another performance alarm over the weekend, and I'm still seeing machines over the hard_limit while a lot of other machines are idle (please see the attached screenshot).
One thing I'm still not able to reconcile: you mention that hard_limit is checked locally and is always consistent, meaning the proxy will definitely not route requests to a machine above hard_limit. But in the Grafana screenshot I shared, the saturated machine is sitting at 25 concurrent requests with hard_limit = 20.
If hard_limit is enforced locally and consistently, how is that machine receiving new requests past 20? Is fly_app_concurrency (the metric exported to Prometheus) measuring the same thing the proxy uses internally to enforce hard_limit? Or is there a discrepancy between the two counters that could explain this?