Incorrect app concurrency metrics affecting Fly Proxy scaling

Hi. I noticed something odd in the Grafana dashboard for my Fly app: the app concurrency metrics are all wrong. The service I’m hosting primarily handles long-running connections, so I expect the concurrency to be fairly constant. Instead, the metric jumps between 1, 2, and 18 connections (all of which are wrong):

My app publishes its own metric for the number of active connections, and it has been holding at a pretty steady 46 connections over the same period (allowing for the occasional reconnection from a client):

This doesn’t seem to be limited to the metric itself, either; Fly Proxy’s auto-scaling is affected as well. The http_service.concurrency.soft_limit for my app is 35, but Fly Proxy scaled the app down even though there were 47 active connections at the time (the app concurrency metric incorrectly reported a total of 35 connections across two machines).
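
For reference, my understanding is that Fly Proxy compares its own concurrency count against soft_limit when deciding whether a machine can be stopped, so the part of the config being driven by this metric looks roughly like the sketch below (the soft_limit of 35 is my real value; the auto-stop settings shown here are only illustrative):

[http_service]
  auto_stop_machines = "stop"  # illustrative: lets Fly Proxy stop machines when load drops below the limits
  auto_start_machines = true   # illustrative
  min_machines_running = 1     # illustrative

[http_service.concurrency]
  soft_limit = 35              # the threshold the proxy’s concurrency count is measured against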

Here is the fly_app_concurrency compared to my app’s self-reported concurrency metric over the past two days (Pacific Time):

It starts to drift at around 2025-08-06 23:00:15 PT. I suspect redeploying the app may temporarily mitigate the issue, but I’m afraid it will just occur again in due time.

Any insight into this would be appreciated.

Hm… The left part of that last graph in particular looks like the old “zeroing” behavior… :dragon:

That was changed subsequently, but maybe a few proxy nodes have reverted to the old ways?


I’d also suggest saying a bit more, if you could, about what the clients are doing overall. E.g., are these WebSockets that are mostly idle, or very active RTMP, …? And it would probably be prudent to post the full fly.toml, since there were some nuances mentioned in the past with multiple services, etc.

(Feel free to * out any names that you consider sensitive, but show the full structure.)

Hope this helps a little!

They are WebSocket connections with the server sending a message to the client at least every 30 seconds.

Here it is: transit-tracker-api/fly.toml at main · tjhorner/transit-tracker-api · GitHub

This made me curious, so I checked as far back as Fly metrics go (seems to be ~2 weeks):

The first instance of “drift” I see is at 2025-07-29 17:00:00 Pacific Time. Hopefully that’s helpful in some way.

[http_service.concurrency]
  type = "requests"
  soft_limit = 35

Thanks for the details… You want type = "connections" for WebSockets, actually. (And also fly_app_concurrency was showing request count earlier, as I understand it.)

"requests" usually is the best choice, but…

I can try that, but it’s kind of strange that type = "requests" worked for these WebSocket connections in the first place, no? I’ve been using that option for at least five months and haven’t noticed this issue occurring until now.

Could you clarify what you mean by “earlier”?

In the graphs that you posted at the top of the thread.

It is strange, and I don’t know the details of what that would really be counting…


When I originally switched from connections to requests I actually recall wondering if it would affect fly_app_concurrency (and consequently auto-scaling), so I tested to make sure it worked even with those long-running WebSocket connections. I guess something recently changed with how Fly Proxy calculates it :person_shrugging:

I’ve switched back to connections and will monitor for a bit to see if the issue still occurs. My app didn’t really benefit from the connection pooling/reuse with requests anyway, so I’m not too torn if I need to keep it this way. Thanks for the pointer.


Checked back in after a few days and it seems that the fly_app_concurrency metric is still drifting pretty significantly from the actual number of concurrent connections, even with http_service.concurrency.type = "connections".

Zooming in on a short window of time (the example below is the last 5 minutes) shows a strange pattern where the metric swaps back and forth between two numbers (in this case, 9 and 33):

Restarting the machine temporarily fixes the issue:

Hi @tjhorner — thanks for the detail and the timestamps!

We tracked down a culprit and should have a fix out; let me know if things are looking better now. (You may need a restart to get the count back into a good state.)


Thanks for the update! Things have been looking good since I restarted the misbehaving machine yesterday. I’ll keep monitoring and do another restart if it happens again.
