The Fly proxy waits for machines to become available even though there are more machines

As a small example, I launched 5 identical worker machines with the following concurrency config:

[http_service.concurrency]
type = "connections"
soft_limit = 1
hard_limit = 1

All the machines are suspended.

Now when I run a test script that connects to the service 5 times, usually only 2 (sometimes 3) machines get resumed, serving 2 requests. The remaining requests are queued to wait for a machine to become available.
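The test script essentially just opens 5 websocket connections at once. A rough sketch of it (using the Node ws client purely for illustration, with a placeholder hostname):

// Sketch only: opens 5 websocket connections to the service concurrently and
// logs how long each one takes to open. The hostname and the use of the "ws"
// package are illustrative, not necessarily what the real script does.
import WebSocket from "ws";

const APP_URL = "wss://app-name.fly.dev"; // placeholder hostname

function connect(i: number): Promise<void> {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    const ws = new WebSocket(APP_URL);
    ws.on("open", () => {
      console.log(`connection ${i} open after ${Date.now() - started}ms`);
      ws.close();
      resolve();
    });
    ws.on("error", reject);
  });
}

// Fire all 5 connections at the same time, one per suspended machine.
Promise.all([1, 2, 3, 4, 5].map(connect)).then(() => console.log("all connected"));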

My expectation was that all 5 machines would be resumed to serve the requests as quickly as possible. Did I misunderstand? This is the rest of my config for reference:

app = 'app-name'
primary_region = 'ams'

[build]
dockerfile = 'Dockerfile'

[http_service]
internal_port = 3002
force_https = true
auto_stop_machines = 'suspend'
auto_start_machines = true
min_machines_running = 0
processes = ['app']

[http_service.concurrency]
type = "connections"
soft_limit = 1
hard_limit = 1

[[vm]]
memory = '4gb'
cpu_kind = 'performance'
cpus = 2

Hey @Probert

Does this happen with requests concurrency as well?
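That is, the same concurrency section but with the type switched to requests, e.g.:

[http_service.concurrency]
type = "requests"
soft_limit = 1
hard_limit = 1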

Hmm, I’ll have to try that out with a different setup. These machines run websocket servers, so currently it doesn’t make sense to limit by requests. (Right?)

Oh, right, with websockets it won't make any difference. I have a theory about why that's happening: with requests concurrency we grab a permit to make a request very early on, so if an edge proxy sent a request to a machine that already has an active request/connection (even if that connection is still waiting for the machine to start/resume) and is at hard_limit, we will reject the request and it will be retried on a different machine.

With connections concurrency (and for websockets) such a permit is grabbed right before actually establishing a connection, which happens after the proxy has already waited for the machine to start/resume.

Let me see if I can fix this so we grab the permit before issuing the start/resume request for connections concurrency as well.
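To illustrate the difference in ordering, here is a schematic sketch. This is not fly-proxy's actual code; the names and the in-memory "machine" are made up to show the two orderings described above.

// Schematic only - not fly-proxy's real implementation.
type Machine = { id: string; started: boolean; activeConnections: number };

const HARD_LIMIT = 1;

// Take a connection slot if the machine is below hard_limit.
function tryAcquirePermit(m: Machine): boolean {
  if (m.activeConnections >= HARD_LIMIT) return false;
  m.activeConnections++;
  return true;
}

async function startMachine(m: Machine): Promise<void> {
  // Stand-in for resuming a suspended machine and polling until it is up.
  m.started = true;
}

// "requests" concurrency: the permit is taken up front, so a request routed to
// a machine already at hard_limit is rejected immediately and the edge proxy
// retries it on a different machine.
async function handleRequestsConcurrency(m: Machine): Promise<"served" | "retry"> {
  if (!tryAcquirePermit(m)) return "retry";
  await startMachine(m);
  return "served";
}

// "connections" concurrency (before the fix): the proxy first waits for the
// machine to start/resume and only then takes the permit, so several requests
// can end up queued behind the same already-busy machine.
async function handleConnectionsConcurrency(m: Machine): Promise<"served" | "retry"> {
  await startMachine(m);
  if (!tryAcquirePermit(m)) return "retry";
  return "served";
}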

@Probert

Just to verify that that's indeed what's happening: could you make those 5 requests with the flyio-debug: doit header set and share the fly-request-id response header for each of them?
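For example, with the Node ws client you can attach the extra header to the websocket handshake and read the handshake response headers via the "upgrade" event (just a sketch; the hostname is a placeholder):

// Sketch: 5 concurrent handshakes with the flyio-debug header, logging the
// fly-request-id response header for each.
import WebSocket from "ws";

for (let i = 0; i < 5; i++) {
  const ws = new WebSocket("wss://app-name.fly.dev", {
    headers: { "flyio-debug": "doit" },
  });
  ws.on("upgrade", (res) => {
    console.log({ name: "fly-request-id", value: res.headers["fly-request-id"] });
    ws.close();
  });
  ws.on("error", (err) => console.error(err));
}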

Done!

These are the response headers:
{ name: 'fly-request-id', value: '01K9ABHX0MGX07K5947P4TSNR8-cdg' }
{ name: 'fly-request-id', value: '01K9ABHX0SVEBDCFYYS3K8RYQF-cdg' }
{ name: 'fly-request-id', value: '01K9ABHX0RFCV2QF24ZKVXFSYG-cdg' }
{ name: 'fly-request-id', value: '01K9ABHX0WTMYZ0DSF6ZBB1JTB-cdg' }
{ name: 'fly-request-id', value: '01K9ABHX0P1E1MJ835V98KB5CG-cdg' }

Edit: Funnily enough, it did resume all 5 machines this time, although each request still seemed to wait for the previous one.

Thanks!

Does your app stop listening as soon as it accepts one request and upgrades it to a websocket connection? It seems like some of the requests attempt to connect to the same machine (because of that permit problem I described) and only one of them sees the machine as started/resumed (the procedure here is to start the machine and repeatedly attempt to connect to it to check whether it has started):

First request:

2025-11-05 15:53:18.647 gonna do the machine dance
2025-11-05 15:53:19.162 connected to local service/machine
2025-11-05 15:53:19.162 Machine became reachable

Another request trying to start/connect to the same machine:

2025-11-05 15:53:18.647 gonna do the machine dance
2025-11-05 15:53:19.156 Machine hasn't started yet...
2025-11-05 15:53:19.164 Machine hasn't started yet...
...
2025-11-05 15:53:26.481 Machine hasn't started yet...

This probably amplifies the problem, as the proxy doesn't even reach the piece of code where it tries to acquire the permit. If it were able to connect, it would have returned a retryable error much faster and let the edge proxy retry the request on a different machine.

Anyway, I think acquiring the permit early on, before attempting to start a machine (as with requests concurrency), should help here. I'll see what I can do.

Thanks for the super fast help!

For context: in my test the machines simply run a Playwright BrowserServer. Technically it can accept many connections; the only limit is the configured concurrency.
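Stripped down, each worker basically does something like this (a sketch based on the public Playwright API; the port is an assumption, chosen to match internal_port from the config above):

// Sketch of the worker: it only launches a Playwright BrowserServer and waits
// for clients to connect over websocket.
import { chromium } from "playwright";

async function main() {
  const server = await chromium.launchServer({ port: 3002 });
  // Clients attach with chromium.connect(server.wsEndpoint())
  console.log("BrowserServer listening at", server.wsEndpoint());
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});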

Hey @Probert

Thanks again for the traces. They were really helpful in finding the root cause.

I’ve just deployed a potential fix to the servers where your machines are running. Could you please check again to see if it makes things better for you?

Hi @pavel

Works like a charm! All machines are resumed pretty much instantly and all requests are handled simultaneously. For completeness' sake I also tested with 10 requests, of which 5 nicely waited for an available machine.

Will this fix be permanent?

Yes, absolutely. I’ll roll it out globally a bit later.

Lovely. Thanks a lot, that was quick!