understanding load balancing

Hey folks.

I have an app that handles uploads (think a constant flow of 100KB to 1MB images) to object storage (Tigris). The flow easily saturates the network bandwidth of a single machine (~100 Mbps from what I see), so I'm trying to use more machines with load balancing. What I constantly see is that one machine always handles disproportionately more requests/connections (judging by the App Concurrency metrics), no matter what I do; the balancing only works once I also set hard_limit. The settings I use are:

[http_service.concurrency]
  soft_limit = 1
  hard_limit = 50
  type = 'connections'  # tried requests too

Am I doing something wrong here?

Thanks

The load balancing strategy Fly.io uses is such that if a machine has a lower RTT or is closer to the source of the request, it can receive more requests even if it's at its soft limit.

soft_limit = 1 is too low. Increasing this limit would effectively balance your requests across the other machines.
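For example, something like this might be a better starting point (the numbers here are purely illustrative, not a recommendation; tune them to what one machine can actually handle):

[http_service.concurrency]
  soft_limit = 10   # proxy starts preferring less-loaded machines past this
  hard_limit = 25   # proxy stops routing new work to this machine past this
  type = 'requests'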
Also, do you have any machines that are failing their health checks?


To add to what Ashitag said, 1 is too low and will cause thrashing in the proxying. In this case, you'd want to use type = 'requests'.

Also, I would recommend using presigned urls so the bandwidth doesn’t go through your servers.


Very good point. It’s a great use case for presigned URLs, allowing uploads to go directly to cloud storage, which scales much better than your app. If you need to process images after upload, you can pull them one by one from the storage.
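For reference, here's a minimal sketch of generating a presigned PUT URL with boto3 against an S3-compatible endpoint; the endpoint URL, bucket, and key below are placeholders, and credentials are assumed to come from the environment:

import boto3

# Tigris speaks the S3 API, so boto3 works when pointed at its endpoint.
# Endpoint, bucket, and key here are placeholders; AWS credentials are
# picked up from the environment as usual.
s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-bucket", "Key": "uploads/image-123.jpg"},
    ExpiresIn=3600,  # URL stays valid for one hour
)
print(url)  # hand this to the client; it PUTs the file directly to storage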

I’m talking about machines in the same iad region. I had 10 machines and 1 was chugging like 90%+ of the traffic. All machines are healthy and passing health checks.

I used 1 in an attempt to make the proxy send requests to other machines; I tried 2, 5, etc. before.

Yes, presigned URLs are a good idea, and we did try using them, but we were getting some errors, so we decided to ditch them and handle uploads ourselves so we have full control. The Tigris folks are aware, but together we weren't able to get to the bottom of it right away, and we then switched to the current approach, which has proven very reliable so far.

Here is a screenshot of my experiments with the same load-simulating script and different concurrency settings in fly config. The last one is

[http_service.concurrency]
  soft_limit = 1
  hard_limit = 20
  type = 'requests'

which seems to be working the best.

The pic is a bit confusing and has bits of different experiments mixed in, but as you can guess, those unwieldy single-color ones are from not setting a hard_limit at all, where one machine was handling most of the requests for some reason.
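Roughly, a load-simulating script along these lines is what's meant here; this is only a sketch, with a placeholder endpoint and payload size:

import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Sketch of a load simulator: many concurrent uploads of a fake
# mid-size image. The endpoint is a placeholder.
UPLOAD_URL = "https://myapp.example.com/upload"
PAYLOAD = os.urandom(500 * 1024)  # ~500KB of random bytes

def upload(_):
    return requests.put(UPLOAD_URL, data=PAYLOAD, timeout=30).status_code

with ThreadPoolExecutor(max_workers=50) as pool:
    codes = list(pool.map(upload, range(1000)))

print({code: codes.count(code) for code in set(codes)})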

Perhaps there's a bug with the load balancing not accounting for this scenario? What do you mean by a constant flow of uploads? Do you have a script that dumps a bunch of files to your server?

I'll be working on implementing uploads soon, so I'll keep an eye out for this behavior.

The app concurrency metric isn’t the best for seeing how requests are distributed across machines.

The metric only shows how many requests are currently being processed at the time the metric is collected. That means if the machine happens to be processing no requests during the split second that the metric is being collected, it will report 0 even if it was processing 1000 requests a split second ago.

I would suggest using something like the following query with a stacked graph to get an idea of the distribution of requests:

sum(increase(fly_app_http_response_time_seconds_bucket{app="APP_NAME"}[15s])) by (instance)
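Here, increase(...[15s]) counts how many observations landed in each histogram bucket over each 15-second window, and summing by instance collapses that into one series per machine, so a stacked graph shows each machine's relative share of the traffic.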

Tigris is S3-compatible, so I don't think there would be any issues using presigned URLs. What errors were you getting? If you check your network tab, there was most likely a CORS error on the preflight OPTIONS request. If so, it's just a matter of editing your bucket's CORS policy.
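If that's the case, a sketch of applying a standard S3-style CORS rule with boto3 follows; the origin, bucket, and endpoint are placeholders, and this assumes Tigris accepts the standard S3 CORS configuration:

import boto3

# Allow browser PUT uploads from the app's origin so the preflight
# OPTIONS request succeeds. All values below are placeholders.
s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

s3.put_bucket_cors(
    Bucket="my-bucket",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": ["https://myapp.example.com"],
                "AllowedMethods": ["PUT", "GET"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)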
