understanding load balancing

Hey folks.

I have an app that handles uploads (think a constant flow of 100KB to 1MB images) to object storage (Tigris). The flow easily saturates the network bandwidth of a single machine (~100 Mbps from what I see), so I'm trying to use more machines with load balancing. What I constantly see is that one machine always handles disproportionately more requests/connections (judging by the App Concurrency metrics), no matter what I do; the balancing only works once I also set hard_limit. The settings I use are:

[http_service.concurrency]
  soft_limit = 1
  hard_limit = 50
  type = 'connections'  # tried requests too

Am I doing something wrong here?

Thanks

The load balancing strategy Fly.io uses is such that if a machine has a lower RTT or is closer to the source of the request, it can receive more requests even if it's at its soft limit.

soft_limit = 1 is too low. Increasing this limit would effectively balance your requests across the other machines.
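For example, something like this might be a better starting point (the numbers here are purely illustrative, not a recommendation; tune them to what one machine can actually handle):

[http_service.concurrency]
  soft_limit = 10   # proxy starts preferring less-loaded machines past this
  hard_limit = 25   # proxy stops routing new work to this machine past this
  type = 'requests'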
Also, do you have any machines that are failing their health checks?


To add to what Ashitag said, 1 is too low and will cause thrashing in the proxying. In this case, you'd want to use type = 'requests'.

Also, I would recommend using presigned urls so the bandwidth doesn’t go through your servers.


Very good point. It’s a great use case for presigned URLs, allowing uploads to go directly to cloud storage, which scales much better than your app. If you need to process images after upload, you can pull them one by one from the storage.
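For reference, here's a minimal sketch of generating a presigned PUT URL with boto3 against an S3-compatible endpoint; the endpoint URL, bucket, and key below are placeholders, and credentials are assumed to come from the environment:

import boto3

# Tigris speaks the S3 API, so boto3 works when pointed at its endpoint.
# Endpoint, bucket, and key here are placeholders; AWS credentials are
# picked up from the environment as usual.
s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-bucket", "Key": "uploads/image-123.jpg"},
    ExpiresIn=3600,  # URL stays valid for one hour
)
print(url)  # hand this to the client; it PUTs the file directly to storage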

I’m talking about machines in the same iad region. I had 10 machines and 1 was chugging like 90%+ of the traffic. All machines are healthy and passing health checks.

I used 1 in an attempt to make the proxy send requests to other machines; I tried 2, 5, etc. before.

Yes, presigned URLs are a good idea, and we did try using them, but we were getting some errors, so we decided to ditch them and handle uploads ourselves so we have full control. The Tigris folks are aware, but together we weren't able to get to the bottom of it right away, and we then switched to the current approach, which has proven very reliable so far.

Here is a screenshot of my experiments with the same load-simulating script and different concurrency settings in fly config. The last one is

[http_service.concurrency]
  soft_limit = 1
  hard_limit = 20
  type = 'requests'

which seems to be working the best.

The pic is a bit confusing and has bits of different experiments mixed in, but as you can guess, those unwieldy single-color ones are from not setting a hard_limit at all, where one machine was handling most of the requests for some reason.
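Roughly, a load-simulating script along these lines is what's meant here; this is only a sketch, with a placeholder endpoint and payload size:

import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Sketch of a load simulator: many concurrent uploads of a fake
# mid-size image. The endpoint is a placeholder.
UPLOAD_URL = "https://myapp.example.com/upload"
PAYLOAD = os.urandom(500 * 1024)  # ~500KB of random bytes

def upload(_):
    return requests.put(UPLOAD_URL, data=PAYLOAD, timeout=30).status_code

with ThreadPoolExecutor(max_workers=50) as pool:
    codes = list(pool.map(upload, range(1000)))

print({code: codes.count(code) for code in set(codes)})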

Perhaps there's a bug with the load balancing not accounting for this scenario? What do you mean by a constant flow of uploads? Do you have a script that dumps a bunch of files to your server?

I'll be working on implementing uploads soon, so I'll keep an eye out for this behavior.

The app concurrency metric isn’t the best for seeing how requests are distributed across machines.

The metric only shows how many requests are currently being processed at the time the metric is collected. That means if the machine happens to be processing no requests during the split second that the metric is being collected, it will report 0 even if it was processing 1000 requests a split second ago.

I would suggest using something like the following query with a stacked graph to get an idea of the distribution of requests:

sum(increase(fly_app_http_response_time_seconds_bucket{app="APP_NAME"}[15s])) by (instance)
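Here, increase(...[15s]) counts how many observations landed in each histogram bucket over each 15-second window, and summing by instance collapses that into one series per machine, so a stacked graph shows each machine's relative share of the traffic.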

Tigris is S3-compatible, so I don't think there would be any issues using presigned URLs. What errors were you getting? If you check your network tab, there was most likely a CORS error on the preflight OPTIONS request. If so, it's just a matter of editing your bucket's CORS policy.
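If that's the case, a sketch of applying a standard S3-style CORS rule with boto3 follows; the origin, bucket, and endpoint are placeholders, and this assumes Tigris accepts the standard S3 CORS configuration:

import boto3

# Allow browser PUT uploads from the app's origin so the preflight
# OPTIONS request succeeds. All values below are placeholders.
s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

s3.put_bucket_cors(
    Bucket="my-bucket",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": ["https://myapp.example.com"],
                "AllowedMethods": ["PUT", "GET"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)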
