Hello everyone, hope you’re all doing well.
I have a somewhat specific situation and I’d really appreciate some guidance from the community.
First, some context:
- One of my applications doesn’t have that many end users. Most of the traffic in this particular application comes from integrations with partners who send us batch submissions. A “batch” is basically everything they produced during the day, which they send to us through an endpoint in our API.
- Each of our clients sends their batches at specific times. Right now, since we don’t have many clients, it’s easy to distribute them across different time windows. However, we’re planning to grow, and this is starting to become a concern.
- We also intentionally adopted this time-based distribution strategy because we’re using Fly.io’s managed Postgres on the shared tier, and we want to avoid overloading the database (and avoid scaling up prematurely while we still have a relatively small client base).
Now, the actual problem:
We recently onboarded a new client who sends us a batch of around 80,000 records, and depending on the case, those records may trigger additional requests when specific metadata fields need to be sent. In practice, this client ends up sending us roughly 120,000 requests.
- However, each request they send is synchronous: they wait for our API to confirm the response before sending the next one (or retrying if needed).
Because of that, the number of simultaneous “requests” or “connections” is never very high, which causes fly-proxy to route most (or all) of the traffic to a single machine. Since these requests are CPU-bound, that machine ends up getting throttled even though another machine is running and should be able to help handle the load.
An important detail:
- We have a machine running Nginx as a reverse proxy, which is currently our single point of failure. We use Nginx because we need to apply rate limits on certain endpoints to protect against brute-force attacks and similar scenarios.
One thing that did help was applying throttling to this specific endpoint. By adding a rate limit in Nginx, we trigger our partner’s retry mechanism, which ends up smoothing the load on our side.
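For reference, the rate limit is the standard Nginx limit_req setup. This is just a minimal sketch of the idea; the zone name, the rate, and the /api/batches path are placeholders rather than our real values:

```nginx
# Hypothetical sketch: a shared-memory zone keyed by client IP,
# allowing ~10 requests/second with a small burst before rejecting.
limit_req_zone $binary_remote_addr zone=batch_limit:10m rate=10r/s;

server {
    # Placeholder path for the batch-submission endpoint.
    location /api/batches {
        limit_req zone=batch_limit burst=20;
        limit_req_status 429;  # a 429 triggers the partner's retry logic
        proxy_pass http://prod-xpto.flycast;
    }
}
```

Rejected requests get a 429, which the partner's client treats as "back off and retry", and that retry loop is what smooths the load on our side.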
PS: Our Nginx is routing traffic via Flycast.
map $host $fqdn_xpto {
    hostnames;
    default 127.0.0.1;
    xpto.com prod-xpto.flycast;
    xpto-homolog.com homolog-xpto.flycast;
}
I’ve been considering replacing Nginx with OpenResty to implement something closer to a round-robin strategy using the fly-replay response header. I also tried setting the soft limit to 1, but the issue still persists.
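For context on the soft-limit attempt, this is roughly the fly.toml shape I mean (values are illustrative, not our real config). My understanding is that soft_limit is only a routing preference, while fly-proxy only refuses to send more work to a machine once it hits hard_limit:

```toml
[http_service]
  internal_port = 8080

  [http_service.concurrency]
    # "requests" balances on in-flight requests rather than connections.
    type = "requests"
    soft_limit = 1   # proxy *prefers* other machines past this point
    hard_limit = 25  # proxy *stops* routing to this machine past this point
```

Even with soft_limit = 1, a low-concurrency sequential sender seems to keep landing on the same machine.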
Does anyone have suggestions on how I could better handle this scenario and balance the traffic between machines using Fly.io’s existing tooling?
I would really prefer to avoid going down the OpenResty route if possible.
The rate limiting strategy on this endpoint did help to stabilize things, but it significantly slows down the batch delivery process.
Thanks in advance!