Load balancing within a region

How does the load balancing work with multiple app instances within a region? It appears that one instance is serving most of the requests within a single region. Why would this be?

I have 3x app instances in the DFW region.

$ fly status | ag dfw                                                                                                                                                                  
633ef9dc	app    	243    	dfw   	run    	running	1 total, 1 passing	0       	15m10s ago
edc5f96e	app    	243    	dfw   	run    	running	1 total, 1 passing	0       	15m10s ago
12b0e9f4	app    	243    	dfw   	run    	running	1 total, 1 passing	0       	15m10s ago

When I tail the logs and count how many lines belong to each instance, as a rough proxy for the number of requests each one is serving (the log lines themselves suggest this is a reasonable approximation), there is a massive difference: one instance is serving most of the requests.

After running “flyctl logs > fly.logs” for a while, I then run…

$ cat fly.logs | ag 633ef9dc | wc -l                                                                                                                                                   
$ cat fly.logs | ag 12b0e9f4 | wc -l                                                                                                                                                   
$ cat fly.logs | ag edc5f96e | wc -l                                                                                                                                                   

Why would one app instance serve ~15x as many requests as the other two?
I think most of the requests are likely from Bing/Google crawlers, but I wouldn’t expect that to matter.

Our load balancing strategy boils down to: send each request to the least loaded, closest instance. If several instances have the same load and closeness values, we pick one of them at random.

Load is determined by your concurrency limits and how many connections an instance is currently serving. If your concurrency limits are set to the default (soft: 20, hard: 25), then an instance is bucketed by how many connections it has established:

  • fewer than 20 - “under soft limit” - we can send a request there
  • 20-24 - “over soft limit” - we’ll only send a request there if no other instance is under its soft limit
  • 25 - “reached hard limit” - we’ll never send a request there
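The bracketing above can be sketched in a few lines. This is illustrative code, not Fly’s actual proxy logic; the names and defaults are taken from the description above.

```python
# Default concurrency limits described above (illustrative sketch).
SOFT_LIMIT = 20
HARD_LIMIT = 25

def load_bracket(connections: int) -> str:
    """Classify an instance by its current connection count."""
    if connections >= HARD_LIMIT:
        return "reached hard limit"  # never receives new requests
    if connections >= SOFT_LIMIT:
        return "over soft limit"     # only used if nothing is under its soft limit
    return "under soft limit"        # normal candidate for a request

print(load_bracket(5))   # under soft limit
print(load_bracket(22))  # over soft limit
print(load_bracket(25))  # reached hard limit
```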

If your app is not too busy, it’s likely all your instances fall in the “under soft limit” bracket and they’re all good candidates for a request.

Closeness is determined by RTT (round-trip time) between our edge node and the worker node where your instance runs. Even within the same region, we use different datacenters with different RTTs. These RTTs are measured constantly between all servers.
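Putting the two together, the selection works roughly like this sketch: rank non-hard-limited instances by (load bracket, RTT) and break remaining ties randomly. The instance IDs are from the output above; the connection counts and RTTs are made-up numbers, and the function names are mine, not Fly’s.

```python
import random

SOFT_LIMIT, HARD_LIMIT = 20, 25

def bracket(conns: int) -> int:
    # 0 = under soft limit, 1 = over soft limit
    return 0 if conns < SOFT_LIMIT else 1

def pick_instance(instances):
    """instances: list of (name, connections, rtt_ms) as seen from one edge."""
    # Instances at the hard limit are never candidates.
    candidates = [i for i in instances if i[1] < HARD_LIMIT]
    # Lowest (load bracket, RTT) wins; ties are broken randomly.
    best = min((bracket(c), rtt) for _, c, rtt in candidates)
    tied = [name for name, c, rtt in candidates if (bracket(c), rtt) == best]
    return random.choice(tied)

# Three idle DFW instances: the one with the lowest RTT wins every time.
instances = [("633ef9dc", 3, 2.1), ("edc5f96e", 2, 4.0), ("12b0e9f4", 1, 4.0)]
print(pick_instance(instances))  # always "633ef9dc" while under its soft limit
```

Once `633ef9dc` crosses its soft limit, the other two become the preferred set and traffic spreads between them.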

Looking at your specific app, I see its instances are spread across multiple datacenters in DFW. One of them has a slightly better ping than the others from the vantage point of the edge I’m testing from. This means it will be chosen every time until it reaches a higher load bracket.

This behaviour might sound odd, but it’s the fastest way to get to your app. Even if it only saves 2ms: an instance under its soft limit is as responsive as any other, and it’s closer, so we might as well send the request there.

The way to influence this load-balancing behaviour is to change your concurrency limits. If you lower your soft limit and widen the gap between the soft and hard limits, your instances are more likely to fall into the “over soft limit” bracket. When that happens, instances that have not yet reached their soft limit are prioritized, and requests get balanced across nearby servers.
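As a sketch, this kind of tuning lives in your app’s fly.toml under the concurrency section. The values below are illustrative, not a recommendation; tune them to your app’s actual capacity.

```toml
# fly.toml (illustrative values)
[services.concurrency]
  type = "connections"
  soft_limit = 10   # lower soft limit: instances enter "over soft limit" sooner
  hard_limit = 25   # wider soft-to-hard gap keeps instances accepting traffic
```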