This is very neat, and could be a whole “how to build a global load balancer” article. The architectural problem we have here is that the farther an instance of our load balancer is from a VM, the harder it is to accurately know how loaded that VM is. Load fluctuates by the millisecond, and even in the best cases we have to wait a few hundred milliseconds to “see” any changes from some instances.
What ends up happening is VMs reach their hard concurrency limit (set in fly.toml) and we still send them traffic. Prior to this change, we’d queue new connections/requests locally and let the VM work through them as it could. This worked decently, but VMs still ended up with queues that wouldn’t quit.
The retry change prevents queueing in many cases. Now, when a fully loaded VM gets traffic it can’t handle, it sends a message back to the load balancer saying “full, come back later”, and the load balancer reissues the request to another VM.
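The shape of that flow, as a minimal Go sketch (the backend names, `ErrFull`, and the limits are made up for illustration; this isn’t our actual proxy code):

```go
// Sketch of the "full, come back later" retry flow. Illustrative only.
package main

import (
	"errors"
	"fmt"
)

var ErrFull = errors.New("vm at hard concurrency limit, come back later")

// Backend stands in for a VM that pushes back once it hits its hard limit.
type Backend struct {
	Name      string
	inFlight  int
	hardLimit int
}

func (b *Backend) Handle(req string) error {
	if b.inFlight >= b.hardLimit {
		return ErrFull // instead of queueing, push back to the balancer
	}
	b.inFlight++
	defer func() { b.inFlight-- }()
	fmt.Printf("%s handled %s\n", b.Name, req)
	return nil
}

// route tries backends in proximity order; on "full" it reissues the request
// to the next VM rather than queueing it locally.
func route(req string, byProximity []*Backend) error {
	for _, b := range byProximity {
		err := b.Handle(req)
		if err == nil {
			return nil
		}
		if errors.Is(err, ErrFull) {
			continue // retry on another VM
		}
		return err
	}
	return errors.New("no VM had capacity")
}

func main() {
	nrt := &Backend{Name: "nrt-vm", hardLimit: 0} // already at its limit
	hkg := &Backend{Name: "hkg-vm", hardLimit: 25}
	_ = route("GET /", []*Backend{nrt, hkg})
}
```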
This is my favorite from last week. We abuse Consul in many ways. The most heinous thing we do is run it globally as if it’s on a single local network. That usually works fine for us, but sometimes it gets weird.
Consul includes an rtt feature that’s useful for finding the nearest node. When a request comes in for an app, we send it to the closest available VM. Consul rtt was very convenient for this … and also wrong in surprising ways.
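Those estimates come from Consul’s network coordinates, which you can query yourself. Here’s a minimal sketch of that kind of nearest-node lookup, assuming the standard Go client (github.com/hashicorp/consul/api) and a made-up local node name:

```go
// Sketch: estimate RTT between nodes from Consul's network coordinates,
// the same data the `consul rtt` command uses. Illustrative only.
package main

import (
	"fmt"
	"log"
	"time"

	consul "github.com/hashicorp/consul/api"
	"github.com/hashicorp/serf/coordinate"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	entries, _, err := client.Coordinate().Nodes(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Index coordinates by node name.
	coords := map[string]*coordinate.Coordinate{}
	for _, e := range entries {
		coords[e.Node] = e.Coord
	}

	// "edge-nrt-1" is a hypothetical local edge node; pick the node with the
	// lowest *estimated* RTT from it. These estimates are where things can lie.
	local, ok := coords["edge-nrt-1"]
	if !ok {
		log.Fatal("local node not found")
	}
	best, bestRTT := "", time.Duration(1<<62)
	for name, c := range coords {
		if name == "edge-nrt-1" {
			continue
		}
		if rtt := local.DistanceTo(c); rtt < bestRTT {
			best, bestRTT = name, rtt
		}
	}
	fmt.Printf("nearest node: %s (~%s estimated)\n", best, bestRTT)
}
```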
We noticed that some requests coming in to Tokyo were being routed to Hong Kong VMs instead of Tokyo VMs. Hong Kong is 2900km from Tokyo (about 1800 miles), adding 50ms+ of dumb latency.
The culprit was Consul’s RTT metric. It was reporting that nodes in Tokyo were 100ms+ away from each other. Our routing logic naturally thought “50ms is better than 100ms so let’s go to Hong Kong”.
To improve this, we finally did a ping-tracking project we’d been putting off for a while. Every node now pings every other node, and we track RTT, packet loss, etc. ourselves instead of relying on Consul’s estimates. The net result is that requests end up at the lowest-latency VMs that can service them.
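Roughly the shape of it (a sketch, not our actual implementation; the field names and the smoothing factor are made up): each node folds its own ping results into a smoothed RTT and loss estimate per peer, and routing prefers the lowest-latency peer.

```go
// Sketch of per-peer RTT/loss tracking driven by a node's own pings.
package main

import (
	"fmt"
	"math"
	"time"
)

// PeerStats holds what one node learns about one peer from its own pings.
type PeerStats struct {
	RTT  time.Duration // exponentially weighted moving average
	Loss float64       // fraction of recent pings that got no reply
}

const alpha = 0.2 // EWMA smoothing factor (made up for the sketch)

// Observe folds one ping result into the running estimates.
func (p *PeerStats) Observe(rtt time.Duration, replied bool) {
	lost := 1.0
	if replied {
		lost = 0.0
		p.RTT = time.Duration(alpha*float64(rtt) + (1-alpha)*float64(p.RTT))
	}
	p.Loss = alpha*lost + (1-alpha)*p.Loss
}

// Nearest picks the peer with the lowest smoothed RTT, penalizing lossy links.
func Nearest(peers map[string]*PeerStats) string {
	best, bestScore := "", math.Inf(1)
	for name, s := range peers {
		score := float64(s.RTT) * (1 + 10*s.Loss) // crude loss penalty
		if score < bestScore {
			best, bestScore = name, score
		}
	}
	return best
}

func main() {
	peers := map[string]*PeerStats{
		"nrt-vm": {RTT: 3 * time.Millisecond},
		"hkg-vm": {RTT: 48 * time.Millisecond},
	}
	peers["nrt-vm"].Observe(4*time.Millisecond, true)
	fmt.Println("route to:", Nearest(peers)) // nrt-vm, not hkg-vm
}
```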
Incidentally, it’s odd that American companies tend to classify Tokyo/Hong Kong/Singapore/Sydney as “Asia Pacific”. Those cities are nowhere close to each other!
Is the 100ms handshake from an automated test? When we first issue certificates, each edge server has to retrieve them into its cache. Which means if you’re running tests against apps without activity, you need to run them multiple times to get things warmed up.
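If you want to see the warm-up yourself, something like this works (a sketch; myapp.fly.dev is a placeholder hostname): time just the TLS handshake a few times in a row and watch it drop after the first attempt.

```go
// Sketch: time the TLS handshake alone, repeated a few times, to watch
// edge caches warm up. The hostname is a placeholder.
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	const host = "myapp.fly.dev"
	for i := 0; i < 5; i++ {
		tcp, err := net.Dial("tcp", host+":443")
		if err != nil {
			fmt.Println("dial:", err)
			return
		}
		conn := tls.Client(tcp, &tls.Config{ServerName: host})
		start := time.Now()
		err = conn.Handshake()
		fmt.Printf("attempt %d: handshake took %v (err=%v)\n", i+1, time.Since(start), err)
		conn.Close()
	}
}
```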
We track these pretty closely; here’s what we see from updown.io for most apps: