Balancing improvements for HTTP apps and retry storm protection

Balancing improvements

One of the main tasks of fly-proxy is to balance incoming HTTP requests across the instances of an app. The proxy has to consider several factors when choosing the instance that will handle a request: RTT to the server, the health of both the server and the instance (based on the health checks you defined and the behavior the proxy has observed), whether an instance that isn’t currently running is allowed to be started, and, finally, the current instance load (the number of requests or connections it’s currently handling).

Due to the global nature of the platform we used to broadcast very coarse information about instance load - for every remote instance the proxy only knew whether the instance was below soft_limit, above soft_limit or at hard_limit. This was enough in a lot of situations but fell short if requests weren’t equal (e.g. some took longer to handle) or an app had a spike in traffic.
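To illustrate why three states were not always enough, here is a minimal sketch of that kind of coarse classification. The names `CoarseLoad` and `coarse_load` are hypothetical, and treating a load exactly equal to soft_limit as "above" is an assumption made for the example; this is not fly-proxy's actual code.

```python
from enum import Enum


class CoarseLoad(Enum):
    BELOW_SOFT_LIMIT = "below_soft_limit"
    ABOVE_SOFT_LIMIT = "above_soft_limit"
    AT_HARD_LIMIT = "at_hard_limit"


def coarse_load(current: int, soft_limit: int, hard_limit: int) -> CoarseLoad:
    """Collapse a precise load figure into one of three coarse states.

    Assumption: a load equal to soft_limit counts as "above soft_limit".
    """
    if current >= hard_limit:
        return CoarseLoad.AT_HARD_LIMIT
    if current >= soft_limit:
        return CoarseLoad.ABOVE_SOFT_LIMIT
    return CoarseLoad.BELOW_SOFT_LIMIT
```

With soft_limit = 25 and hard_limit = 100, an instance handling 26 requests and one handling 99 both come out as ABOVE_SOFT_LIMIT, so a remote proxy could not tell them apart.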

As a consequence of the regionalization work, each proxy no longer has to maintain information about all instances, so we can broadcast more detailed load information. The proxies now broadcast precise load to their peers within the same region and also piggy-back load information on HTTP response headers so it can propagate even faster.

With this information available, fly-proxy now balances HTTP requests slightly differently - instead of picking a random instance from an otherwise identical group of instances, the proxy now picks two and chooses the one with lower actual load (the power of two choices). This helps the proxy pick optimal instances in more cases, improves request distribution and lowers response latency.
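The selection step can be sketched roughly like this. It is an illustration of the power-of-two-choices idea only, not the proxy's implementation: the `Instance` type, its `load` field and the `pick_instance` function are hypothetical names, and the candidates are assumed to have already been filtered down to an otherwise identical group (same health, RTT class and so on).

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    load: int  # hypothetical: requests currently being handled


def pick_instance(candidates, rng=random):
    """Sample two distinct candidates and return the one with lower load."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    if len(candidates) == 1:
        return candidates[0]
    first, second = rng.sample(candidates, 2)
    # Ties go to the first sampled instance, which is itself a random choice.
    return first if first.load <= second.load else second
```

One consequence of comparing two samples is that the most loaded instance in a group of distinct loads is never chosen, since whichever instance it is paired with has a lower load.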

Retry storm protection

In some cases fly-proxy is allowed to retry requests sent to your app, for example, when it couldn’t connect to the server that’s running a machine or to the machine itself, or when it couldn’t find a single healthy instance available to handle a request. Such retries improve overall availability as small problems like network hiccups or flappy health checks do not cause the requests to fail. But they could also be a problem if not done carefully.

If your app suddenly becomes popular and gets slashdotted, excessive retries not only prevent the app from recovering, they also put additional load on the proxy itself. fly-proxy is inherently a shared resource used by all the apps on the platform and their clients, and it has to carefully balance the time it spends processing requests for any particular app. An app that’s not behaving correctly, or is simply receiving far too many requests for the resources it has provisioned, should not affect other apps running on the platform.

We will now apply backpressure to HTTP requests if an app is not keeping up with the load, to protect both the app and the proxy itself. Specifically, if the proxy is already retrying enough requests to an app, it will delay processing of new incoming requests to that app until the existing ones are handled. The requests won’t be dropped; they will simply wait in a queue until the requests already in the retry loop either succeed or fail because they hit the retry limit. If your app experiences few (or no) HTTP retries, the backpressure mechanism will not trigger and requests will be processed as usual.
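A rough sketch of the queueing behavior described above, kept per app. Everything here is an assumption made for illustration: the `RetryBackpressure` class, the `max_retrying` threshold and the choice to release all waiting requests at once when the retry count drops below the threshold are not taken from fly-proxy.

```python
from collections import deque


class RetryBackpressure:
    """Per-app gate: hold new requests while too many are being retried."""

    def __init__(self, max_retrying: int):
        self.max_retrying = max_retrying  # hypothetical threshold
        self.retrying = 0                 # requests currently in the retry loop
        self.waiting = deque()            # delayed requests, never dropped

    def admit(self, request_id) -> bool:
        """Return True if the request may proceed now, otherwise queue it."""
        if self.retrying >= self.max_retrying:
            self.waiting.append(request_id)
            return False
        return True

    def retry_started(self) -> None:
        self.retrying += 1

    def retry_finished(self) -> list:
        """Call when a retried request succeeds or hits its retry limit.

        Returns the queued requests that may now proceed, in arrival order.
        """
        self.retrying -= 1
        if self.retrying >= self.max_retrying:
            return []
        released = list(self.waiting)
        self.waiting.clear()
        return released
```

If `retry_started` is never called, `admit` always returns True, which matches the case where an app sees few or no retries and requests are processed as usual.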
