At Fly.io, we run a global network of Anycast “edge” servers that automatically balance requests and connections to the Machines closest to your users. The core of this load-balancing is a piece of software called fly-proxy (not to be confused with the fly proxy command available as part of flyctl!). It terminates all connections targeting Anycast IP addresses, runs load-balancing, and forwards the bytes to the appropriate Machines.
Welcome to Tokio
fly-proxy is written in Rust and makes use of the Tokio async runtime. A quick refresher on Rust async basics: the building block of asynchronous code in Rust is the Future. A Future can be polled, which returns either the finished result or Poll::Pending, asking the runtime to poll it again later.
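For reference, this is the Future trait as it is defined in the standard library (the comments are ours):

use std::pin::Pin;
use std::task::{Context, Poll};

// std::future::Future, reproduced here for reference.
pub trait Future {
    type Output;

    // Returns Poll::Ready(value) if the work is finished, or Poll::Pending
    // if it is not; in the latter case the runtime is expected to poll again
    // once the waker stored in `cx` is notified.
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}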
Every async function in Rust returns a Future, and every time .await is used on them, they are composed into a bigger Future. The innermost layers are types like TcpStream which represent basic I/O operations. After layers of composition, usually a top-level Future ends up being spawned onto the async runtime (in our case, Tokio) as an individual “task”. The runtime is responsible for driving each task forward.
As you may be able to guess, a typical pattern of serving TCP-based traffic in this model is to have one task accepting connections in a loop, and then spawning tasks on the runtime for each connection to serve them, something like
use tokio::io::AsyncReadExt;
use tokio::net::{TcpListener, TcpStream};

async fn accept_connections(listener: TcpListener) {
    while let Ok((conn, _)) = listener.accept().await {
        ... // Make sure we actually want to accept this, etc.
        // Note that serve_connection(conn) returns immediately with a Future --
        // _that_ is what gets spawned as a task here.
        tokio::spawn(serve_connection(conn));
    }
}

async fn serve_connection(mut conn: TcpStream) {
    while let Ok(...) = conn.read(...).await {
        ...
    }
}
Though much more complicated, at its very heart, fly-proxy works under the same model – every connection gets turned into a Tokio task serving that connection. The Tokio runtime implements a work-stealing scheduler that tries to run all tasks spawned onto it fairly.
It turns out that whether a scheduler is “fair” depends a lot on the context, and in a multi-tenant scenario like with fly-proxy, sometimes a fair scheduler can be quite unfair.
Scheduling Fairness in Tokio
At a very high level, Tokio implements fair scheduling in two ways. The first is how it picks tasks to run: every OS thread has a ready queue storing the tasks that can currently be run, and each time the scheduler needs something to run, it pops the head of that queue. Each newly ready task is pushed to the back of the queue (with some caveats to optimize message passing). This means we can be sure that every ready task will eventually be run.
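As a rough mental model (and not Tokio’s actual implementation), you can think of each per-thread queue as a plain FIFO of runnable tasks:

use std::collections::VecDeque;

// A toy model of a per-thread ready queue -- just the FIFO behaviour
// described above, not Tokio's real data structures.
struct ReadyQueue<Task> {
    ready: VecDeque<Task>,
}

impl<Task> ReadyQueue<Task> {
    // The scheduler always takes the task at the head of the queue...
    fn next_task(&mut self) -> Option<Task> {
        self.ready.pop_front()
    }

    // ...and tasks that become ready (again) go to the back, so nothing
    // waits forever behind a busy neighbour.
    fn mark_ready(&mut self, task: Task) {
        self.ready.push_back(task);
    }
}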
This alone doesn’t prevent a task from holding on to the CPU for a long time. Consider the accept_connections function in the previous example: it is ready whenever the underlying TcpListener has a connection waiting to be accepted. The Future representing that function only yields control back to the scheduler when it becomes not ready. But what if it never does? What if the rate of new connections is so high that the .accept() future is always ready? In that case, the task never yields, and Tokio’s scheduler gets no chance to schedule anything else on the same thread. The same applies to the serve_connection function.
For OS threads, this is solved by the kernel forcibly preempting them every so often using something like a timer interrupt; Go does something similar using signals. In Tokio, the cooperative scheduling system makes each task yield every so often instead. This is done by inserting yield points into most I/O types provided by Tokio and assigning a “polling budget” to each top-level future (task). Whenever a task “makes progress” (accepting a connection, reading some bytes, etc.), a unit of budget is consumed. Each task gets 128 units of polling budget every time it is run, and when that reaches 0, the Future is forced to yield by returning Poll::Pending regardless of whether it could make more progress immediately.
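To make that concrete, here is a small sketch (not fly-proxy code) using two real Tokio APIs: tokio::task::unconstrained, which opts a future out of the cooperative budget, and tokio::task::yield_now, which adds a manual yield point to a loop that has none:

use tokio::net::TcpListener;

// The accept loop from before, but with the cooperative budget disabled via
// tokio::task::unconstrained. If connections arrive fast enough that
// accept() is always ready, this task can hog its worker thread
// indefinitely, because nothing ever forces it to yield.
async fn greedy_accept(listener: TcpListener) {
    tokio::task::unconstrained(async move {
        while let Ok((_conn, _)) = listener.accept().await {
            // handle the connection...
        }
    })
    .await;
}

// Conversely, a CPU-bound loop with no Tokio I/O in it has no built-in yield
// points at all; tokio::task::yield_now() inserts one by hand.
async fn cpu_bound_loop() {
    loop {
        // ...do a slice of work...
        tokio::task::yield_now().await;
    }
}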
How Much Budget Would You Like? Yes!
In a program like fly-proxy that serves multiple tenants, this fairness guarantee may not, in fact, be desirable, because it is based on fairness between tasks, which do not map nicely to tenants. Consider a Fly App that regularly serves 1000 times more connections than a typical smaller app, with each connection sending a considerable amount of traffic, enough to make a large portion of these connections ready to make progress most of the time. Because each connection is a Tokio task, we get the following consequences:
- The run queue of Tokio is swamped with tasks from this single Fly App. Even if there are tasks for other apps in there, at any given moment Tokio is much more likely to pick a task from this app than from the others;
- Each task gets the default 128 units of polling budget. The Tokio scheduler has no idea that the task it just ran belonged to the same app (tenant) as the one it is about to run, so every one of them gets the full 128 units. Effectively, the one app that dominates the number of connections gains far more total polling budget, and can starve apps with fewer connections of a chance to be polled.
To be clear, from the scheduler’s perspective, this is perfectly fair: we’re still serving each connection fairly. But from our perspective, this is unfair: why should an app get more chances of polling just because it created more connections? Why should another app not get a fair share of resources, just because it is not trying to create tens of thousands of concurrent connections?
Part of the solution here is to clamp down on bandwidth when a single app starts to dominate an edge server; @bglw has been working on a solution on that front. But even without high bandwidth, a sufficiently high number of concurrent connections making frequent but small transfers can cause, and has caused, similar problems. We’ve even seen cases where the proxy’s keepalive loop with systemd failed to be scheduled in time, causing systemd to kill the process.
Constraining Single-Tenant Polling Budget
The obvious solution here is to somehow “group” tasks belonging to each tenant, such that
- Polling budget should be shared across all tasks belonging to a tenant; and
- We should prevent the runtime from picking tasks belonging to the same tenant all the time, even if that tenant dominates in the number of connections.
We considered implementing our own cooperative scheduling and polling budget by adapting Tokio’s implementation, but that would necessitate wrapping every single I/O Future that can be considered to “make progress”. If only there were something that could “package” a bunch of Futures as one single task and run them in a “sub-queue”, like some sort of… “sub-runtime”?
…oh wait, that’s just the well-known FuturesUnordered from the futures crate, which used to be where the Future type lived and now provides a bunch of utilities and extensions for Future-related types. It is usually intended for cases where we’d like to run a few actions concurrently but don’t want to explicitly spawn a bunch of new tasks onto the runtime. Incidentally, it is also almost exactly what we’re looking for here: it has its own internal, independent run queue, and it can itself be polled as a Stream (like a Future, but one that can be polled for multiple values, hence Stream). All we need is a loop like
use futures::future::BoxFuture;
use futures::stream::{FuturesUnordered, StreamExt};
use tokio::sync::mpsc;

// One of these loops per tenant: every future sent down `tx` runs inside
// this single Tokio task and shares its polling budget.
let (tx, mut rx) = mpsc::unbounded_channel::<BoxFuture<'static, ()>>();
let mut fut = FuturesUnordered::new();
tokio::spawn(async move {
    loop {
        tokio::select! {
            // Oh, something completed? That's good, but we don't really care
            // about the result here.
            Some(_) = fut.next() => {}
            // A future we would otherwise have spawned directly arrives...
            future_to_spawn = rx.recv() => {
                match future_to_spawn {
                    Some(future_to_spawn) => fut.push(future_to_spawn),
                    // ...or every sender is gone, and this group is done.
                    None => break,
                }
            }
        }
    }
});
and simply send Futures we would have originally tokio::spawn’d into that tx channel instead. Note that the code above still has some obvious pitfalls – that’s because it is just for illustrative purposes, and I don’t want to bore everyone with the grueling details of dealing with all the edge cases.
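With the channel typed as in the sketch above, “spawning” a connection into its tenant’s group might look like this (serve_connection is the example handler from earlier; the Box::pin erases the concrete future type, since all futures in one FuturesUnordered must have the same type):

// Instead of tokio::spawn(serve_connection(conn)), hand the future to the
// tenant's group loop; it will be polled from inside that single task.
let _ = tx.send(Box::pin(serve_connection(conn)));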
This solution ensures that each tenant, instead of each connection of every single tenant, is treated as one “unit” in the scheduler’s eyes. A tenant only gets 128 units of polling budget as a whole, and if it has consumed all of it during one iteration, it yields to a different tenant instead of to another connection of the same tenant. Note that fairness between connections within each tenant is still preserved: if one task of a tenant (one entry in that tenant’s FuturesUnordered group) consumes all of the budget, it yields back to the FuturesUnordered first, which respects the yield and places the task at the back of its own run queue. The next time the runtime polls this FuturesUnordered, a different task is likely to be polled first.
One more slight problem here is that since each FuturesUnordered ends up as one Tokio task, each group like this can only consume up to one CPU core at once. To make use of multiple cores, we need to spawn multiple FuturesUnordered loops per tenant, and spawn tasks onto them on a round-robin basis.
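A minimal sketch of that round-robin distribution, assuming each tenant keeps the sender halves of several group loops like the one above (the names here are made up for illustration):

use std::sync::atomic::{AtomicUsize, Ordering};
use futures::future::BoxFuture;
use tokio::sync::mpsc::UnboundedSender;

// One handle per tenant, holding one group loop (and thus one Tokio task)
// per core we allow the tenant to use.
struct TenantGroups {
    senders: Vec<UnboundedSender<BoxFuture<'static, ()>>>,
    next: AtomicUsize,
}

impl TenantGroups {
    // Hand a connection future to the tenant's group loops in round-robin
    // order, so the tenant's connections can spread over multiple cores
    // while each loop still shares a single task's polling budget.
    fn spawn(&self, future: BoxFuture<'static, ()>) {
        let idx = self.next.fetch_add(1, Ordering::Relaxed) % self.senders.len();
        let _ = self.senders[idx].send(future);
    }
}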
The Numbers
This change has been deployed to production fly-proxy, but unfortunately we don’t yet have a lot of production data on how well it handles the connection spikes we see from time to time. We might report back later! In the meantime, I have run some local benchmarks simulating two apps: app 1 serves a lot of connections with tons of traffic, basically saturating lo, while I simultaneously run a simple HTTP latency benchmark against app 2 using oha.
Here’s the output without this change:
Summary:
Success rate: 100.00%
Total: 180.0027 sec
Slowest: 1.8259 sec
Fastest: 0.0078 sec
Average: 0.2710 sec
Requests/sec: 184.6694
Total data: 48.37 MiB
Size/request: 1.49 KiB
Size/sec: 275.15 KiB
Response time histogram:
0.008 sec [1] |
0.190 sec [15036] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.371 sec [10449] |■■■■■■■■■■■■■■■■■■■■■■
0.553 sec [4426] |■■■■■■■■■
0.735 sec [1894] |■■■■
0.917 sec [704] |■
1.099 sec [401] |
1.281 sec [178] |
1.462 sec [47] |
1.644 sec [43] |
1.826 sec [12] |
And here’s the output with this change:
Summary:
Success rate: 100.00%
Total: 180.0058 sec
Slowest: 1.1801 sec
Fastest: 0.0073 sec
Average: 0.1650 sec
Requests/sec: 303.0014
Total data: 79.41 MiB
Size/request: 1.49 KiB
Size/sec: 451.72 KiB
Response time histogram:
0.007 sec [1] |
0.125 sec [25335] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.242 sec [18747] |■■■■■■■■■■■■■■■■■■■■■■■
0.359 sec [6644] |■■■■■■■■
0.476 sec [2352] |■■
0.594 sec [862] |■
0.711 sec [336] |
0.828 sec [158] |
0.946 sec [41] |
1.063 sec [12] |
1.180 sec [4] |
This is in no way the best benchmark ever – running all of this on a single machine definitely messes with the latency numbers – but the average latency drops from 271 ms to 165 ms. Even though we still see a somewhat high tail latency of 1.18 seconds, the whole latency distribution is much better than before, and app 2’s requests per second go up by more than 60% with this change. App 1 is still mostly able to saturate bandwidth when it does not affect app 2.
Part of the still-high tail latency may boil down to the fact that we’re running this on a single machine, and that, in order to be fair to the non-FuturesUnordered case, we allowed each “group” (tenant) here to use all of the CPU cores, with num_cpu FuturesUnordered instances. If we restrict that to only a subset of cores (which is the configuration we’re running in production), things of course look even better:
Summary:
Success rate: 100.00%
Total: 180004.3510 ms
Slowest: 681.4741 ms
Fastest: 2.8153 ms
Average: 66.7988 ms
Requests/sec: 748.3708
Total data: 196.23 MiB
Size/request: 1.49 KiB
Size/sec: 1.09 MiB
Response time histogram:
2.815 ms [1] |
70.681 ms [88492] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
138.547 ms [38794] |■■■■■■■■■■■■■■
206.413 ms [5479] |■
274.279 ms [1209] |
342.145 ms [360] |
410.011 ms [167] |
477.876 ms [85] |
545.742 ms [44] |
613.608 ms [23] |
681.474 ms [7] |