Proxy Fairness, Take 2 (Or: Why Did I Have Trouble Connecting to Fly in Europe?)

A while back we posted about how we were improving scheduling fairness in fly-proxy, our load balancer / anycast proxy that sits in front of all Fly apps. Turns out, when the solution looks stupidly easy, there’s always a take 2 lying around somewhere!

This all started a couple of weeks back, on a Saturday, when we received flappy alerts related to fly-proxy in some European regions like cdg and fra. Because they were flappy, we initially didn’t think much of it (it could just have been a fluke), until we realized that real connections had been failing or timing out in those regions. As we usually do, we started an internal incident and tried to determine what was going on, but the problem stopped happening as we started investigating. Best kind of incident! Because it was a Saturday, we all went back to our own (important or not) weekend business™, figuring this was something we’d look into the following Monday.

Except the same problem came back, to a lesser extent, on Sunday, and again on Monday. fly-proxy wasn’t having as much trouble, but its latency still shot up considerably: not enough to fail a lot of connections, but still significant. We were quickly able to identify which app was the source of the load. It was genuinely a popular app that happened to be receiving a lot of connections at the time, not a case of abuse or an attack. However, based on the number of bytes it was pushing through us, it really didn’t look like it should have caused so much trouble for fly-proxy. We regularly push a couple of times more traffic than that out of a single edge server!

Another clue came from the app’s use of WebSockets. There’s nothing specifically wrong with WebSockets, and this is not related to the other WebSockets issue we fixed recently. But we do know that WebSockets are often used to transfer small-but-frequent message streams – that’s pretty much the use case for them! We were able to confirm that there was indeed a spike of small packets while this was happening, with the target app running close to its concurrency limit. This also means that, when we’re handling a lot of such WebSocket connections for a single app, many of them may get woken up by their TCP sockets at roughly the same time. Recall that in async Rust, a wakeup means a Future needs to be polled. Our runtime, Tokio, would then suddenly have a flood of tasks in its queue. By the time it finishes polling all of them, tens or even hundreds of milliseconds might have passed, and that is forever for the task scheduler of a network application. Everybody else is blocked in the meantime, causing the proxy to appear slow, or in some cases to fail.
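Here’s a toy illustration of that effect, if you want to see the shape of it outside of our proxy. None of this is fly-proxy code, and the task count, per-task work, and thread count are made-up numbers; it just wakes a pile of parked tasks at once and times how long a small Tokio runtime takes to chew through the backlog:

```rust
use std::sync::Arc;
use std::time::Instant;
use tokio::sync::Barrier;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    const TASKS: usize = 50_000;
    let barrier = Arc::new(Barrier::new(TASKS + 1));
    let mut handles = Vec::with_capacity(TASKS);

    // 50k "connections", all parked on the same barrier, each doing a tiny
    // bit of non-yielding work once it gets released.
    for _ in 0..TASKS {
        let barrier = barrier.clone();
        handles.push(tokio::spawn(async move {
            barrier.wait().await;
            // Stand-in for parsing and forwarding one small WebSocket
            // message: a few microseconds of CPU work with no .await in it.
            let mut x = std::hint::black_box(1u64);
            for _ in 0..5_000 {
                x = x
                    .wrapping_mul(6364136223846793005)
                    .wrapping_add(1442695040888963407);
            }
            std::hint::black_box(x);
        }));
    }

    // Release everything at once, the way a burst of socket readiness would,
    // and measure how long it takes the runtime to poll the whole backlog.
    barrier.wait().await;
    let start = Instant::now();
    for handle in handles {
        handle.await.unwrap();
    }
    println!("drained the wakeup burst in {:?}", start.elapsed());
}
```

Exact numbers depend on the hardware and on the made-up constants above, but the shape is the point: any unrelated task that queues up behind a burst like this just waits its turn, which is the “everybody else is blocked” effect described above.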

This scenario is exactly what we attempted to address last time. That work definitely helped, but not enough to stop a single app from causing huge latency spikes. And by huge, I mean huge: we were observing completely unrelated tasks being delayed by hundreds of milliseconds. After ruling out issues with other parts of our code base, that can only mean the Tokio runtime itself was delayed by that much. In other words, it’s the good ol’ fairness issue back to haunt us again.

It turns out that there are two things we overlooked in our last attempt at this:

  1. By default, Tokio’s cooperative scheduling module imposes a budget of 128 on each top-level task. After our last attempt at ensuring fairness, that translates to 128 times the number of worker threads per organization. However, this budget is only consumed when Tokio’s own I/O objects, such as TcpStream, are polled – not by anything else. With all of our wrappers around Tokio’s types and our extra logic on top, a task could still do enough uncounted work per poll to cause latency spikes, even with the cap on polling budget (see the sketch right after this list).
  2. In our last attempt, we used FuturesUnordered as a “sub-executor” to group each organization’s tasks together and limit their combined polling budget, so that you can’t gain more simply by creating more connections. It turns out that FuturesUnordered doesn’t interact well with Tokio’s cooperative scheduling: it detects when a task yields intentionally and propagates the yield to avoid unnecessary polls, but the way Tokio’s tasks yield when they exhaust the budget is different. They pretty much bypass FuturesUnordered’s bookkeeping, leaving it oblivious to the fact that the polling budget is gone. So it will happily call poll on every single remaining task, just for each of them to return Pending again, wasting cycles that could have been used to process more urgent tasks.
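To make the first point concrete, here’s a minimal sketch of the kind of wrapper involved. Inspect and its fields are hypothetical stand-ins rather than our actual proxy types; the thing to notice is that only the poll of the inner, Tokio-backed stream touches the cooperative budget:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::Stream;

// Hypothetical wrapper, loosely modeled on the kind of Stream wrappers the
// proxy stacks on top of Tokio's I/O types.
struct Inspect<Inner> {
    inner: Inner,
    bytes_seen: u64,
}

impl<Inner> Stream for Inspect<Inner>
where
    Inner: Stream<Item = Vec<u8>> + Unpin,
{
    type Item = Vec<u8>;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.get_mut();

        // Only this inner poll consumes cooperative budget, and only if
        // `Inner` eventually bottoms out in a Tokio resource (a TcpStream,
        // a Tokio channel, a timer, ...).
        match Pin::new(&mut this.inner).poll_next(cx) {
            Poll::Ready(Some(frame)) => {
                // Everything we do here (parsing, metrics, routing decisions)
                // is invisible to the budget, so a task with plenty of ready
                // frames can keep running long after Tokio would normally
                // have asked it to yield.
                this.bytes_seen += frame.len() as u64;
                Poll::Ready(Some(frame))
            }
            other => other,
        }
    }
}
```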

The fix is, unsurprisingly, two-part:

  1. We added more tokio::task::coop::poll_proceed and consume_budget calls to our custom Future / Stream wrappers, so that we consume more budget as our wrappers do more processing. This forces tasks to yield more often than before (sketched in the first example below).
  2. We forked FuturesUnordered to add a check of Tokio’s remaining polling budget whenever it receives a Pending from a sub-task. Fortunately, the type is pretty self-contained, so we could lift it into our codebase and apply what amounts to a one-line change (sketched in the second example below).
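Here’s what the first part looks like applied to the hypothetical Inspect wrapper from the earlier sketch: ask the cooperative scheduler for budget before doing any of our own work. The poll_proceed / RestoreOnPending / made_progress() pattern mirrors how Tokio’s own resources use the coop module internally; treat the exact guard API as an assumption and check the tokio::task::coop docs for the Tokio version you’re on:

```rust
use std::pin::Pin;
use std::task::{ready, Context, Poll};

use futures::Stream;
use tokio::task::coop;

struct Inspect<Inner> {
    inner: Inner,
    bytes_seen: u64,
}

impl<Inner> Stream for Inspect<Inner>
where
    Inner: Stream<Item = Vec<u8>> + Unpin,
{
    type Item = Vec<u8>;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.get_mut();

        // Ask the cooperative scheduler for permission first. If this task's
        // budget is already spent, poll_proceed returns Pending (and arranges
        // a wakeup), so we yield instead of grinding on.
        let budget = ready!(coop::poll_proceed(cx));

        match Pin::new(&mut this.inner).poll_next(cx) {
            Poll::Ready(Some(frame)) => {
                // Our per-frame processing now counts against the budget too.
                this.bytes_seen += frame.len() as u64;
                budget.made_progress();
                Poll::Ready(Some(frame))
            }
            // On Pending, the guard is dropped without made_progress(), which
            // hands the reserved budget slot back.
            other => other,
        }
    }
}
```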
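And here’s a deliberately simplified picture of the second part. This is not our fork of FuturesUnordered (the real type keeps its own ready-list and per-future wakers); it only shows the shape of the added check, assuming the has_budget_remaining() helper that recent Tokio versions expose in the public coop module:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

use tokio::task::coop;

// A toy "sub-executor": poll a batch of per-connection futures, but bail out
// as soon as the surrounding Tokio task has no cooperative budget left.
fn poll_batch<F: Future + Unpin>(
    pending: &mut Vec<F>,
    cx: &mut Context<'_>,
) -> Poll<Option<F::Output>> {
    let mut i = 0;
    while i < pending.len() {
        // The added check: once the budget is gone, every remaining future
        // would just return Pending anyway, so stop polling, request another
        // wakeup, and hand the thread back to the runtime.
        if !coop::has_budget_remaining() {
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }

        match Pin::new(&mut pending[i]).poll(cx) {
            Poll::Ready(output) => {
                pending.swap_remove(i);
                return Poll::Ready(Some(output));
            }
            Poll::Pending => i += 1,
        }
    }

    if pending.is_empty() {
        Poll::Ready(None)
    } else {
        Poll::Pending
    }
}
```

The key point is that a sub-executor has to be told about Tokio’s budget explicitly: the futures it drives will just keep returning Pending without saying why, and it would otherwise keep polling all of them for nothing.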

We are fairly confident that this will reduce the impact of this kind of low-traffic but high-frequency connection spike going forward. Of course, we’re also improving how we monitor for this class of issues, so they don’t get written off as “probably just network flappiness” in the first place. We’ll also be keeping a close eye on traffic spikes in the near future to make sure the fix works as expected.


I’m glad to read that you’ve probably figured out the technical aspect of this recurring major issue.

But this part is what concerns me the most to be honest:

If I understand correctly, here is the sequence of events that occurred:

  1. An alarm did trigger internally.
  2. The monitoring team decided to ignore it because it was the weekend — doesn’t this team have people dedicated to handling weekend incidents? I would assume weekends are the most risky time of the week, so a fully managed hosting service would have a team dedicated to that timeframe.
  3. The engineering team decided not to look into it any further until coming back to the office on Monday.
  4. The support team decided to ignore a support topic explicitly titled “Two major outages in two days, and yet no status updates?”

That’s a lot of teams deciding to ignore various sources of reporting, even though the internal alarm system did actually trigger.

I have been relying on Fly.io for years and things have been going great most of the time, but this makes me wonder if it was mostly luck. It took 3 major outages in just 8 days for someone to actually start looking into it.

I understand that the geo aspect of the issue made it harder to detect and investigate, but this is precisely part of Fly.io’s offering, so I would have thought you had great tooling and experience internally to address this kind of issue.


Seems like we’re back at it today and it’s again not visible on your status page?


Yep, same situation again


It is up now:

https://status.flyio.net/incidents/jz5txftk16q3

(Marked 20:15 UTC.)

Possibly this hits some people more than others, at first?