Fixing an Intermittent WebSocket Issue

Recently we started getting reports, particularly from users with Phoenix LiveView sites, that WebSockets would (sometimes) fail to establish.

Like all good problems, it seemed to only affect certain people, in certain regions, and it disappeared whenever we tried to do anything to observe it. To our great relief, we have finally tracked down the culprit. If you have encountered any strange WebSocket issues over the last week or so, this should be resolved for you now.

The Culprit

In our (http) topology, the Fly Proxy maintains pools of idle connections between hosts. There’s a pool of HTTP/2 connections to use, and a pool of HTTP/1.1 connections to use. WebSocket upgrades are HTTP/1.1, so they’ll use a connection from that pool at each hop.

There are two paths that a request can take through our network. If it arrives in the same region as one of your machines, it’ll travel Edge Proxy <> Backhaul Proxy <> VM. If it needs to travel to another region, it’ll travel Edge Proxy <> Multihop Proxy <> Backhaul Proxy (maybe) <> VM.

For http apps, each of the <> hops listed can reuse a connection from a pool, with different sharing semantics for each one. Backhaul <> VM is a pool of connections for your instance. Edge <> Backhaul is a pool of connections for your app. And Edge <> Multihop is a pool of connections for a bunch of apps.
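
To make the sharing semantics concrete, here is a small sketch (all names invented, not our actual code) of how each hop keys its idle-connection pool at a different granularity:

```python
# Hypothetical sketch of the three pool scopes described above. Each hop
# keys its pool differently, so a pooled connection is shared more or
# less widely depending on where it sits in the path.
NUM_SHARDS = 8  # assumed shard count for illustration

def pool_key(hop, app, instance):
    if hop == "backhaul_to_vm":
        return ("instance", instance)             # one pool per machine
    if hop == "edge_to_backhaul":
        return ("app", app)                       # one pool per app
    if hop == "edge_to_multihop":
        return ("shard", hash(app) % NUM_SHARDS)  # many apps share a shard
    raise ValueError(f"unknown hop: {hop}")
```

The multihop case is the interesting one: because many apps hash into the same shard, a connection idled there by one app can later be handed to a completely different app's request.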

An interesting case is when HTTP/1.0 requests and responses travel through these pools: They can use one of the HTTP/1.1 connections, as these pools are really just “not HTTP/2” pools. It turns out that, when a server responds with HTTP/1.0, hyper permanently marks that connection as HTTP/1.0. By default, HTTP/1.0 has Connection: close semantics, but if a Fly App returns an HTTP/1.0 response with an explicit Connection: keep-alive, hyper puts that (now HTTP/1.0-ified) connection back into the pool.
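
As a toy model (not hyper's real code), the pooling decision described above looks roughly like this:

```python
# Toy model of the described behavior: serving an HTTP/1.0 response
# permanently marks the connection as HTTP/1.0, and an explicit
# "Connection: keep-alive" puts it back into the pool anyway.
class Conn:
    def __init__(self):
        self.version = "HTTP/1.1"

def recycle(conn, response_version, response_headers, pool):
    if response_version == "HTTP/1.0":
        conn.version = "HTTP/1.0"  # permanent downgrade
        if response_headers.get("connection", "").lower() == "keep-alive":
            pool.append(conn)      # poisoned connection re-enters the pool
        # otherwise: HTTP/1.0 defaults to Connection: close, so drop it
    else:
        pool.append(conn)
```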

This means that later on, when a poor unsuspecting HTTP/1.1 request is given one of these connections to reuse, hyper does two problematic things:

  1. The connection is forcefully downgraded to HTTP/1.0
  2. The Connection header is unconditionally set to keep-alive
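
In toy-model form (mirroring the description above, not hyper's internals), here is what those two behaviors do to the next request that reuses such a connection:

```python
# Toy model of reusing a downgraded connection: the request line goes
# out as HTTP/1.0 and the Connection header is clobbered, which is fatal
# for a WebSocket upgrade that needs "Connection: Upgrade".
def send_on(conn_version, method, path, headers):
    headers = dict(headers)
    if conn_version == "HTTP/1.0":
        headers["connection"] = "keep-alive"  # unconditionally overwritten
    return f"{method} {path} {conn_version}", headers

line, hdrs = send_on("HTTP/1.0", "GET", "/live/websocket",
                     {"connection": "Upgrade", "upgrade": "websocket"})
# the upgrade request now leaves as HTTP/1.0 with connection: keep-alive
```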

The former is intentional, the latter seems like a bug, and both can-or-will break WebSockets. WebSockets only work over HTTP/1.1, and strict backends enforce that the Connection header is correct. When this happened, a Phoenix backend, for example, might log something like:

** (WebSockAdapter.UpgradeError) 'connection' header must contain 'upgrade', got ["keep-alive"]

And the WebSocket upgrade would error out before it could establish.
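
The strict check a backend performs looks roughly like this (a sketch of RFC 6455's requirement; the helper name is ours, not WebSock's):

```python
def validate_upgrade(headers):
    # RFC 6455: the Connection header must include the "upgrade" token,
    # and the Upgrade header must name the websocket protocol.
    tokens = [t.strip().lower() for t in headers.get("connection", "").split(",")]
    if "upgrade" not in tokens:
        raise ValueError(f"'connection' header must contain 'upgrade', got {tokens}")
    if headers.get("upgrade", "").lower() != "websocket":
        raise ValueError("'upgrade' header must be 'websocket'")
```

A request whose Connection header was rewritten to keep-alive fails the first check, producing exactly the kind of error shown above.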

There were two ways this could affect you. First, if your app ever responded with an HTTP/1.0 response containing Connection: keep-alive, then every hop from the user to your app would leave a downgraded connection idling in its pool. Most of these hops are pooled only for your app, so apps serving mixed HTTP/1.0 and HTTP/1.1 traffic could see this issue more frequently.

A bigger problem is the Edge <> Multihop connection. Since this pools connections across multiple apps, your requests could be affected by other apps returning this rather niche response and poisoning the connection pool.

Finally pinning this issue down cleared up a lot of our confusion. It now makes sense why some people, in some places, would see it more frequently, and why deploying the proxy would magically fix everything for a while. This problem would only noticeably occur when:

  1. Another app returns an HTTP/1.0 response (in 2026!) that has an explicit keep-alive header.
  2. You share a multihop connection pool with that app (these are sharded).
  3. That app returns its response along the same network path that one of your users takes to reach you.
  4. The connection stays idle in the pool, and by chance is the connection your request grabs out of the bag.
  5. Your request also, unluckily, is a WebSocket upgrade.
  6. And finally, your server strictly validates this (most do, some don’t!).

Thankfully, while finding this was rather annoying, the fix is easy. If an app sends a response that is not HTTP/1.1 on a connection from the HTTP/1.1 pool, we burn that connection and don’t put it back in the pool :slight_smile:
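
Sketched in miniature (ours, not the proxy's actual code), the new recycling rule is simply:

```python
# Miniature sketch of the fix: only an HTTP/1.1 response lets a
# connection back into the HTTP/1.1 pool; anything else gets closed
# ("burned") instead of reused.
def conn_close(conn):
    pass  # hypothetical stand-in for actually closing the socket

def recycle(conn, response_version, pool):
    if response_version == "HTTP/1.1":
        pool.append(conn)
    else:
        conn_close(conn)  # burn it; never pool a downgraded connection
```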

This also highlighted a few other things we should improve in this request flow, so you’ll likely hear more on that topic in the future!
