Issue with flaky WebSocket connections in sshx.io

Hi, I’ve been a big fan of Fly.io for a while. I have run sshx.io on the platform since 2022, across multiple regions, using TLS ALPN and private networking so the machines can communicate with each other. sshx is a free, popular open-source project (6000 stars). Here is my services config:

  [[services.ports]]
    port = 80
    handlers = ["http"]
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ["tls"]
    [services.ports.tls_options]
      alpn = ["h2", "http/1.1"]

Since today, users have been reporting that WebSocket connections fail repeatedly before eventually succeeding. You run sshx, get a link to your remote terminal, and the page loads. But the WebSocket connection that the page then opens takes anywhere from 5 to 20 attempts before it finally succeeds. A screenshot is below.

I can’t reproduce this on local hardware, even running the exact same software. I also went into the nodes and checked that their network connectivity looks good. That makes me think this is an issue at the Fly edge level, in the ALPN / HTTP/1.1 / HTTP/2 handlers that I’m using.

Was there any change to these systems lately?

Sorry if this is a false report, but I’m kind of stumped: the issue only appears on Fly.io, and I’ve exhausted all the other possible sources I can think of. I also don’t see anything helpful in my metrics or logs.

Edit: Upon investigating further, the logs are a red herring. The log lines pasted below were already occurring before this report. However, after deploying Version 0.3.1 of sshx-server, it seems to be working again, so the issue seems to originate from my change in Version 0.4.0. Still not sure why I can’t reproduce it locally though…

One potential breadcrumb is these lines in my logs:

2025-02-12T17:53:33.825 proxy[3287239ae94518] dfw [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=server->client, op=read, error=Connection reset by peer (os error 104))

2025-02-12T17:53:33.838 proxy[e28673eec17578] sjc [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=server->client, op=read, error=Connection reset by peer (os error 104))

2025-02-12T17:53:33.865 app[7811116f9e7248] fra [info] 2025-02-12T17:53:33.865220Z ERROR tower_http::trace::on_failure: response failed classification=Code: 5 latency=187 ms

2025-02-12T17:53:33.892 proxy[e28673eec17578] sjc [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=client->server, op=write, error=Broken pipe (os error 32))

2025-02-12T17:53:33.991 proxy[7811116f9e7248] fra [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=server->client, op=read, error=Connection reset by peer (os error 104))

2025-02-12T17:53:34.062 app[e28673eec17578] sjc [info] 2025-02-12T17:53:34.061825Z ERROR tower_http::trace::on_failure: response failed classification=Code: 5 latency=133 ms

The weird thing about these lines is that “Code: 5” is a gRPC status code (5 is NOT_FOUND). I’m running gRPC and HTTP/WebSocket on the same port 443, and I steer each request to the correct service based on whether the Content-Type header value is application/grpc. So I don’t understand why I would be getting this Code: 5 at all. Could Fly.io have an issue where the edge nodes incorrectly cache HTTP headers when I establish a new connection with HTTP/1.1? Or did WebSocket handling change in some other way recently?
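For readers following along, the Content-Type steering described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code from sshx-server; the names `Backend`, `steer`, and `is_grpc_content_type` are hypothetical, and a real server would read the header from its HTTP library's request type.

```rust
/// Which service a request on the shared port is handed to (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Backend {
    Grpc,
    Http,
}

/// Returns true if the header value names a gRPC media type,
/// e.g. "application/grpc" or "application/grpc+proto".
fn is_grpc_content_type(value: &str) -> bool {
    // Look only at the media type, ignoring parameters such as "; charset=...".
    let media_type = value
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    media_type == "application/grpc" || media_type.starts_with("application/grpc+")
}

/// Picks a backend from the raw Content-Type header value, if present.
/// Anything that is not clearly gRPC falls through to the HTTP/WebSocket service.
fn steer(content_type: Option<&str>) -> Backend {
    match content_type {
        Some(value) if is_grpc_content_type(value) => Backend::Grpc,
        _ => Backend::Http,
    }
}
```

With steering like this, a WebSocket handshake normally carries no gRPC Content-Type, so it would be expected to reach the HTTP side. That is why a gRPC status code in the logs looks surprising here.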


Here is the relevant code that handles the steering between HTTP/gRPC: sshx/crates/sshx-server/src/listen.rs at 6b0e0aeee311844b9334afcd5e3d63b1c8c87e3b · ekzhang/sshx · GitHub

Resolved. Sorry, it was an RFC 8441 (WebSockets over HTTP/2) issue. >:(
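For context, RFC 8441 defines how a WebSocket is bootstrapped over HTTP/2. Instead of an HTTP/1.1 GET with an Upgrade: websocket header, the client sends an extended CONNECT request carrying a :protocol pseudo-header, and the server answers with a 2xx status rather than 101. The thread does not say what exactly went wrong, so the sketch below only illustrates the two handshake shapes a server behind an h2-capable edge may need to recognize. The `RequestInfo` type and function names are hypothetical.

```rust
/// Minimal view of an incoming request, for illustration only (hypothetical type).
struct RequestInfo<'a> {
    method: &'a str,
    /// Value of the `Upgrade` header (used by the HTTP/1.1 handshake).
    upgrade: Option<&'a str>,
    /// Value of the `:protocol` pseudo-header (HTTP/2 extended CONNECT, RFC 8441).
    protocol: Option<&'a str>,
}

#[derive(Debug, PartialEq, Eq)]
enum WsHandshake {
    /// HTTP/1.1: `GET` with `Upgrade: websocket`.
    Http1Upgrade,
    /// HTTP/2: `CONNECT` with `:protocol = websocket`.
    Http2ExtendedConnect,
    NotWebSocket,
}

/// Case-insensitive check for the "websocket" token.
fn is_websocket(value: Option<&str>) -> bool {
    value.map_or(false, |v| v.trim().eq_ignore_ascii_case("websocket"))
}

/// Classifies a request as one of the two WebSocket handshake forms, or neither.
fn classify(req: &RequestInfo) -> WsHandshake {
    if req.method == "GET" && is_websocket(req.upgrade) {
        WsHandshake::Http1Upgrade
    } else if req.method == "CONNECT" && is_websocket(req.protocol) {
        WsHandshake::Http2ExtendedConnect
    } else {
        WsHandshake::NotWebSocket
    }
}
```

A handler that only checks for the first form would treat an RFC 8441 handshake as an ordinary request. Whether that matches the bug in Version 0.4.0 is not stated in the thread.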

