issue with flaky websocket connections in sshx.io

ekzhang · February 12, 2025, 5:56pm

Hi, I’ve been a big fan of fly.io for a while; I’ve run sshx.io on the platform since 2022. I use multiple regions and TLS with ALPN, plus private networking so the machines can communicate with each other. sshx is a free, popular open-source project (6000 stars).

  [[services.ports]]
    port = 80
    handlers = ["http"]
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ["tls"]
    [services.ports.tls_options]
      alpn = ["h2", "http/1.1"]

Since today, users have been reporting that WebSocket connections fail repeatedly before eventually succeeding. You run sshx, get a link to your remote terminal, and the page loads. But the WebSocket connection that the page opens then takes anywhere from 5 to 20 attempts to succeed. A screenshot is below.

I can’t reproduce this on local hardware, even running the exact same software. I also went into the nodes and checked that their network connectivity looks good, which makes me think this is an issue at the Fly edge, in the ALPN / HTTP/1.1 / HTTP/2 handlers that I’m using.

Was there any change to these systems lately?

Sorry if this is a false report, but I’m kind of stumped: this issue only appears on Fly.io, and I’ve exhausted all the other possible sources. I also don’t see anything helpful in my metrics or logs.

Edit: On further investigation, the logs pasted below are a red herring; they were already occurring before this report. However, after deploying version 0.3.1 of sshx-server, it seems to be working again, so the issue seems to originate from my change in version 0.4.0. Still not sure why I can’t reproduce it locally, though…

~~One potential breadcrumb is these lines in my logs:~~

2025-02-12T17:53:33.825 proxy[3287239ae94518] dfw [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=server->client, op=read, error=Connection reset by peer (os error 104))

2025-02-12T17:53:33.838 proxy[e28673eec17578] sjc [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=server->client, op=read, error=Connection reset by peer (os error 104))

2025-02-12T17:53:33.865 app[7811116f9e7248] fra [info] 2025-02-12T17:53:33.865220Z ERROR tower_http::trace::on_failure: response failed classification=Code: 5 latency=187 ms

2025-02-12T17:53:33.892 proxy[e28673eec17578] sjc [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=client->server, op=write, error=Broken pipe (os error 32))

2025-02-12T17:53:33.991 proxy[7811116f9e7248] fra [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=server->client, op=read, error=Connection reset by peer (os error 104))

2025-02-12T17:53:34.062 app[e28673eec17578] sjc [info] 2025-02-12T17:53:34.061825Z ERROR tower_http::trace::on_failure: response failed classification=Code: 5 latency=133 ms

The weird thing about these lines is that “Code: 5” is a gRPC status code (5 is NOT_FOUND). I’m running gRPC and HTTP/WebSocket on the same port 443, and I steer each request to the correct service based on whether its Content-Type header is application/grpc. So I don’t understand why I would be getting this Code: 5 at all. Could Fly.io have an issue where the edge nodes incorrectly cache HTTP headers when I establish a new connection with HTTP/1.1? Or did WebSocket handling change in some other way recently?
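To make the steering described above concrete, here is a minimal sketch of routing by Content-Type. It is not the code from sshx-server: the `Backend` enum and the `steer` and `is_grpc_content_type` functions are hypothetical names, and the sketch only assumes that gRPC requests carry a Content-Type beginning with application/grpc while everything else (including a WebSocket handshake, which normally has no Content-Type) goes to the HTTP service.

```rust
/// Which service a request should be handed to (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Backend {
    Grpc,
    Http,
}

/// Returns true for "application/grpc", optionally followed by a
/// "+subtype" suffix (e.g. "+proto") or ";parameters".
fn is_grpc_content_type(value: &str) -> bool {
    let value = value.trim().to_ascii_lowercase();
    match value.strip_prefix("application/grpc") {
        Some(rest) => rest.is_empty() || rest.starts_with('+') || rest.starts_with(';'),
        None => false,
    }
}

/// Pick a backend from the request's Content-Type header value, if any.
/// Requests without a Content-Type fall through to the HTTP service.
fn steer(content_type: Option<&str>) -> Backend {
    match content_type {
        Some(value) if is_grpc_content_type(value) => Backend::Grpc,
        _ => Backend::Http,
    }
}
```

With this kind of rule, any request that reaches the gRPC service without matching a registered method would be answered with a gRPC status rather than a plain HTTP error, which is one way a gRPC code could show up in logs for traffic that was meant for the HTTP side.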

ekzhang · February 12, 2025, 6:02pm

Here is the relevant code that handles the steering between HTTP and gRPC: crates/sshx-server/src/listen.rs in ekzhang/sshx on GitHub, at commit 6b0e0aeee311844b9334afcd5e3d63b1c8c87e3b.

ekzhang · February 12, 2025, 8:38pm

Resolved. Sorry, it was an RFC 8441 issue. >:(
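For readers unfamiliar with it, RFC 8441 defines how WebSockets are bootstrapped over HTTP/2: instead of the HTTP/1.1 `GET` with an `Upgrade: websocket` header answered by 101, the client sends an extended `CONNECT` request with a `:protocol` pseudo-header of `websocket`, and success is signaled with a 2xx status. The thread does not say exactly what went wrong in sshx, so the following is only a background sketch of how the two handshakes differ. The `Request` struct and function names are hypothetical, and the HTTP/1.1 check is simplified (it ignores the `Connection` header and multi-valued `Upgrade` lists).

```rust
/// Minimal view of an incoming request (hypothetical type for illustration).
struct Request<'a> {
    /// True if the request arrived over HTTP/2.
    http2: bool,
    method: &'a str,
    /// Value of the HTTP/2 `:protocol` pseudo-header, if present.
    protocol: Option<&'a str>,
    /// Value of the HTTP/1.1 `Upgrade` header, if present.
    upgrade: Option<&'a str>,
}

/// Returns true if the request looks like a WebSocket handshake.
fn is_websocket_handshake(req: &Request<'_>) -> bool {
    if req.http2 {
        // RFC 8441: extended CONNECT with `:protocol = websocket`.
        req.method == "CONNECT"
            && req.protocol.map_or(false, |p| p.eq_ignore_ascii_case("websocket"))
    } else {
        // RFC 6455: GET with `Upgrade: websocket` (simplified check).
        req.method == "GET"
            && req.upgrade.map_or(false, |u| u.eq_ignore_ascii_case("websocket"))
    }
}

/// Status code that signals a successful handshake for each protocol version.
fn success_status(http2: bool) -> u16 {
    if http2 { 200 } else { 101 }
}
```

A server or proxy that only recognizes the HTTP/1.1 form would treat the HTTP/2 `CONNECT` request as an ordinary (and probably unroutable) request, so a client that negotiates `h2` via ALPN could see its WebSocket handshake fail even though HTTP/1.1 connections work.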

system · February 19, 2025, 8:38pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
