Hi, I’ve been a big fan of fly.io for a while, and I’ve run sshx.io on the platform since 2022. I use multiple regions and TLS ALPN, along with private networking so the machines can communicate with each other. sshx is a free, popular open-source project (6000 stars). My services config:
[[services.ports]]
port = 80
handlers = ["http"]
force_https = true
[[services.ports]]
port = 443
handlers = ["tls"]
[services.ports.tls_options]
alpn = ["h2", "http/1.1"]
Since today, users have been reporting that WebSocket connections fail repeatedly before eventually succeeding. You run sshx, get a link to your remote terminal, and the page loads. But the WebSocket connection that the page opens then takes anywhere from 5 to 20 attempts before it finally succeeds. A screenshot is below.
I can’t reproduce this on local hardware, even running the exact same software. I also went into the nodes and checked that their network connectivity looks good. That makes me think this is an issue at the Fly edge level, in the ALPN / HTTP/1.1 / HTTP/2 handlers that I’m using.
Was there any change to these systems lately?
Sorry if this is a false report, but I’m kind of stumped since this issue only appears on Fly.io, and I’ve exhausted all the other possible sources of issues. I also don’t see anything that could help in my metrics or logs.
Edit: Upon investigating further, this is a red herring. The logs I pasted below were already occurring before this report. However, after deploying version 0.3.1 of sshx-server, it seems to be working again, so the issue seems to originate from my change in version 0.4.0. Still not sure why I can’t reproduce it locally though…
One potential breadcrumb is these lines in my logs:
2025-02-12T17:53:33.825 proxy[3287239ae94518] dfw [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=server->client, op=read, error=Connection reset by peer (os error 104))
2025-02-12T17:53:33.838 proxy[e28673eec17578] sjc [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=server->client, op=read, error=Connection reset by peer (os error 104))
2025-02-12T17:53:33.865 app[7811116f9e7248] fra [info] 2025-02-12T17:53:33.865220Z ERROR tower_http::trace::on_failure: response failed classification=Code: 5 latency=187 ms
2025-02-12T17:53:33.892 proxy[e28673eec17578] sjc [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=client->server, op=write, error=Broken pipe (os error 32))
2025-02-12T17:53:33.991 proxy[7811116f9e7248] fra [error] [PP02] could not proxy TCP data to/from instance: failed to copy (direction=server->client, op=read, error=Connection reset by peer (os error 104))
2025-02-12T17:53:34.062 app[e28673eec17578] sjc [info] 2025-02-12T17:53:34.061825Z ERROR tower_http::trace::on_failure: response failed classification=Code: 5 latency=133 ms
The weird thing about these lines is that “Code: 5” is a gRPC status code (5 is NOT_FOUND). I’m running gRPC and HTTP/WebSocket on the same port 443, but I’m steering them to the correct service based on whether the Content-Type header value is application/grpc. So I don’t understand why I would be getting this Code: 5 at all. Could Fly.io have an issue where the edge nodes are incorrectly caching HTTP headers when I establish a new connection with HTTP/1.1? Or did WebSocket handling change in any other way recently?
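To make the steering concrete, here is a minimal sketch of the kind of Content-Type check I mean. It is not the actual sshx-server code: `Backend`, `pick_backend`, and `is_grpc_content_type` are hypothetical names, and it only uses the standard library so it can be read on its own.

```rust
/// Which service a request on the shared port 443 should be handed to.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
enum Backend {
    Grpc,
    Web,
}

/// Returns true if a Content-Type header value denotes a gRPC request.
/// Assumption: "application/grpc" and "application/grpc+<codec>" count as gRPC,
/// while other types such as "application/grpc-web" do not.
fn is_grpc_content_type(value: &str) -> bool {
    // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    let media_type = value
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    media_type == "application/grpc" || media_type.starts_with("application/grpc+")
}

/// Picks a backend from the (optional) Content-Type header of a request.
/// Anything that is not clearly gRPC falls through to the HTTP/WebSocket side.
fn pick_backend(content_type: Option<&str>) -> Backend {
    match content_type {
        Some(value) if is_grpc_content_type(value) => Backend::Grpc,
        _ => Backend::Web,
    }
}
```

With a check like this, a WebSocket upgrade request (a GET that normally carries no Content-Type header) would go to `Backend::Web`, which is why a gRPC status on those requests surprised me.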