I see there was an outage reported related to UDP around 15 hours ago, but I see no mention of TCP, so I figured I’d create another thread. It went down around the same time. Basically the client connects to the app’s external IP seemingly just fine, but that’s it. The app does not receive any connection request; the logs only show the periodic health checks on the TCP port. This leaves the clients in an awkward state where they successfully connect to the TCP port and then just wait forever for the hello packet from the server… Is Fly aware this is happening, and are there any plans for a fix? This would be quite annoying if we were running production services here. Worth noting that over the last few days we also experienced a few random connection resets on those TCP connections, sometimes logged as ECONNRESET in the app logs and sometimes not.
Hi! Are you still having problems with this, or are you just talking about last night? Either way, could you share what kind of client application you’re using to connect to your Fly app, and what the results are when you run it?
Even if it’s just hanging when you try to connect, this information will help us understand what (and where) you’re running into problems.
Hey, I was still having the issue 100% of the time when creating this post, but right now it seems to have started working fine again, like before.
The application is a native video game application that connects over TCP to the masterserver and awaits the hello message from it, after which it would attempt to authenticate and then expect to receive some other data.
Looking at a Wireshark capture, it did seem that while the issue was occurring (from last night until somewhere around when this post was created), a connection to your edge would be established, but that connection would never be proxied through to the app on Fly (its logs showed no client connection).
As such the game client would hang forever, since it expects the hello message immediately after the TCP connection is established. Its own connection timeout never kicks in, because it only reacts to an explicit disconnection when something is broken. Besides figuring out what went wrong, I think Fly should just close the connection if it can never reach the app, instead of keeping it open forever; otherwise the client cannot even try to recover until the user manually kills the process.
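For anyone hitting the same thing, the client-side mitigation is basically a read deadline right after connecting, so the client gives up if the hello never arrives. A minimal sketch in Go (our real client is native code, and the address, port, and buffer size here are made up purely for illustration):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func connectToMaster(addr string) error {
	// The Fly edge accepts the TCP connection even when the proxy cannot
	// reach the app, which is exactly the failure mode described above.
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Give up if the hello message does not arrive within a bounded time,
	// instead of waiting for a disconnect that may never come.
	if err := conn.SetReadDeadline(time.Now().Add(15 * time.Second)); err != nil {
		return err
	}

	hello := make([]byte, 512)
	n, err := conn.Read(hello)
	if err != nil {
		return fmt.Errorf("no hello from server, giving up: %w", err)
	}
	fmt.Printf("got %d-byte hello, proceeding to authenticate\n", n)
	// ... authentication and the rest of the protocol would follow here ...
	return nil
}

func main() {
	// Hypothetical address; replace with the real master server endpoint.
	if err := connectToMaster("example-app.fly.dev:5000"); err != nil {
		fmt.Println("connect failed:", err)
	}
}
```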
And I’m pretty sure this was a routing issue on Fly’s side, since the app was still logging hits from the health checks on the same TCP port, so the app itself was working fine during that time. And even now it has started receiving connections again without any restart.
Btw, on an unrelated note: overall the raw TCP connection feature on Fly is nice, but I feel it needs a bit more documentation. The PROXY protocol and the ability to switch it to v2 are barely documented (v2 support is only mentioned in some forum post), and it’s pretty important, since without it the app does not see the real source IP; that fact isn’t mentioned anywhere either. Just my 3 cents here
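To make the PROXY protocol point concrete for anyone else landing here: with the proxy_proto handler enabled, the app has to read and strip the PROXY header itself before it starts speaking its own protocol, and that header is where the real client IP lives. A rough Go sketch for the v1 text header (v2 is a binary format and needs a proper parser; the listen port is just an example):

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

func handleConn(conn net.Conn) {
	defer conn.Close()
	r := bufio.NewReader(conn)

	// PROXY protocol v1 header is a single text line:
	// "PROXY TCP4 <src-ip> <dst-ip> <src-port> <dst-port>\r\n"
	line, err := r.ReadString('\n')
	if err != nil || !strings.HasPrefix(line, "PROXY ") {
		// Without the header, the only source IP visible is the edge's.
		// This sketch just bails out in that case.
		fmt.Println("no PROXY header on this connection")
		return
	}
	fields := strings.Fields(line)
	if len(fields) >= 6 {
		fmt.Printf("real client address: %s:%s\n", fields[2], fields[4])
	}

	// ... continue reading the actual application protocol from r ...
}

func main() {
	ln, err := net.Listen("tcp", ":5000") // example port only
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handleConn(conn)
	}
}
```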
It also now occurred to me that Fly is supposed to have a 60-second timeout after which an idle connection is dropped (which I don’t think is documented either, btw), and that clearly wasn’t happening here during the outage, which also indicates something went wrong.
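Side note for others with long-lived TCP connections like ours: if that ~60-second idle timeout is real, the usual way to keep a quiet connection open is an application-level heartbeat. An entirely illustrative Go sketch (the interval and "PING" payload are made up; a real protocol would use its own keepalive message):

```go
package heartbeat

import (
	"net"
	"time"
)

// KeepAlive writes a small message periodically so the connection never
// sits idle long enough for an intermediate proxy to drop it.
func KeepAlive(conn net.Conn, stop <-chan struct{}) {
	ticker := time.NewTicker(20 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Any traffic counts as activity and resets the idle timer.
			if _, err := conn.Write([]byte("PING\n")); err != nil {
				return // connection is gone; let the caller clean up
			}
		case <-stop:
			return
		}
	}
}
```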
I do not believe this was an issue with our proxy. The outage yesterday only affected UDP Anycast and DNS lookups to external DNS servers.
Our internal proxy logs show timeouts trying to send data from the app around 23:00 UTC yesterday. The VM was replaced a couple of hours later, and I don’t see any timeouts after that.