My app just went down in the EU / AMS region and I am seeing the following logs:
2023-08-16T16:02:19.844 proxy[e2860eec94d538] ams [error] timed out while connecting to your instance. this indicates a problem with your app (hint: look at your logs and metrics)
And on my personalized status page I see this:
We are performing maintenance on a host some of your apps instances are running on. Apps may be unavailable until the maintenance is completed.
This makes sense since the machine in ams was stopped, I’m assuming by the “maintenance”, but it wasn’t actually removed from the proxy, meaning my application was down even though there are 3 more perfectly fine machines to handle the traffic. This isn’t what I expected, and it is very disconcerting that “maintenance” happens all of a sudden without any notification and that traffic is still routed to the machine being maintained… What is happening, and can machines under maintenance please be removed from the proxy? From my point of view there is nothing I can do, since as far as I can tell I can’t destroy or remove the machine myself.
Edit: I did manage to destroy the machine in the AMS region, but the logs saying this “indicates a problem with your app” keep pouring in, even though the other regions operate just fine. Is there a way to signal the proxy to sync up with the state of my app?
Edit 2: Everything seems to be functioning again, but I would love a reply, since this is of course not why I have multiple machines running: having unscheduled maintenance on a single machine knock my app offline. I thought that was the strength of Fly: one region goes down, another seamlessly takes over!
I noticed that your app is using TCP services (protocol = "tcp"), so our load balancer is routing TCP connections to your app instances without any knowledge of what those TCP connections are used for (maybe HTTP requests, maybe something else). Once a TCP connection is opened on an instance it gets locked to that instance, and packets will just get lost if the underlying instance becomes unavailable. In other words, the load balancer can’t migrate established TCP connections like it can route individual requests for HTTP services (protocol = "http"), because it doesn’t know what your application is doing with its opaque TCP stream.
If you want a TCP app to be resilient to single-instance unavailability, you’ll either need to set up standard retry logic in your clients (to establish new TCP connections when they are closed or time out), or use HTTP load balancing if possible (protocol = "http"), which will dynamically route individual HTTP requests (not long-lived TCP connections) to healthy instances.
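For illustration, here’s roughly what a plain TCP service looks like in fly.toml — the app name and ports below are placeholders, not taken from your actual config:

```toml
# Illustrative sketch only; app name and ports are placeholders.
app = "example-app"

[[services]]
  internal_port = 8080   # port your process listens on inside the machine
  protocol = "tcp"       # the proxy forwards raw TCP and can't see individual requests

  [[services.ports]]
    port = 443           # public port; with no handlers the byte stream is passed through untouched
```

With a setup like this, each accepted connection stays pinned to whichever instance it was routed to, which is why packets are simply lost if that instance becomes unavailable mid-connection.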
When talking about long-lived TCP connections, I can understand why what happened happened.
However, my application uses TCP services while serving HTTPS connections only, so connections are pretty short-lived. Unfortunately, my use case requires me to serve my own certificates, so the Fly proxy is not one I can rely on for TLS, and protocol = "http" is therefore out of the question.
So, with these parameters known (I am serving HTTPS traffic over short-lived TCP connections), it makes no sense to me that when an instance is taken offline by Fly for whatever reason it isn’t also removed from rotation in the Fly proxy, so that new connections get routed to healthy instances instead, and that I had to destroy the machine manually for that to happen. Or am I totally missing the point here?
(oops, I meant handlers = ["http"] on the service port, not protocol = "http" on the service, but it’s not relevant anyway)
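To make that concrete, the HTTP-routed variant would look something like the sketch below (ports are illustrative, and it only works if the Fly proxy terminates TLS, which I realize conflicts with your custom-certificate requirement):

```toml
# Hypothetical HTTP-routed variant -- requires letting the Fly proxy terminate TLS.
[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]  # proxy terminates TLS and routes each request to a healthy instance
```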
As far as I can tell, the load balancer behaved as you expected (routing connections to other instances) during the ~7 minutes the host was unavailable for maintenance (from 2023-08-16T15:46:00Z to 2023-08-16T15:53:00Z).
I’m taking a closer look into what appears to be a bug that caused the machine instance to remain stopped, and prevented it from starting back up after the host maintenance was completed. This seems to have caused the timeout errors and unexpected state you saw, and we’ll try to put together a fix for this issue soon so it doesn’t happen again.
Okay, I might have misread the situation: since I still had the status page entry up, I assumed maintenance was still ongoing, which is my bad. But I’m happy to hear the maintenance itself worked as I would have expected, and that it surfaced a bug in the process.
Still, regardless of the fact that the machine should have started again, traffic should never be routed to it while it is in the stopped state, no matter what caused it to end up in that state; maintenance or me stopping the machine myself should have had the same result, right?