Lingering connections and ghost VMs

We've been having an issue with a production app (a Phoenix LiveView app) since yesterday. Suddenly the max concurrency limit was reached:

2022-09-13T12:16:00.711 proxy[7fa6e5c2] iad [warn] Instance reached connections hard limit of 25

Since the app doesn't have scaling set up, I'm guessing the proxy started bouncing/queuing all the connections, effectively making the app unreachable.

Because the load is very light, I upped the connection hard limit and requested a redeploy, but I immediately noticed a steady increase in the number of connections even when the app wasn't being used, to the point where I had to restart the app after a few hours because it had hit the new limit. Additionally, ghost VMs were showing up in the metrics tab (with maxed-out connections).

The puzzling part is that we haven't pushed any new code, and the behavior started suddenly yesterday. I increased the hard limit yet again and requested a new deploy, but the connections keep growing at a steady pace, even during the app's off hours.

Currently there are two VMs showing in the metrics:

Zombie VM: 6a99f75f - iad

Reachable VM: 9ca7bf8a - iad

This application never had a problem with the default hard limit of 25. I can't rule out an error in our configuration, but it's still very puzzling to have this happen with no recent deploys.

Any help would be much appreciated, because we're currently at a point where I have to manually restart the app whenever it approaches the set hard limit.


Where are these requests/connections coming from? Are they legitimate clients? If not, you'd probably need wire-level logs to see what's going on.
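One cheap check before reaching for wire-level logs: attach an IEx session to the running node and ask ranch how many connection processes the HTTP listener is actually holding, then compare that against the proxy's count. A rough sketch (the listener ref is a guess based on a standard Phoenix + Cowboy setup; :ranch.info() lists the real refs on the node):

```elixir
# Run inside an IEx session attached to the running node.
# MyAppWeb.Endpoint.HTTP is the conventional ranch listener ref for a
# Phoenix + Cowboy endpoint; adjust it to whatever :ranch.info() reports.
ref = MyAppWeb.Endpoint.HTTP

# List the connection processes the listener currently supervises.
connections = :ranch.procs(ref, :connections)
IO.puts("connection processes held by the listener: #{length(connections)}")
```

If that number is much lower than what the proxy reports, the lingering connections aren't reaching the app at all.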

Looks like Fly is removing these on a case-by-case basis. The word at the time was that Fly apps using release_command were the ones turning into zombies: Interrupt app startup - #10 by eli

Thanks. I'm still investigating and will soon redeploy with more detailed logging to get to the bottom of what's going on. It seems to be a case of already-open tabs reconnecting on a new socket while the previous one somehow doesn't get dropped? The client side only shows one active connection.
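For the extra logging, this is roughly what I have in mind: a Phoenix.Presence sketch that counts the LiveView sockets the app itself considers alive, so I can compare that number against what the proxy reports. Module and topic names below are placeholders, not our real code.

```elixir
# Sketch only: names are placeholders, and MyAppWeb.Presence must be started
# in the application's supervision tree next to the PubSub and Endpoint.
defmodule MyAppWeb.Presence do
  use Phoenix.Presence,
    otp_app: :my_app,
    pubsub_server: MyApp.PubSub
end

defmodule MyAppWeb.SomeLive do
  use Phoenix.LiveView
  require Logger

  @topic "sockets:live"

  @impl true
  def mount(_params, _session, socket) do
    if connected?(socket) do
      # Presence drops the entry when this LiveView process dies, so the count
      # below reflects the sockets the BEAM still considers alive.
      {:ok, _} =
        MyAppWeb.Presence.track(self(), @topic, socket.id, %{
          joined_at: System.system_time(:second)
        })

      Logger.info("in-app live socket count: #{map_size(MyAppWeb.Presence.list(@topic))}")
    end

    {:ok, socket}
  end

  @impl true
  def render(assigns), do: ~L"<div>placeholder</div>"
end
```

If that count stays flat while the proxy's connection graph keeps climbing, the lingering connections are being held somewhere outside the LiveView processes.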

We do use release_command for running migrations, and when I get the status of the zombie VM I do get some info:

But trying to stop the VM through flyctl returns this:
Error failed to stop allocation: You hit a Fly API error with request ID: 01GCZ3RGC8WK9B1DADQKBBRM7S-scl

Do I need to post on that thread so someone at Fly can kill it manually?


I'll take the liberty of at'ing @eli (sorry, Eli) on your behalf. It could be (?) that the zombie VM issue is unrelated to the connection pile-up you're seeing.

Sockets? As in websockets? So something keeps ping/pong'ing when it previously didn't, keeping the socket alive? Interesting that one would see such an increase without any change in code or client count. Fly employs some LiveView / Elixir experts, so I'd defer this question to them.
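If it does turn out that the server side is hanging on to half-open sockets, one knob worth checking is the websocket idle timeout on the LiveView socket in the endpoint. A rough sketch, not your actual config (module name, session options and the 45_000 value are placeholders):

```elixir
# Sketch only: not the real endpoint; the websocket :timeout option is the
# only part that matters here.
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  @session_options [store: :cookie, key: "_my_app_key", signing_salt: "change_me"]

  # :timeout is the transport's idle timeout: if no frames (including the
  # client's periodic heartbeat) arrive within that window, the server closes
  # the websocket instead of keeping it open indefinitely.
  socket "/live", Phoenix.LiveView.Socket,
    websocket: [connect_info: [session: @session_options], timeout: 45_000]

  # ...the usual plugs and the /socket mount would follow here...
end
```

The default timeout is 60s and the LiveView JS client heartbeats roughly every 30s, so healthy tabs stay connected while silent sockets get reaped. If the count still climbs with a tight timeout, the pile-up is more likely on the proxy side than inside the BEAM.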

Hi! So taking a look at the logs and the fly status output for 6a99f75f, it really doesn't seem like that VM actually exists. As you've probably noted, its status is complete according to Nomad. What's more, that request ID for the API also tells us that the VM can't be found, returning a 404.

So I don’t think that the zombie vm in the metrics and the number of app connections are related, primarily because I can’t find any evidence that this app instance exists.

I'm fairly certain there's enough metadata missing for this app instance that its connections wouldn't count towards your app's connections, even in the unlikely case that there were a route to it. I can understand why you'd want to make sure it wasn't a contributing factor, though :sweat_smile:

On our end, we’ll take care of 6a99f75f as soon as we’re able, and give the load-balancing components a closer look while we’re at it. We’ll let you know if we find anything interesting!

Please feel free to update this thread with anything you’re curious about while you’re investigating. We’re also always happy to look over any information you’ve gathered while checking your app out :slightly_smiling_face:

Sorry for the confusion; I didn't mean to imply that the ghost VM's connection count was being added to the active VM's count. I mentioned the two issues because they happened to correlate in the timeline: right after the first redeploy I did to raise the hard limit and resolve the outage, I started seeing the ghost VM in the metrics.

While investigating, I found that most of the traffic was coming from our users in Toronto and Montreal, but our app was running in Fly's iad region (both the app and Postgres).

So I moved the app to a multi-instance (non-clustered) setup using only yul and yyz. This mitigated the issue by A LOT.


Blue, yellow, and green are the ghost VMs that either reached or came very close to the concurrency hard limit. The connections in the new regions were still trending up (unrelated to usage or the number of users), but at a drastically slower rate.

This leads me to believe the issue is related to connection stability because, as I mentioned, we hadn't pushed any new code.

However, the plot thickens. Although we hadn't pushed any code, I remembered we have had some issues with LiveView: a weird, random behavior our app experiences because of it. It was never a cause of lingering connections, though, nor of the default hard limit of 25 being reached. I can't rule out it being related, but as annoying as it is, it's very random and doesn't happen at a rate that could explain the lingering connections.

The symptom is that, at random, an active page will refresh on its own. Or sometimes, when the tab holding the application goes out of focus, clicking it back into focus triggers a refresh. This isn't related to Fly.io, as it happens in our tests on Gigalixir too, but it does seem to happen more often (though not exclusively) when connecting from the Rogers ISP in Canada. We've tried varying timeout values to no avail. I would really appreciate any insight into why this might be happening. We're on Phoenix LiveView 0.15.7 (I know, it was a risk, but LiveView was just too amazing to wait; we'll definitely update).
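In case it helps anyone spot a pattern, this is roughly the instrumentation I'm adding to line those spontaneous refreshes up against mounts on the server. It's a sketch: the module name is made up, and I'm assuming the standard LiveView telemetry events are available on 0.15.7 (worth double-checking the docs for that version).

```elixir
# Sketch: module name is made up; assumes LiveView's [:phoenix, :live_view, ...]
# telemetry events. Call MyApp.LiveViewAudit.attach() once from Application.start/2.
defmodule MyApp.LiveViewAudit do
  require Logger

  def attach do
    :telemetry.attach_many(
      "liveview-audit",
      [
        [:phoenix, :live_view, :mount, :start],
        [:phoenix, :live_view, :mount, :stop]
      ],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(event, _measurements, %{socket: socket}, _config) do
    # Every mount (both the initial dead render and each live mount/remount)
    # shows up here with a timestamp and whether the socket is connected, so
    # unexpected reconnects can be correlated with the connection graphs.
    Logger.info(
      "liveview #{inspect(event)} view=#{inspect(socket.view)} " <>
        "connected=#{Phoenix.LiveView.connected?(socket)}"
    )
  end
end
```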

To take things into the twilight zone (at least for me), I've kept monitoring the metrics so I can scale up and then kill whichever VM is getting close to the hard limit, and now I see this:


Maybe something changed in Fly's proxy? I did not trigger a restart.

Where could I find specifics on how those metrics are sourced?
