It looks like we may have an old instance that was never shut down, which is still processing jobs from our job queue.
Background:
We have a Phoenix app that uses the Oban job queue. We recently noticed that, in the oban_jobs table in Postgres, the “attempted_at” timestamp is sometimes almost 2 minutes earlier than the “scheduled_at” timestamp, which indicated clock drift. We logged into the 2 instances currently in our cluster + the database server and the system times appear to be in sync. Then we checked the “attempted_by” column and found that the the rows with “attempted_by” timestamps in the past always have the same instance ID: production@fdaa:0:309a:[…]:55b5:2. These started on Oct. 28 and the drift has kept increasing.
So, is it possible to check to see if there’s some instance out there that wasn’t shut down properly?
Thanks @Mark actually this turned out to be an issue on your end: a “zombie” instance that was left running after your Oct. 27 outage. It did not show up in our dashboard. I contacted support via email and Kurt eventually diagnosed the problem and shut down the instance. However, we still have a leftover issue: our logs are being flooded with messages like this:
[libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
…which is again the IP of the “zombie” instance. Again I don’t believe this is anything we can resolve on our end. I’ve gone back and forth with your support and hopefully they’re looking into it now.
I’m glad Kurt was able to help you resolve the zombie instance issue!
As for this:
This is normal and should stop after a couple minutes. This is libcluster trying to re-establish a connection from a running node to the lost/down node. After a period of time it stops trying.
You will see messages like this when you deploy a new release of your app because old nodes (and addresses) go away from the cluster.
Hopefully that helps you feel more comfortable with these non-error messages.
@Mark sorry if I wasn’t clear, but these messages have been occurring for months, with the same IPv6 address - specifically, the IPv6 address of the specific “zombie” instance that Kurt killed over a month ago. Here’s the context:
Thanks for the additional background! That is unusual. I haven’t seen that before.
I see from your log display that libcluster is trying to connect to the node that was removed. Because of the fly6pn topology, I’m assuming you are using the Cluster.Strategy.DNSPoll strategy. If that’s true, then I doubt the project has any nodes explicitly configured into it like the docs show is possible. That just wouldn’t make sense.
Are all the nodes identical Elixir apps? Or are there multiple apps in the same Fly organization?
Wondering if this issue is relevant to your situation.
@Mark the nodes are identical Elixir apps. We have other apps running in the same organization, but these have different names. The “zombie” instance was part of the cluster for our production application when it was alive. Even when it was no longer visible In the console, it was processing Oban jobs so it had the correct DB credentials, etc.
I also assume that the app has been deployed at least one since the zombie instance was shutdown? Have you tried doing a fly apps restart <APP_NAME>. The idea being to prod them to forget the removed node.
I’m not aware of anything from the Fly-side that could or would keep around Elixir’s awareness of a removed VM. I suspect it must be something from within the app. Yes, there was a problem where a rogue VM was left around, but with that gone, I can’t think of anything that would make an Elixir node still try to connect to that instance. Can you?
If you iex into a running instance, and run Node.list(), it lists all the other nodes known to the one you are connecting to. If the zombie node is still listed then it must somehow be running and connected… which doesn’t make sense. I’m stumped.
@Mark the zombie instance ID from the log messages doesn’t show when I do Node.list(). We have two VMs running, and when I do Node.list() on one, it only shows the other one.
$ fly ssh console --app production
Connecting to fdaa:0:309a:a7b:92:cf73:54ce:2... complete
/ # app/bin/enaia remote
Erlang/OTP 25 [erts-13.0.4] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit]
Interactive Elixir (1.13.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(production@fdaa:0:309a:a7b:92:cf73:54ce:2)1> Node.list()
[:"production@fdaa:0:309a:a7b:9d36:e805:87c3:2"]
$ fly status --app production
[...]
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
cf7354ce app 1114 iad run running 1 total, 1 passing 0 54m14s ago
e80587c3 app 1114 iad run running 1 total, 1 passing 0 54m53s ago
The only other idea for a source that I have is possibly around the Oban jobs? I don’t have any production experience with Oban, but, is it possible that jobs logged by the old node, or worked on but errored, that the application code might use the source node name in some way? Like it’s trying to reach out to the node based on that information?
That behavior might be specific to the application.
Hope you had a good New Years! And may your logs be happy and clean this year.
Thanks Mark, I’m not sure that Oban has anything to do with this. I think the issue is that libcluster continues to believe that the now deleted VM is part of the cluster and keeps trying to connect to it.
Just to recap: we originally determined we had an issue when we noticed that Oban jobs were being processed by a node with a clock that was off by several minutes. At that point we asked for help, and Kurt eventually determined that there was a “zombie” instance resulting from a Fly outage, which he then cleaned up. That node seems to have been shut down - at least, it’s no longer processing Oban jobs.
I don’t know the details of how libcluster tracks nodes, but when that VM was shut down, libcluster didn’t recognize it and continues to try to connect to it. This continues to happen despite many restarts. The old node ID must be stored persistently somewhere?
I don’t think there’s anything unusual in our code - I posted everything in our application related to clustering and it seems completely standard. We don’t have the time to dig into the internals of libcluster to figure out how this could have occurred. So, we’re stumped and unless anyone has other ideas, I think the best we can do is to schedule some downtime and move to a new application.
A brief downtime would be to stop all instances and then bring them back up. If anything is somehow caching it in memory, then that should flush it out.
I’m not aware of libcluster doing anything to hold-on to that information though. Very weird. The only other idea is to fly ssh console into a node and do some non-Elixir networking checks. But I wouldn’t expect to find anything anyway.
@Mark we did a complete shutdown by scaling to 0 instances, confirmed that no VMs were running, then scaled back up. The logs still shows ongoing failed connections to production@fdaa:0:309a:a7b:4b:e84f:55b5:2. If that instance ID only exists in memory, I’m not sure how it would still be present after a complete shutdown. Let me know if you see anything wrong here.
$ fly scale count 0 --app production
Count changed to 0
$ fly status --app production
App
Name = production
Owner = enaia
Version = 1139
Status = dead
Hostname = production.fly.dev
Platform = nomad
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
$ fly scale count 2 --app production
Count changed to 2
$ flyctl logs -a production
[...]
2023-01-07T08:00:10Z app[abb388b3] iad [info]08:00:10.775 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:18Z app[66585a8e] iad [info]08:00:18.096 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:22Z app[abb388b3] iad [info]08:00:22.780 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:30Z app[66585a8e] iad [info]08:00:30.100 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:34Z app[abb388b3] iad [info]08:00:34.784 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:42Z app[66585a8e] iad [info]08:00:42.104 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
$ fly ips private -a production
ID REGION IP
19a979b3 iad fdaa:0:309a:a7b:9d36:19a9:79b3:2
8fa07628 iad fdaa:0:309a:a7b:92:8fa0:7628:2
$ fly dig fdaa:0:309a:a7b:9d36:19a9:79b3:2 -a production # GOOD IP
;; opcode: QUERY, status: NXDOMAIN, id: 65392
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; QUESTION SECTION:
;fdaa:0:309a:a7b:9d36:19a9:79b3:2. IN AAAA
;; AUTHORITY SECTION:
. 1 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2023010901 1800 900 604800 86400
;; ADDITIONAL SECTION:
;; OPT PSEUDOSECTION:
; EDNS: version 0; flags: ; udp: 4096
$ fly ping fdaa:0:309a:a7b:9d36:19a9:79b3:2 -a production # GOOD IP
35 bytes from fdaa:0:309a:a7b:9d36:19a9:79b3:2 (fdaa:0:309a:a7b:9d36:19a9:79b3:2), seq=0 time=74.6ms
35 bytes from fdaa:0:309a:a7b:9d36:19a9:79b3:2 (fdaa:0:309a:a7b:9d36:19a9:79b3:2), seq=1 time=76.7ms
^C
$ fly dig fdaa:0:309a:a7b:4b:e84f:55b5:2 -a production # BAD IP
;; opcode: QUERY, status: NXDOMAIN, id: 34893
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; QUESTION SECTION:
;fdaa:0:309a:a7b:4b:e84f:55b5:2. IN AAAA
;; AUTHORITY SECTION:
. 1 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2023010901 1800 900 604800 86400
;; ADDITIONAL SECTION:
;; OPT PSEUDOSECTION:
; EDNS: version 0; flags: ; udp: 4096
$ fly ping fdaa:0:309a:a7b:4b:e84f:55b5:2 -a production # BAD IP