Instance never shut down?

It looks like we may have an old instance that was never shut down, which is still processing jobs from our job queue.

Background:

We have a Phoenix app that uses the Oban job queue. We recently noticed that, in the oban_jobs table in Postgres, the “attempted_at” timestamp is sometimes almost 2 minutes earlier than the “scheduled_at” timestamp, which suggested clock drift. We logged into the 2 instances currently in our cluster plus the database server, and the system times appear to be in sync. Then we checked the “attempted_by” column and found that the rows where “attempted_at” precedes “scheduled_at” always have the same instance ID: production@fdaa:0:309a:[…]:55b5:2. These started on Oct. 28 and the drift has kept increasing.
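
For reference, a schemaless Ecto query along these lines (the repo module name and exact column handling are approximations) will surface the affected rows:

  import Ecto.Query

  # Rough check: find jobs attempted before their scheduled time and report
  # which node attempted them and how far apart the timestamps are.
  from(j in "oban_jobs",
    where: j.attempted_at < j.scheduled_at,
    select: %{
      attempted_by: j.attempted_by,
      drift_seconds: fragment("EXTRACT(EPOCH FROM (? - ?))", j.scheduled_at, j.attempted_at)
    },
    order_by: [desc: j.scheduled_at]
  )
  |> Enaia.Repo.all()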

So, is it possible to check to see if there’s some instance out there that wasn’t shut down properly?

Hi @enaia,

You should be able to see all of your instances in your Fly dashboard. You can switch organizations there as needed.

You can also run fly status for your “production” app to see whether you have multiple instances.

You can also match up the IP you see in your logs using: fly ips list

Under your application > Monitoring, you can also see each of your application’s instances.

Thanks @Mark. Actually, this turned out to be an issue on your end: a “zombie” instance that was left running after your Oct. 27 outage. It did not show up in our dashboard. I contacted support via email, and Kurt eventually diagnosed the problem and shut down the instance. However, we still have a leftover issue: our logs are being flooded with messages like this:

[libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"

…which is the IP of the “zombie” instance. Again, I don’t believe this is anything we can resolve on our end. I’ve gone back and forth with your support and hopefully they’re looking into it now.

Hi @enaia,

I’m glad Kurt was able to help you resolve the zombie instance issue!

As for the libcluster connection warnings:

This is normal and should stop after a couple of minutes. It’s libcluster trying to re-establish a connection from a running node to the lost/down node; after a while it stops trying.

You will see messages like this when you deploy a new release of your app because old nodes (and addresses) go away from the cluster.

Hopefully that helps you feel more comfortable with these non-error messages.

@Mark sorry if I wasn’t clear, but these messages have been occurring for months, always with the same IPv6 address: the address of the “zombie” instance that Kurt killed over a month ago. For context, here’s the original thread:

Instance never shut down?

Here are log messages from today, showing the same address:

Hi @enaia,

Thanks for the additional background! That is unusual. I haven’t seen that before.

I see from your log display that libcluster is trying to connect to the node that was removed. Because of the fly6pn topology, I’m assuming you are using the Cluster.Strategy.DNSPoll strategy. If that’s true, then I doubt the project has any nodes explicitly configured, as the docs show is possible; that just wouldn’t make sense here.
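
To be clear, by “explicitly configured” I mean something along the lines of the Epmd strategy example from the libcluster docs, where the host list is hard-coded (the topology name here is just the docs’ placeholder):

  config :libcluster,
    topologies: [
      example: [
        strategy: Cluster.Strategy.Epmd,
        # Hard-coded list of nodes to connect to
        config: [hosts: [:"a@127.0.0.1", :"b@127.0.0.1"]]
      ]
    ]

A DNSPoll setup discovers peers via DNS instead, so nothing should be pinned to an old address there.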

Are all the nodes identical Elixir apps? Or are there multiple apps in the same Fly organization?

Wondering if this issue is relevant to your situation.

@Mark the nodes are identical Elixir apps. We have other apps running in the same organization, but those have different names. The “zombie” instance was part of the cluster for our production application when it was alive. Even when it was no longer visible in the console, it was processing Oban jobs, so it had the correct DB credentials, etc.

Our config is pretty standard:

runtime.exs:

  app_name =
    System.get_env("FLY_APP_NAME") ||
      raise "FLY_APP_NAME not available"

  [...]

  config :libcluster,
    debug: true,
    topologies: [
      fly6pn: [
        strategy: Cluster.Strategy.DNSPoll,
        config: [
          # Poll <app>.internal every 5s to discover peer instances
          polling_interval: 5_000,
          query: "#{app_name}.internal",
          node_basename: app_name
        ]
      ]
    ]

application.ex:

    children = [
      [...]
      {Cluster.Supervisor, [topologies, [name: Enaia.ClusterSupervisor]]},
      [...]
    ]

mix.exs:

      releases: [
        enaia: [
           [...]
          cookie: "<static cookie>"
        ]
      ]

rel/env.sh.eex:

# Build the node name from the app name and the instance's private 6PN address
ip=$(grep fly-local-6pn /etc/hosts | cut -f 1)
export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=$FLY_APP_NAME@$ip

Dockerfile:

# Appended by flyctl
ENV ECTO_IPV6 true
ENV ERL_AFLAGS "-proto_dist inet6_tcp"

@enaia,

Yes, that all looks good and normal.

I also assume that the app has been deployed at least once since the zombie instance was shut down? Have you tried doing a fly apps restart <APP_NAME>? The idea being to prod the running nodes into forgetting the removed one.

Yes, the instance was shut down over a month ago and we typically deploy multiple times per day. We’ve restarted many times as well.

@enaia,

I’m not aware of anything on the Fly side that could or would preserve Elixir’s awareness of a removed VM. I suspect it must be something from within the app. Yes, there was a problem where a rogue VM was left around, but with that gone, I can’t think of anything that would make an Elixir node still try to connect to that instance. Can you?

If you iex into a running instance and run Node.list(), it lists all the other nodes known to the one you’re connected to. If the zombie node is still listed, then it must somehow be running and connected… which doesn’t make sense. I’m stumped.
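
It might also be worth checking the hidden and “known” node sets, since a node can be known to the runtime without being visibly connected; a quick sketch:

  Node.list()          # visible, connected nodes
  Node.list(:hidden)   # hidden connections (e.g. remote shells)
  Node.list(:known)    # connected nodes plus any node still referenced by
                       # pids, ports, or refs on this node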

@Mark the zombie instance ID from the log messages doesn’t show up when I run Node.list(). We have two VMs running, and when I do Node.list() on one, it only shows the other one.

$ fly ssh console --app production
Connecting to fdaa:0:309a:a7b:92:cf73:54ce:2... complete
/ # app/bin/enaia remote
Erlang/OTP 25 [erts-13.0.4] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit]

Interactive Elixir (1.13.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(production@fdaa:0:309a:a7b:92:cf73:54ce:2)1> Node.list()
[:"production@fdaa:0:309a:a7b:9d36:e805:87c3:2"]
$ fly status --app production
[...]
ID      	PROCESS	VERSION	REGION	DESIRED	STATUS 	HEALTH CHECKS     	RESTARTS	CREATED    
cf7354ce	app    	1114   	iad   	run    	running	1 total, 1 passing	0       	54m14s ago	
e80587c3	app    	1114   	iad   	run    	running	1 total, 1 passing	0       	54m53s ago

Hey @enaia,

The only other possible source I can think of is the Oban jobs? I don’t have any production experience with Oban, but is it possible that, for jobs enqueued by the old node, or attempted by it and errored, the application code uses the recorded node name in some way? Like it’s trying to reach out to that node based on that information?

That behavior might be specific to the application.
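
If you want to rule that out, a rough query like this (Oban’s attempted_by is an array column; the repo module name here is assumed) would count rows that still reference the phantom node:

  import Ecto.Query

  phantom = "production@fdaa:0:309a:a7b:4b:e84f:55b5:2"

  # Count oban_jobs rows whose attempted_by array still names the old node
  from(j in "oban_jobs",
    where: fragment("? = ANY(?)", ^phantom, j.attempted_by),
    select: count(j.id)
  )
  |> Enaia.Repo.one()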

Hope you had a good New Year’s! And may your logs be happy and clean this year. :slight_smile:

Thanks Mark, I’m not sure that Oban has anything to do with this. I think the issue is that libcluster continues to believe that the now deleted VM is part of the cluster and keeps trying to connect to it.

Just to recap: we originally determined we had an issue when we noticed that Oban jobs were being processed by a node with a clock that was off by several minutes. At that point we asked for help, and Kurt eventually determined that there was a “zombie” instance resulting from a Fly outage, which he then cleaned up. That node seems to have been shut down - at least, it’s no longer processing Oban jobs.

I don’t know the details of how libcluster tracks nodes, but when that VM was shut down, libcluster apparently never registered that it was gone and continues to try to connect to it. This continues to happen despite many restarts. The old node ID must be stored persistently somewhere?

I don’t think there’s anything unusual in our code - I posted everything in our application related to clustering and it seems completely standard. We don’t have the time to dig into the internals of libcluster to figure out how this could have occurred. So, we’re stumped and unless anyone has other ideas, I think the best we can do is to schedule some downtime and move to a new application.

A brief-downtime option would be to stop all instances and then bring them back up. If anything is somehow caching that node in memory, that should flush it out.

I’m not aware of libcluster doing anything to hold on to that information, though. Very weird. The only other idea is to fly ssh console into a node and do some non-Elixir networking checks. But I wouldn’t expect to find anything anyway.
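
One in-Elixir check you could also try from a remote iex session is to dump the state of the strategy worker(s) under your Cluster.Supervisor and see which nodes DNSPoll currently believes are in the topology. A sketch (the exact state shape is a libcluster implementation detail; the supervisor name comes from your application.ex):

  Enaia.ClusterSupervisor
  |> Supervisor.which_children()
  |> Enum.each(fn
    {_id, pid, _type, _modules} when is_pid(pid) ->
      # Dump the GenServer state of each strategy worker
      IO.inspect(:sys.get_state(pid), label: inspect(pid))

    _other ->
      :ok
  end)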

@Mark we did a complete shutdown by scaling to 0 instances, confirmed that no VMs were running, then scaled back up. The logs still show ongoing failed connections to production@fdaa:0:309a:a7b:4b:e84f:55b5:2. If that instance ID only exists in memory, I’m not sure how it would still be present after a complete shutdown. Let me know if you see anything wrong here.

$  fly scale count 0  --app production
Count changed to 0
$ fly status --app production
App
  Name     = production          
  Owner    = enaia               
  Version  = 1139                
  Status   = dead                
  Hostname = production.fly.dev  
  Platform = nomad               

Instances
ID	PROCESS	VERSION	REGION	DESIRED	STATUS	HEALTH CHECKS	RESTARTS	CREATED 

$ fly scale count 2  --app production
Count changed to 2

$ flyctl logs -a production
[...]
2023-01-07T08:00:10Z app[abb388b3] iad [info]08:00:10.775 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:18Z app[66585a8e] iad [info]08:00:18.096 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:22Z app[abb388b3] iad [info]08:00:22.780 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:30Z app[66585a8e] iad [info]08:00:30.100 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:34Z app[abb388b3] iad [info]08:00:34.784 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"
2023-01-07T08:00:42Z app[66585a8e] iad [info]08:00:42.104 [warning] [libcluster:fly6pn] unable to connect to :"production@fdaa:0:309a:a7b:4b:e84f:55b5:2"

@enaia,

Holy cow! :exploding_head: Yes, you scaled it to 0 and no instances were running.

Here’s another idea. Get a list of the private IPv6 addresses for all of your instances.

fly ips private

Then hopefully you can learn something from these tools:

fly dig <ip_address>
fly ping <ip_address>

You can test them against the IP of a known instance and against the phantom IP.

Holy cow! :exploding_head:

I know! :confused:

$ fly ips private -a production
ID      	REGION	IP                               
19a979b3	iad   	fdaa:0:309a:a7b:9d36:19a9:79b3:2	
8fa07628	iad   	fdaa:0:309a:a7b:92:8fa0:7628:2  	

$ fly dig fdaa:0:309a:a7b:9d36:19a9:79b3:2 -a production # GOOD IP
;; opcode: QUERY, status: NXDOMAIN, id: 65392
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;fdaa:0:309a:a7b:9d36:19a9:79b3:2.	IN	 AAAA

;; AUTHORITY SECTION:
.	1	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2023010901 1800 900 604800 86400

;; ADDITIONAL SECTION:

;; OPT PSEUDOSECTION:
; EDNS: version 0; flags: ; udp: 4096

$ fly ping fdaa:0:309a:a7b:9d36:19a9:79b3:2 -a production # GOOD IP
35 bytes from fdaa:0:309a:a7b:9d36:19a9:79b3:2 (fdaa:0:309a:a7b:9d36:19a9:79b3:2), seq=0 time=74.6ms
35 bytes from fdaa:0:309a:a7b:9d36:19a9:79b3:2 (fdaa:0:309a:a7b:9d36:19a9:79b3:2), seq=1 time=76.7ms
^C

$ fly dig fdaa:0:309a:a7b:4b:e84f:55b5:2 -a production  # BAD IP
;; opcode: QUERY, status: NXDOMAIN, id: 34893
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;fdaa:0:309a:a7b:4b:e84f:55b5:2.	IN	 AAAA

;; AUTHORITY SECTION:
.	1	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2023010901 1800 900 604800 86400

;; ADDITIONAL SECTION:

;; OPT PSEUDOSECTION:
; EDNS: version 0; flags: ; udp: 4096

$ fly ping fdaa:0:309a:a7b:4b:e84f:55b5:2 -a production # BAD IP

The ping of the bad IP hung indefinitely.
