[Elixir Phoenix] libcluster unable to connect anymore

I am no longer able to cluster my nodes using libcluster, and that includes the node I am currently on. I don’t believe I have made any change that would cause this. I went through the “Run an Elixir app” tutorial again to see if something had changed, and noticed some new config around IPv6 since the last time I had to configure libcluster on Fly. I made changes accordingly, and my config now seems to match the guide in every way.

I have tried deleting and recreating the app, as well as using different names, but to no avail.
(Also maybe worth noting: on some apps I can no longer call fly logs because it hangs forever.)

Logs:

app[da84b6a6] cdg [info]2022/07/01 21:13:54 listening on [<MY_CURRENT_IP_V6>]:22 (DNS: [fdaa::3]:53)
app[da84b6a6] cdg [info]Reaped child process with pid: 551, exit code: 0
app[da84b6a6] cdg [info]Reaped child process with pid: 573 and signal: SIGUSR1, core dumped? false
app[da84b6a6] cdg [info][libcluster:fly6pn] unable to connect to :"<MY_APP>@<MY_CURRENT_IP_V6>"
app[da84b6a6] cdg [info][libcluster:fly6pn] unable to connect to :"<MY_APP>@<ANOTHER_IP_V6>"
# ANOTHER_IP_V6 is probably an old address used for this app

In my runtime.exs:

  app_name =
    System.get_env("FLY_APP_NAME") ||
      raise "FLY_APP_NAME not available"

  config :libcluster,
    debug: true,
    topologies: [
      fly6pn: [
        strategy: Cluster.Strategy.DNSPoll,
        config: [
          # default is 5_000
          # polling_interval: 5_000,
          query: "#{app_name}.internal",
          node_basename: app_name
        ]
      ]
    ]

In my Application’s children:

    topologies = Application.get_env(:libcluster, :topologies) || []

    children = [
      # a bunch of children
      MyApp.Endpoint,
      {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]}
    ]

Elixir version: 1.13
libcluster version: 3.3.1

I’ve noticed something like this, but I have a few questions to see if it’s the same. Do you only see this during deployments? Do you use a static release cookie? What is returned when you ssh to the remote console and call Node.list()?
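
For reference, here is a minimal sketch of the checks I have in mind, run from a remote IEx console on the deployed release. It only uses standard Node functions; what they return will depend on your setup.

```elixir
# Run these in a remote IEx session attached to the running release.

# Is distribution enabled at all? Returns false if the node was started
# without a name (in that case Node.self() is :nonode@nohost).
Node.alive?()

# The name this node registered with. With the Fly guide's rel/env.sh.eex
# it should look like :"<FLY_APP_NAME>@<IPv6 address>".
Node.self()

# Nodes currently connected to this one. An empty list means
# no node has been clustered with this one.
Node.list()

# The cookie in use. Two nodes can only connect if they share the same cookie.
Node.get_cookie()
```

If Node.self() does not have the form libcluster is trying to reach in the log lines, that mismatch is worth looking at before anything else.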

I have noticed the same “unable to connect” messages during bluegreen deployments, even though I use a static cookie, so the incoming and outgoing VMs should be able to connect. However, once the new ones are up and the old ones are stopped, the messages go away and my cluster is connected. I can verify this with Node.list(), and also by viewing the LiveDashboard in production.

I see this as soon as I deploy and it never resolves. I have a similar app with the same config that I have not deployed to since I noticed this issue because I am afraid it will fail similarly.

Right now I cannot ssh on this particular app because I get:

Error host unavailable: host was not found in DNS

Which might be part of the problem?

When I was able to ssh, Node.list() would return an empty list.

I do use a static cookie and I have tried both with and without bluegreen deployment enabled. I set the cookie up in mix.exs like this:

  defp releases() do
    [
      my_release_name: [
        include_executables_for: [:unix],
        cookie: <STATIC_COOKIE>
      ]
    ]
  end

But you have a scale count greater than 1?

I have other apps that should connect to this cluster (same cookie and config), so Node.list() should not be empty.

I could be wrong, but your cluster is querying for other nodes using FLY_APP_NAME, which will be different in the env of each of your apps. So I think the standard config won’t connect different apps by default, even on the same private network. I would look into that first.

Other things to check are that these lines are in your Dockerfile:

ENV ECTO_IPV6 true
ENV ERL_AFLAGS "-proto_dist inet6_tcp"

and in your /rel/env.sh.eex

#!/bin/sh

ip=$(grep fly-local-6pn /etc/hosts | cut -f 1)
export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=$FLY_APP_NAME@$ip

Sounds like you might have already caught those after going back through the guide.

I wish I could help more. I haven’t had experience with this particular problem, but the node name sounds like a promising lead to me.

Ahhh of course! I believe adding libcluster configs for each app has indeed solved the issue, I should have thought of that…

I still see a lot of “unable to connect” messages, which probably shouldn’t be there, but it might be temporary. We’ll see.

Thank you for your help!

If you get it working, please post your libcluster config to help others with the same problem in the future!

Good idea, here is my config for two apps:

  config :libcluster,
    debug: true,
    topologies: [
      first_app: [
        strategy: Cluster.Strategy.DNSPoll,
        config: [
          # default is 5_000
          # polling_interval: 5_000,
          query: "<FIRST_APP_FLY_NAME>.internal",
          node_basename: "<FIRST_APP_FLY_NAME>"
        ]
      ],
      second_app: [
        strategy: Cluster.Strategy.DNSPoll,
        config: [
          # default is 5_000
          # polling_interval: 5_000,
          query: "<SECOND_APP_FLY_NAME>.internal",
          node_basename: "<SECOND_APP_FLY_NAME>"
        ]
      ]
    ]

Also, I found what the problem was with the “unable to connect” warnings. Contrary to what I said above, I had created a new app and forgot to run mix release.init on it and set up rel/env.sh.eex with the proper config (I had done so on the first app, but forgot that this was not enough).

As a result, my node names were not properly configured as “FLY_APP_NAME@ip” and libcluster was not able to connect to them properly.
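
To illustrate the mismatch, here is a rough sketch of what I understand the DNSPoll strategy to do with each address it resolves: it joins node_basename and the IP into a node name and tries to connect to it. The app name and address below are made-up placeholders, not values from my apps.

```elixir
# Hypothetical values, for illustration only.
node_basename = "my-app"
ip = "fdaa:0:1:a7b:0:0:0:2"

# The node name libcluster will try to reach for that address
# (assumption: DNSPoll builds it as "<node_basename>@<ip>").
expected = :"#{node_basename}@#{ip}"

# Run on the target node: the connection can only succeed if the node
# actually registered under that exact name. Without rel/env.sh.eex
# setting RELEASE_NODE, the release uses its default name instead,
# so this comparison is false.
Node.self() == expected
```

Once rel/env.sh.eex exports RELEASE_NODE as $FLY_APP_NAME@$ip on every app, the names line up with what libcluster asks for.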

Hey,
I see the same behaviour when I deploy. During the deployment I see those warnings, although I share a static cookie using the RELEASE_COOKIE env variable as described here. Any progress on fixing that on your side? It’s kinda annoying to see this with every deployment…