Intermittent Phoenix PubSub subscription failure

We have an application deployed on Fly that is intermittently not receiving broadcasts from Phoenix PubSub. On our development machines nothing is missed. On Fly the failures are random, but at least 50% of the messages that are broadcast are not picked up by the subscription. It’s also not specific subscriptions that fail; that is random too.

As far as I can tell we have a vanilla PubSub implementation, and our Fly install is standard other than that we are connecting to an external database on CrunchyBridge. Our Fly install is running on two “shared-cpu-1x” machines.

I’m posting here as it feels like a network issue, and I’m hoping someone has some ideas about where we can look to troubleshoot this further.

Thanks in advance.

EDIT: I should add that no errors are being shown.

As an update to this, I have found that if I reduce the two machines to one I no longer lose the messages.

Have I missed something in my config to ensure that both machines can listen for PubSub notifications?

The 1/2 error rate with two Machines suggests that they’re maybe thinking of themselves as being two separate clusters. (I.e., not really talking to each other.)

Could you perhaps post the section of your code that defines the clustering—particularly config :libcluster and env.sh.eex?

https://fly.io/docs/elixir/the-basics/clustering/

Those two have turned out to help people with similar-sounding problems in the past.

Also, what does Node.list() tell you when run from iex (from within one of the Machines)?

https://fly.io/docs/elixir/the-basics/troubleshooting/
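For reference, the check itself is only a couple of expressions. This is a minimal sketch, assuming you already have a remote iex session attached to the running release on one of the Machines (how you attach depends on your release setup):

```elixir
# Name of the node this shell is attached to.
IO.inspect(Node.self(), label: "self")

# Other nodes currently connected to this one.
# An empty list means this node is not clustered with anything.
IO.inspect(Node.list(), label: "connected")
```

With two Machines that are clustered correctly, each one should list the other here.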

I have no mention of :libcluster anywhere in configuration.

env.sh.eex has the default content provided by Fly.

#!/bin/sh

# configure node for distributed erlang with IPV6 support
export ERL_AFLAGS="-proto_dist inet6_tcp"
export ECTO_IPV6="true"
export DNS_CLUSTER_QUERY="${FLY_APP_NAME}.internal"
export RELEASE_DISTRIBUTION="name"
export RELEASE_NODE="${FLY_APP_NAME}-${FLY_IMAGE_REF##*-}@${FLY_PRIVATE_IP}"

Node.list() may be the issue: it returns an empty list, [].

Just to avoid ambiguity… [] is what I was expecting you to get (based on your original problem description); it confirms the hypothesis that there are no other nodes in that cluster.

OK, more broadly, what code do you have that does cluster discovery?


Possibilities include libcluster, dns_cluster, and peerage.

Or someone else in your group might have implemented something specific to the details of your local environment.
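Of those, dns_cluster is the one that the DNS_CLUSTER_QUERY variable in your env.sh.eex is meant to feed. As a rough sketch of what the wiring usually looks like (the names :my_app and MyApp are placeholders for your own application, the version requirement is an assumption to check against Hex, and these are fragments of three separate files rather than one runnable script):

```elixir
# mix.exs: add the dependency (version requirement is an assumption).
defp deps do
  [
    {:dns_cluster, "~> 0.1"}
  ]
end

# config/runtime.exs: read the query that env.sh.eex exports.
config :my_app, :dns_cluster_query, System.get_env("DNS_CLUSTER_QUERY")

# lib/my_app/application.ex: start DNSCluster in the supervision tree.
# :ignore disables discovery when the variable is unset (e.g. in development).
children = [
  {DNSCluster, query: Application.get_env(:my_app, :dns_cluster_query) || :ignore},
  {Phoenix.PubSub, name: MyApp.PubSub}
]
```

Once the nodes find each other, Node.list() should stop returning [] and broadcasts made on one Machine should reach subscribers on the other.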

@mayailurus

Thanks for your help. We didn’t have clustering set up; we had rather naively assumed that two machines would be clustered by default. Lesson learned.

You put us on the right track, thank you.

Best,
Adam