Postgres cluster is unstable

According to the fly checks, our pg cluster app keeps phasing in and out of critical status because it fails the vm check.

Under normal circumstances this would be fine, but every time it becomes critical we lose all our LISTEN/NOTIFY connections, and they are not renewed automatically unless we restart our connected applications.
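For anyone hitting the same dropped-listener problem, one possible client-side mitigation is to wrap the listener in a reconnect loop so a lost connection triggers a fresh LISTEN instead of requiring an app restart. This is only a rough sketch, not what our apps currently do: `listen_with_reconnect` and `open_listener` are hypothetical names, and `open_listener` is assumed to connect, issue LISTEN, and return an iterable of payloads that raises `ConnectionError` when the connection drops.

```python
import time
from typing import Callable, Iterable, Optional


def listen_with_reconnect(
    open_listener: Callable[[], Iterable[str]],
    handle: Callable[[str], None],
    max_failures: Optional[int] = None,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Consume notifications, reopening the listener whenever it drops.

    open_listener is a hypothetical factory (an assumption of this sketch):
    it should connect, run LISTEN, and return an iterable that yields
    payloads and raises ConnectionError when the connection is lost.
    """
    failures = 0  # consecutive failures since the last delivered payload
    while True:
        try:
            for payload in open_listener():
                failures = 0  # a delivered notification means the link works
                handle(payload)
            return  # the listener ended cleanly, nothing left to do
        except ConnectionError:
            failures += 1
            if max_failures is not None and failures >= max_failures:
                raise  # give up and let the caller decide what to do
            # Exponential backoff, capped so retries never wait too long.
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
```

With a real driver, `open_listener` would have to translate the driver's own disconnect error into `ConnectionError`; that mapping is left out here because it depends on the client library in use.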

This always happens on weekends, and we suspect consul is at fault because that’s the explanation we got last time.

For example, these 3 notifications arrived less than 1 minute apart.

Can you check it, please? :pray:

Those check failures aren’t related to consul; they indicate that the VM is CPU constrained. The VMs are probably restarting when that check fails multiple times in a row.

You can change this behavior.

  • fly config save -a <database-app-name>
  • In the saved fly.toml, for each health check you want to let fail repeatedly, set this: restart_limit = 0
  • fly deploy -i flyio/postgres:12.8

This should prevent the VMs from restarting when those checks fail.