Postgres (PG) Database (DB) issue: "checking stolon status" and "Error opening connection to database"

We have a Nomad-based PG cluster on Fly that has somehow gotten into a bad state.

  • The Leader is functioning normally
  • The replica instance is failing two health checks (but is still queryable)
  • Scaling up new replicas results in 2/3 failed health checks as well

The replicas constantly output the following logs and never pass their health checks:

  • checking stolon status
  • Error opening connection to database

Here is a screenshot of the logs and app status

I am unable to run the fly pg restart -a gohappy-hub-db command, as it fails with the error "context deadline exceeded".

Hey, listing the health checks might give more insight into what's going on: fly checks list --app <app-name>.
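If it helps, here is a minimal sketch of wrapping that command in a small shell function, assuming flyctl is installed and you are logged in. The name show_checks is a hypothetical helper, not part of flyctl; the only fly invocation it uses is the one mentioned above.

```shell
# Hypothetical helper (not part of flyctl): list health checks for one app.
show_checks() {
  app="$1"
  # Refuse to run without an app name so we never query the wrong app.
  if [ -z "$app" ]; then
    echo "usage: show_checks <app-name>" >&2
    return 1
  fi
  # Same command as suggested above, with the app name filled in.
  fly checks list --app "$app"
}
```

For the cluster in this thread that would be called as show_checks gohappy-hub-db.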

Ah, I was not aware of this command, thank you!

I'm not quite sure how to make sense of these errors. Does anyone have a clue about what went wrong here?