Postgres (PG) Database (DB) issue: "checking stolon status" and "Error opening connection to database"

We have a Nomad-based PG cluster on Fly that has somehow gotten into a bad state.

  • The Leader is functioning normally
  • The replica instance is failing two health checks (but is still queryable)
  • Scaling up new replicas results in 2/3 failed health checks as well

The replicas constantly output the following log lines and never pass their health checks:

  • checking stolon status
  • Error opening connection to database

Here is a screenshot of the logs and app status

I am unable to run the fly pg restart -a gohappy-hub-db command, as it fails with the error "context deadline exceeded".

Hey, listing the health checks might give more insight into what's going on: fly checks list --app <app-name>
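For anyone following along, here is a minimal sketch of how the diagnostic commands from this thread could be run together. It assumes the fly CLI is installed and logged in. The app name is taken from the original post and should be replaced with your own. The run helper is a hypothetical convenience, not part of flyctl: it echoes each command and skips execution if the binary is missing.

```shell
#!/bin/sh
# Assumption: replace with your own Postgres app name.
APP="${APP:-gohappy-hub-db}"

# Hypothetical helper: print the command, then run it only if its
# binary is on PATH, so the script is safe to dry-run anywhere.
run() {
  echo "+ $*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" || echo "command failed: $*"
  fi
}

# Overall app state (instances, roles, check summary).
run fly status --app "$APP"

# Per-check detail, as suggested in the reply above.
run fly checks list --app "$APP"
```

The output of the checks listing is usually the most useful thing to paste into a reply, since it shows which check is failing on which instance.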

Ah, I was not aware of this command, thank you!

Not quite sure how to make sense of these errors. Does anyone have clues about what went wrong here?

Were you able to fix this? I'm having the same issue now. @kurt, please help.