Log level for postgres cluster

I want to post an update here about my experience.

  1. I followed the steps mentioned above and set the [env] variable STKEEPER_LOG_LEVEL to "warn", as described in the documentation at https://github.com/sorintlab/stolon/blob/master/doc/commands_invocation.md and https://github.com/sorintlab/stolon/blob/master/doc/commands/stolon-keeper.md
  2. I then carefully read through Managing Fly Postgres · Fly Docs to make sure I was not about to do anything stupid.
  3. I then ran the final fly deploy . --image flyio/postgres-flex:15.3 command, followed by fly pg failover, as mentioned in the original answer and at High Availability & Global Replication · Fly Docs (the full sequence is sketched after this list).
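
For reference, here is roughly the sequence from the three steps above, written out with placeholder names (<db-app> stands for the Postgres app, and I am assuming the variable goes in the [env] section of that app's fly.toml). Given what happened next, treat this as a record of what I did, not a recommendation:

```sh
# fly.toml of the Postgres app (assumed location of the variable):
#   [env]
#     STKEEPER_LOG_LEVEL = "warn"

# Redeploy the Postgres app with the flex image so the env change is picked up
fly deploy . --image flyio/postgres-flex:15.3 --app <db-app>

# Trigger a failover, as suggested in the original answer
fly pg failover --app <db-app>
```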

This broke my PG cluster and left it in the "Postgres is down, cannot restart. Error no active leader found." state.

To be fair, I am not sure these steps were the sole culprit: looking through the PG logs showed many ERROR entries before this as well, but at least the cluster was still available at that point in time. As a side note, all PG logs are emitted at INFO level, with the words WARNING or ERROR appearing in the log message itself rather than in the log level. So in the end this entire activity was unproductive, because I need error logs to be streamed to my aggregator. (A possible stopgap is sketched below.)
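
As a stopgap (just a sketch on my side, not something from the Fly docs), the severity can be filtered out of the message text instead of the level field; <db-app> is a placeholder for the Postgres app name:

```sh
# Stream Postgres app logs and keep only lines whose message text mentions
# WARNING or ERROR (the level field itself is always INFO)
fly logs --app <db-app> | grep -E "WARNING|ERROR"
```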

In any case, my production app was now fully down and I was panicking. Eventually, after 27 minutes of downtime, I was able to get the PG cluster back again through the following steps:

  1. Luckily, all checks were passing on a single node. The other two nodes of my cluster were in the stopped/error and started/error states, respectively.
  2. I cloned the healthy node twice using fly machines clone <id> --region <region> --app <db-app> to add two new nodes. These came up healthy.
  3. I then destroyed the bad nodes with fly machine destroy <id>, adding --force for the one that was stuck in the started/error state (commands sketched after this list).
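
The recovery commands, again with placeholders (<healthy-id> is the machine whose checks were still passing, <bad-id-1> and <bad-id-2> are the failed machines):

```sh
# Clone the healthy node twice to get the cluster back to three members
fly machines clone <healthy-id> --region <region> --app <db-app>
fly machines clone <healthy-id> --region <region> --app <db-app>

# Remove the broken members; --force was needed for the one stuck in started/error
fly machine destroy <bad-id-1> --app <db-app>
fly machine destroy <bad-id-2> --force --app <db-app>
```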

Finally, after the bad nodes were deleted, the entire cluster became healthy again. At this point, I restarted my app nodes just to be sure that connection pools were flushed, and everything started working again.
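
For what it's worth, that restart can be done with fly apps restart; this is a sketch with <app-name> as a placeholder for the application (not the DB) app, not necessarily exactly what I typed in the moment:

```sh
# Restart every machine in the application so stale connections to the old
# Postgres leader are dropped and the pools reconnect
fly apps restart <app-name>
```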

PS: The STKEEPER_LOG_LEVEL var did not change anything for me.
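
If anyone wants to double-check whether the variable actually reaches the keeper process, a rough way to do it (assuming fly ssh console works against the DB app, and that pgrep and /proc are available inside the postgres-flex image) is:

```sh
# Open a shell on one of the Postgres machines
fly ssh console --app <db-app>

# Inside the machine: dump the environment of the running stolon keeper
# (assumes the process is named stolon-keeper)
cat /proc/$(pgrep -f stolon-keeper | head -n1)/environ | tr '\0' '\n' | grep STKEEPER
```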