Log level for postgres cluster

I want to post an update here about my experience.

  1. I followed the steps mentioned above and set the [env] variable STKEEPER_LOG_LEVEL to "warn", as described in the documentation at https://github.com/sorintlab/stolon/blob/master/doc/commands_invocation.md and https://github.com/sorintlab/stolon/blob/master/doc/commands/stolon-keeper.md
  2. I then carefully read through Managing Fly Postgres · Fly Docs to make sure I was not about to do anything stupid.
  3. I then ran the final fly deploy . --image flyio/postgres-flex:15.3 command, followed by fly pg failover, as mentioned in the original answer and at High Availability & Global Replication · Fly Docs (the full sequence is sketched after this list).
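
For reference, here is roughly the sequence from the three steps above, written out with placeholder names (<db-app> stands for the Postgres app, and I am assuming the variable goes in the [env] section of that app's fly.toml). Given what happened next, treat this as a record of what I did, not a recommendation:

```sh
# fly.toml of the Postgres app (assumed location of the variable):
#   [env]
#     STKEEPER_LOG_LEVEL = "warn"

# Redeploy the Postgres app with the flex image so the env change is picked up
fly deploy . --image flyio/postgres-flex:15.3 --app <db-app>

# Trigger a failover, as suggested in the original answer
fly pg failover --app <db-app>
```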

This broke my PG cluster and left it in the "Postgres is down, cannot restart. Error no active leader found." state.

To be fair, I am not sure these steps were the sole culprit: looking through the PG logs showed many ERROR entries before this as well, but at least the cluster was still available at that point in time. As a side note, all PG logs are emitted at INFO level, with the words WARNING or ERROR appearing in the log message itself rather than in the log level. So in the end this entire activity was unproductive, because I need error logs to be streamed to my aggregator. (A possible stopgap is sketched below.)
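
As a stopgap (just a sketch on my side, not something from the Fly docs), the severity can be filtered out of the message text instead of the level field; <db-app> is a placeholder for the Postgres app name:

```sh
# Stream Postgres app logs and keep only lines whose message text mentions
# WARNING or ERROR (the level field itself is always INFO)
fly logs --app <db-app> | grep -E "WARNING|ERROR"
```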

In any case, my production app was now fully down and I was panicking. Eventually, after 27 minutes of downtime, I was able to get the PG cluster back again through the following steps:

  1. Luckily, all checks were passing on a single node. The other two nodes of my cluster were in the stopped/error and started/error states, respectively.
  2. I cloned the healthy node twice using fly machines clone <id> --region <region> --app <db-app> to add two new nodes. These came up healthy.
  3. I then destroyed the bad nodes with fly machine destroy <id>, adding --force for the one that was stuck in the started/error state (commands sketched after this list).
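
The recovery commands, again with placeholders (<healthy-id> is the machine whose checks were still passing, <bad-id-1> and <bad-id-2> are the failed machines):

```sh
# Clone the healthy node twice to get the cluster back to three members
fly machines clone <healthy-id> --region <region> --app <db-app>
fly machines clone <healthy-id> --region <region> --app <db-app>

# Remove the broken members; --force was needed for the one stuck in started/error
fly machine destroy <bad-id-1> --app <db-app>
fly machine destroy <bad-id-2> --force --app <db-app>
```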

Finally, after the bad nodes were deleted, the entire cluster became healthy again. At this point, I restarted my app nodes just to be sure that connection pools were flushed, and everything started working again.
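
For what it's worth, that restart can be done with fly apps restart; this is a sketch with <app-name> as a placeholder for the application (not the DB) app, not necessarily exactly what I typed in the moment:

```sh
# Restart every machine in the application so stale connections to the old
# Postgres leader are dropped and the pools reconnect
fly apps restart <app-name>
```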

PS: The STKEEPER_LOG_LEVEL var did not change anything for me.
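
If anyone wants to double-check whether the variable actually reaches the keeper process, a rough way to do it (assuming fly ssh console works against the DB app, and that pgrep and /proc are available inside the postgres-flex image) is:

```sh
# Open a shell on one of the Postgres machines
fly ssh console --app <db-app>

# Inside the machine: dump the environment of the running stolon keeper
# (assumes the process is named stolon-keeper)
cat /proc/$(pgrep -f stolon-keeper | head -n1)/environ | tr '\0' '\n' | grep STKEEPER
```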