Log level for postgres cluster

Hey all,

I set up a small postgres cluster via flyctl today. I’ve not really done much with it, but I have noticed that, so far, the application has generated about 500MB of outbound data (over about 12 hours).

It seems to me that this data is the logs being put out by the app; it’s not clear what else it could be. There are quite a lot of logs, and they’re almost entirely from the stolon keeper.

Is there a way to change the log level from INFO to WARN? Looking at the stolon repo, it seems I should be able to control this with the STKEEPER_LOG_LEVEL env var.

As I don’t have a fly.toml file for it, I tried setting it with a secret as a workaround. This doesn’t seem to work, though.
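
For reference, what I tried was roughly this (the app name is a placeholder):

    # set the keeper log level as a secret on the postgres app
    fly secrets set STKEEPER_LOG_LEVEL=warn -a <postgres-app-name>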

Thanks!

Tom

This is probably not log-level related. I don’t believe logs are included in your data transfer. It’s probably replication accounting for that data; there is quite a lot of traffic between Postgres VMs with stolon.

That said, changing the log level would be pretty handy. You can experiment with those env vars. You have to pull down the app config, edit it, and then redeploy (a rough sketch of the sequence follows the steps):

  • Make a directory like my-db, then cd my-db
  • fly config save -a <database-app-name>
  • Set [env] variables in fly.toml
  • fly image show, save the tag value of the current image
  • fly deploy -i flyio/postgres:<tag>
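
Roughly, the whole sequence looks like this (app name and image tag are placeholders):

    mkdir my-db && cd my-db
    fly config save -a <database-app-name>
    # edit fly.toml and add the [env] variables you want to experiment with
    fly image show -a <database-app-name>    # note the current image tag
    fly deploy -i flyio/postgres:<tag>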

Hey - thanks for the quick reply and good to know re the logs/billing.

I’m surprised at the amount of traffic to be honest - so far I’ve just written a small amount of data into it (well under 1MB) and created a schema. I assumed it was the logs, as they were the only thing that seemed like it could be large enough.

Great tip with fly config, thanks for that :slight_smile: I was wondering if that was possible. I’ll have a play and report back here in case others are interested.

@tdfirth I came across this post as we are seeing the same in our logs: stolon does a ton of logging, and we’d like to reduce it if possible to cut through the noise. I’ve been unsuccessful in my attempts to lower the log level. Were you able to change it, or did you figure out another way to cut down on the logs?

I want to post an update here about my experience.

  1. I followed the steps mentioned above and set the [env] variable STKEEPER_LOG_LEVEL to "warn" (see the snippet after this list), as described in the documentation at https://github.com/sorintlab/stolon/blob/master/doc/commands_invocation.md and https://github.com/sorintlab/stolon/blob/master/doc/commands/stolon-keeper.md.
  2. I then carefully read through Managing Fly Postgres · Fly Docs to make sure I was not about to do anything stupid.
  3. I then ran the final fly deploy . --image flyio/postgres-flex:15.3 command, followed by the fly pg failover command, as mentioned in the original answer as well as at High Availability & Global Replication · Fly Docs.
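
For anyone wondering what step 1 looks like, it was just adding this to fly.toml (the env var name comes from the stolon docs linked above):

    [env]
      STKEEPER_LOG_LEVEL = "warn"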

This broke my PG cluster and moved it into a "Postgres is down, cannot restart. Error no active leader found." state.

To be fair, I am not sure that these steps were the sole culprit: looking through the PG logs showed there were many ERROR logs before this as well, but at least the PG cluster was available at that point in time. As a side note, all PG logs are emitted at INFO level, with the words WARNING or ERROR appearing in the message of the log rather than in its level. So in the end this entire activity was unproductive for me, because I need error logs to be streamed to my aggregator.

In any case, my production app was now fully down and I was panicking. Eventually, I was able to get the PG cluster back again through the following steps (after a 27-minute downtime):

  1. Luckily, all checks were passing on a single node. The other two nodes of my cluster were in the stopped/error and started/error states, respectively.
  2. I cloned the healthy node using fly machines clone <id> --region <region> --app <db-app> twice to add 2 new nodes. These came up healthy.
  3. I then destroyed the bad nodes with fly machine destroy <id>, using fly machine destroy <id> --force for the node that was stuck in the started/error state (the full sequence is sketched below).
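
In command form, the recovery was roughly this (IDs, region, and app name are placeholders; the machine IDs come from fly machines list):

    fly machines list -a <db-app>                                    # find the healthy machine and the failed ones
    fly machines clone <healthy-id> --region <region> --app <db-app>
    fly machines clone <healthy-id> --region <region> --app <db-app>
    fly machine destroy <stopped-error-id> -a <db-app>
    fly machine destroy <started-error-id> -a <db-app> --force      # the machine stuck in started/error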

Finally, after the bad nodes were deleted, the entire cluster became healthy again. At this point, I restarted my app nodes just to be sure the connection pools were flushed, and everything started working again.

PS: The STKEEPER_LOG_LEVEL var did not change anything for me.