No Keeper Available, runaway WAL

I have a postgres cluster which I created, scaled up to a couple of instances and then scaled back down to a single vm.

However the vm is now reporting in the logs:

cmd/sentinel.go:276 no keeper info available

This message appears every few seconds. There are also two replication slots which appear to be stale, causing runaway WAL growth that eventually led to a full disk.
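For reference, an inactive slot holding back a lot of WAL is what causes the disk to fill; a query like the following shows it (a sketch only: add whatever psql connection flags you normally use, none are shown here):

# list replication slots and how much WAL each one is holding back
psql -c "SELECT slot_name, active,
                pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
         FROM pg_replication_slots;"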

How do I recover this cluster so that there are no stale replication slots and the WAL is properly cleared?

I had to expand the disk to recover availability, so now my disk is also oversized and needs reducing. The WAL is growing rapidly and this isn’t sustainable. This seems to come up quite often on the forums, and no one has provided an actual solution beyond “we fixed it now”, which is no good.

I figured out how to reduce the disk size (rough commands are sketched after the list):

  • Create a new volume of the desired size.
  • Scale up the DB cluster.
  • Wait for it to become stable.
  • Stop the VMs attached to the old volumes and scale the cluster back down.
  • Delete the old (oversized) volumes.
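Roughly, the commands looked like this (a sketch only: the volume name, size, region and app name are placeholders, and you should double-check flag names against your flyctl version):

# create a replacement volume at the smaller size
# (the volume name must match what the app mounts; pg_data on the stock Fly postgres image, I believe)
fly volumes create pg_data --size 10 --region <region> -a <postgres_fly_app_name>
# scale up so a new instance starts on the new volume, then wait for it to be healthy
fly scale count 2 -a <postgres_fly_app_name>
fly status -a <postgres_fly_app_name>
# scale back down (make sure the instance that stops is the one on the old volume),
# then remove the old, oversized volume
fly scale count 1 -a <postgres_fly_app_name>
fly volumes destroy <old_volume_id> -a <postgres_fly_app_name>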

However, there are still stale replication slots, and dropping them just causes them to be re-created, so this is very much time-sensitive: the WAL will fill the disk again.
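For the record, dropping a slot looks like this (slot name is a placeholder, and psql connection flags are omitted), but Stolon re-creates it as long as the dead keeper is still registered in the cluster data, which is why removing the keeper itself is the real fix:

psql -c "SELECT pg_drop_replication_slot('<stale_slot_name>');"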

So after some faffing about, I think I finally managed to resolve it.

I used fly ssh console to access the db instance.

I then spent a while figuring out what params I needed to pass into stolonctl to get the status:

stolonctl status --cluster-name $FLY_APP --store-backend consul --store-url $FLY_CONSUL_URL

Then I could identify the unhealthy (“Removed”) keepers and remove them with stolonctl removekeeper.
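Concretely, something like this (the keeper UID is a placeholder taken from the status output; the same store flags apply as for status):

stolonctl removekeeper <keeper-uid> --cluster-name $FLY_APP --store-backend consul --store-url $FLY_CONSUL_URL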

This appears to have resolved the issue but took an unnecessarily long time to figure out.


Thanks, I have been running into this too and this saved my cluster (only after scaling the volumes all the way to 100GB each). Hope Fly will fix this natively as this is a pretty sharp edge case.

Here’s some additional information about this issue: Reconfigure Stolon's "deadKeeperRemovalInterval" · Issue #34 · fly-apps/postgres-ha · GitHub

Thanks! I see 48 hours is mentioned in this issue, but I just want to note that this impacted my app for weeks and did not go away on its own.

It might be worth a doc update to the HA Postgres section, to note that scaling has this edge case right now.

@iangcarroll Would you be able to let us know what image version you’re running?

fly image show --app <app-name>

@shaun

Image Details
  Registry   = registry-1.docker.io
  Repository = flyio/postgres
  Tag        = 14.4
  Version    = v0.0.25
  Digest     = sha256:e60ad3f6f1e33c07d3b8a353df0243de5a27d6541961464023943bcae2eb080a

We discovered an issue with Stolon that was preventing failed keepers from getting cleaned up as expected. This issue has been patched and addressed with release v0.0.29.

You can upgrade to the latest release via:
fly image update --app <app-name>

Also, as a side note:

With this upgrade, you should also no longer need to export any additional environment variables in order to leverage stolonctl commands.

If you have any questions on this, just let us know!

cc:// @iangcarroll @LeoAdamek


v0.0.33 - problem still exists…

stolonctl status --cluster-name $FLY_APP --store-backend consul --store-url $FLY_CONSUL_URL
nil cluster data: <nil>

The ENV Vars are set.

I can also confirm that on 0.0.41 this issue still exists.

I have a write-up of a related incident about it.

When you have the runaway WAL, I found these commands helpful for killing the orphaned replica.

# connect to postgres machine
fly ssh console -a <postgres_fly_app_name>
# set the environment variables
export $(cat /data/.env | xargs)
# check the stolon status
stolonctl status
# kill the runaway stolon keeper
stolonctl removekeeper <keeper-id>
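After that it’s worth confirming the cleanup took; a quick check (still on the machine, with the same env vars exported; add your usual psql connection flags):

# confirm the dead keeper no longer appears
stolonctl status
# confirm the stale slot is gone; once it is, pg_wal should stop growing
# and be reclaimed over the next checkpoints
psql -c "SELECT slot_name, active FROM pg_replication_slots;"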