No Keeper Available, runaway WAL

I have a postgres cluster which I created, scaled up to a couple of instances and then scaled back down to a single vm.

However the vm is now reporting in the logs:

cmd/sentinel.go:276 no keeper info available

This message appears every few seconds. There are also two replication slots which appear to be stale, causing runaway WAL growth that eventually led to a full disk.
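For reference, an inactive slot holding back a lot of WAL is what causes the disk to fill; a query like the following shows it (a sketch only: add whatever psql connection flags you normally use, none are shown here):

# list replication slots and how much WAL each one is holding back
psql -c "SELECT slot_name, active,
                pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
         FROM pg_replication_slots;"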

How do I recover this cluster so that there are no stale replication slots and the WAL is properly cleared?

I had to expand the disk to recover availability, so now my disk is also oversized and needs reducing. The WAL is growing rapidly and this isn’t sustainable. This seems to come up quite often on the forums, and no one has provided an actual solution beyond “we fixed it now”, which is no good.

I figured out how to reduce the disk size (rough commands are sketched after the list):

  • Create a new volume of the desired size.
  • Scale up the DB cluster.
  • Wait for it to become stable.
  • Stop the VMs attached to the old volumes and scale the cluster back down.
  • Delete the old (oversized) volumes.
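Roughly, the commands looked like this (a sketch only: the volume name, size, region and app name are placeholders, and you should double-check flag names against your flyctl version):

# create a replacement volume at the smaller size
# (the volume name must match what the app mounts; pg_data on the stock Fly postgres image, I believe)
fly volumes create pg_data --size 10 --region <region> -a <postgres_fly_app_name>
# scale up so a new instance starts on the new volume, then wait for it to be healthy
fly scale count 2 -a <postgres_fly_app_name>
fly status -a <postgres_fly_app_name>
# scale back down (make sure the instance that stops is the one on the old volume),
# then remove the old, oversized volume
fly scale count 1 -a <postgres_fly_app_name>
fly volumes destroy <old_volume_id> -a <postgres_fly_app_name>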

However, there are still stale replication slots, and dropping them just causes them to be re-created, so this is very much time-sensitive: the WAL will fill the disk again.
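For the record, dropping a slot looks like this (slot name is a placeholder, and psql connection flags are omitted), but Stolon re-creates it as long as the dead keeper is still registered in the cluster data, which is why removing the keeper itself is the real fix:

psql -c "SELECT pg_drop_replication_slot('<stale_slot_name>');"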

So after some faffing about, I think I finally managed to resolve it.

I used fly ssh console to access the db instance.

I then spent a while figuring out what params I needed to pass into stolonctl to get the status:

stolonctl status --cluster-name $FLY_APP --store-backend consul --store-url $FLY_CONSUL_URL

Then I could identify the unhealthy (“Removed”) keepers and remove them with stolonctl removekeeper.
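Concretely, something like this (the keeper UID is a placeholder taken from the status output; the same store flags apply as for status):

stolonctl removekeeper <keeper-uid> --cluster-name $FLY_APP --store-backend consul --store-url $FLY_CONSUL_URL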

This appears to have resolved the issue but took an unnecessarily long time to figure out.


Thanks, I have been running into this too and this saved my cluster (only after scaling the volumes all the way to 100GB each). Hope Fly will fix this natively as this is a pretty sharp edge case.

Here’s some additional information about this issue: Reconfigure Stolon's "deadKeeperRemovalInterval" · Issue #34 · fly-apps/postgres-ha · GitHub

Thanks! I see 48 hours is mentioned in this issue, but I just want to note that this impacted my app for weeks and did not go away on its own.

It might be worth a doc update to the HA Postgres section, to note that scaling has this edge case right now.

@iangcarroll Would you be able to let us know what image version you’re running?

fly image show --app <app-name>

@shaun

Image Details
  Registry   = registry-1.docker.io
  Repository = flyio/postgres
  Tag        = 14.4
  Version    = v0.0.25
  Digest     = sha256:e60ad3f6f1e33c07d3b8a353df0243de5a27d6541961464023943bcae2eb080a

We discovered an issue with Stolon that was preventing failed keepers from getting cleaned up as expected. This issue has been patched and addressed with release v0.0.29.

You can upgrade to the latest release via:
fly image update --app <app-name>

Also, as a side note:

With this upgrade, you should also no longer need to export any additional environment variables in order to leverage stolonctl commands.

If you have any questions on this, just let us know!

cc:// @iangcarroll @LeoAdamek


v0.0.33 - problem still exists…

stolonctl status --cluster-name $FLY_APP --store-backend consul --store-url $FLY_CONSUL_URL
nil cluster data: <nil>

The ENV Vars are set.

I can also confirm that on 0.0.41 this issue still exists.

I have a write-up of a related incident about it.

When you have the runaway WAL, I found these commands helpful for killing the orphaned replica.

# connect to postgres machine
fly ssh console -a <postgres_fly_app_name>
# set the environment variables
export $(cat /data/.env | xargs)
# check the stolon status
stolonctl status
# kill the runaway stolon keeper
stolonctl removekeeper <keeper-id>
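After that it’s worth confirming the cleanup took; a quick check (still on the machine, with the same env vars exported; add your usual psql connection flags):

# confirm the dead keeper no longer appears
stolonctl status
# confirm the stale slot is gone; once it is, pg_wal should stop growing
# and be reclaimed over the next checkpoints
psql -c "SELECT slot_name, active FROM pg_replication_slots;"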