I have a postgres cluster which I created, scaled up to a couple of instances and then scaled back down to a single vm.
However the vm is now reporting in the logs:
cmd/sentinel.go:276 no keeper info available
This message appears every few seconds. There are also two replication slots which appear to be stale, causing runaway WAL growth which eventually led to a full disk.
How do I recover this cluster so that there are no stale replication slots and the WAL is properly cleared?
I had to expand the disk to recover availability, so now my disk is also oversized and needs reducing. The WAL is growing rapidly and this isn't sustainable. This seems to come up quite often on the forums, and no one has provided an actual solution beyond "we fixed it now", which is no good.
However, there are still stale replication slots, and dropping them just causes them to be re-created, so this is very much time-sensitive as the WAL will fill the disk again.
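For anyone else hitting this, here is roughly how I've been inspecting the slots and the WAL they hold back (a sketch assuming you can reach psql via fly postgres connect; the slot name is a placeholder):

# open a psql session against the cluster
fly postgres connect -a <postgres_fly_app_name>
-- list slots and how much WAL each one is retaining
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
-- dropping a slot frees its WAL at the next checkpoint, but as noted above
-- the stale slots come straight back while the dead keeper is still registered
SELECT pg_drop_replication_slot('<slot_name>');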
Thanks, I have been running into this too and this saved my cluster (only after scaling the volumes all the way to 100GB each). Hope Fly will fix this natively as this is a pretty sharp edge case.
We discovered an issue with Stolon that was preventing failed keepers from getting cleaned up as expected. This issue has been patched in release v0.0.29.
You can upgrade to the latest release via: fly image update --app <app-name>
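If it helps, you can check which image the machine is running before and after the upgrade (assuming a reasonably recent flyctl):

# show the current postgres image/version for the app
fly image show --app <app-name>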
Also, as a side note:
With this upgrade, you should also no longer need to export any additional environment variables in order to leverage stolonctl commands.
If you have any questions on this, just let us know!
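That is, after upgrading, something like this should be all that's needed (a sketch, not verified on every release):

# ssh onto the postgres machine
fly ssh console -a <app-name>
# stolonctl should now pick up its configuration without sourcing /data/.env first
stolonctl status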
When you have runaway WAL, I found these commands helpful for removing the orphaned keeper.
# connect to postgres machine
fly ssh console -a <postgres_fly_app_name>
# set the environment variables
export $(cat /data/.env | xargs)
# check the stolon status
stolonctl status
# remove the orphaned keeper (use the keeper ID from the status output)
stolonctl removekeeper <keeper-id>
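Afterwards, I verified the cleanup roughly like this (a sketch, assuming the data volume is mounted at /data as on standard Fly postgres machines):

# the removed keeper should no longer appear in the cluster view
stolonctl status
# once the stale slots stop being re-created, WAL is recycled at the next
# checkpoint and disk usage on the data volume should drop back down
df -h /data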