I have a Postgres cluster which I created, scaled up to a couple of instances, and then scaled back down to a single VM.
However, the VM is now reporting in the logs:
cmd/sentinel.go:276 no keeper info available
This message appears every few seconds. There are also two replication slots which appear to be stale, causing runaway WAL growth which eventually led to a full disk.
How do I recover this cluster so that there are no stale replication slots and the WAL is properly cleared?
I had to expand the disk to recover availability, so now my disk is also oversized and needs reducing. The WAL is growing rapidly and this isn’t sustainable. This seems to come up quite often on the forums, and no one has provided an actual solution beyond “we fixed it now”, which is no good.
I figured out how to reduce the disk size:
- Create a new volume of the desired size.
- Scale up the DB cluster.
- Wait for it to become stable.
- Stop the VMs on the old disk size and scale down the cluster.
- Delete the old (oversized) volumes.
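The steps above can be sketched as a plan that only prints commands for review and never runs them. This is a rough sketch under assumptions: the volume name pg_data, the counts 2 and 1, and the function name volume_swap_plan are illustrative, and the exact flyctl flags may differ between flyctl versions.

```shell
# Print (not run) the commands for swapping an oversized volume for a smaller one.
# Arguments: app name, region, new volume size in GB (all supplied by the caller).
volume_swap_plan() {
  app="$1"; region="$2"; size_gb="$3"
  cat <<EOF
# 1. Create a new, smaller volume (pg_data is an assumed volume name).
fly volumes create pg_data --size $size_gb --region $region --app $app
# 2. Scale up so a new VM attaches to the new volume.
fly scale count 2 --app $app
# 3. Wait until the new replica is healthy before going further.
fly status --app $app
# 4. Note the ID of the old, oversized volume.
fly volumes list --app $app
# 5. After stopping the VM on the old volume, scale back down.
fly scale count 1 --app $app
# 6. Delete the old volume (placeholder ID, fill in from step 4).
fly volumes delete <old-volume-id>
EOF
}
```

For example, volume_swap_plan my-pg-app lhr 10 prints the plan for a hypothetical app. Before scaling down, it is worth confirming that the VM which remains is the one on the new volume and that it has become the leader.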
However, there are still stale replication slots, and dropping them just causes them to be re-created, so this is very much time-sensitive as the WAL will fill the disk again.
So after some faffing about I think I finally managed to resolve it.
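To see which slots are holding WAL back, a query along these lines may help. It is a minimal sketch, assuming PostgreSQL 10 or later (for pg_current_wal_lsn and pg_wal_lsn_diff) and that it is run on the primary; PG_PRIMARY_URL is a hypothetical variable holding a connection string, and slot_report_sql is just an illustrative name.

```shell
# Emit a query that lists each replication slot and how much WAL it is retaining.
slot_report_sql() {
  cat <<'SQL'
SELECT slot_name,
       slot_type,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;
SQL
}

# Only talk to the database when a connection string is provided.
# PG_PRIMARY_URL is an assumed variable name, not something Fly sets for you.
if [ -n "${PG_PRIMARY_URL:-}" ]; then
  slot_report_sql | psql "$PG_PRIMARY_URL"
fi
```

A slot with active = f and a large retained_wal value is a likely candidate for being stale.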
I used fly ssh console to access the DB instance.
I then spent a while figuring out which parameters I needed to pass to
stolonctl to get the status:
stolonctl status --cluster-name $FLY_APP --store-backend consul --store-url $FLY_CONSUL_URL
Then I could identify and remove the unhealthy (Removed) keepers and use
This appears to have resolved the issue, but it took an unnecessarily long time to figure out.
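For anyone following along, removing a keeper might look like the sketch below. The post above is cut off before naming the command, so using stolonctl's removekeeper subcommand here is an assumption about what was run; the connection flags are copied from the status command above, and the helper names stolon and remove_keeper are illustrative.

```shell
# Wrap stolonctl with the same connection flags used for the status command.
stolon() {
  stolonctl "$@" --cluster-name "$FLY_APP" --store-backend consul --store-url "$FLY_CONSUL_URL"
}

# Remove one keeper by UID. Take the UID from the keeper list in the status
# output; this function does not look it up for you.
remove_keeper() {
  keeper_uid="${1:-}"
  if [ -z "$keeper_uid" ]; then
    echo "usage: remove_keeper <keeper-uid>" >&2
    return 1
  fi
  stolon removekeeper "$keeper_uid"
}
```

Running stolon status before and after remove_keeper <keeper-uid> should show whether the keeper is gone. Once no keeper references a slot any more, a stale slot can be dropped from psql with SELECT pg_drop_replication_slot('<slot-name>');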
Thanks, I have been running into this too and this saved my cluster (only after scaling the volumes all the way to 100GB each). Hope Fly will fix this natively as this is a pretty sharp edge case.
Thanks! I see 48 hours is mentioned in this issue, but I just want to note that this impacted my app for weeks and did not go away on its own.
It might be worth a doc update to the HA Postgres section, to note that scaling has this edge case right now.
@iangcarroll Would you be able to let us know what image version you’re running?
fly image show --app <app-name>
Registry = registry-1.docker.io
Repository = flyio/postgres
Tag = 14.4
Version = v0.0.25
Digest = sha256:e60ad3f6f1e33c07d3b8a353df0243de5a27d6541961464023943bcae2eb080a
We discovered an issue with Stolon that was preventing failed keepers from getting cleaned up as expected. This issue has been patched and addressed with release
You can upgrade to the latest release via:
fly image update --app <app-name>
Also, as a side note:
With this upgrade, you should also no longer need to export any additional environment variables in order to leverage
If you have any questions on this, just let us know!
cc:// @iangcarroll @LeoAdamek