Postgres cluster deploy aborted after image update from 14.4 v0.0.28 to 14.4 v0.0.31

I ran fly image update to update my cluster (2 HA nodes in lhr and a read replica in ams) and the machines started to come back up, but there seem to be multiple connection errors. I now have one healthy and read/writable in lhr, and 2 unhealthy (lhr and ams). What should I do?

Strange, i’ve not actually seen this error before. Were you playing with cross-region failovers by chance prior to issuing the update?

Your replica keepers were in a weird state, so I went ahead and reset their configuration and forced a resync. You can do this on your end by issuing a stolonctl removekeeper <keeper-id> from within the console. After doing this, things appear to be back in working order.

If you find that you’re able to reproduce this in any way, please let us know!

No I didn’t do anything to the cluster – it was a snapshot restore created yesterday. One possible oddity is that when the new app came online in lhr it was only a single instance, so I added a second one in lhr, assuming they would become a HA pair, then added a read replica in ams. I made no config changes after that though.

Got it. That all sounds normal and certainly shouldn’t have led to issues. Sorry for the trouble!

1 Like

np, thanks for fixing!