Clarification about regional deployment and postgres databases

That happens when the coordinator (the sentinel) can’t connect to the local pg (the keeper). Those messages are normal on boot since pg takes longer to start than the sentinel. We’re using stolon for HA; you can check out their repo if you want to learn more: https://github.com/sorintlab/stolon
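For anyone else hitting this on boot: the supervisor side just keeps retrying until the keeper’s Postgres is accepting connections, and logs warnings in the meantime. A minimal sketch of that kind of readiness loop in Go (the DSN, retry count, and delay here are made up for illustration and are not what stolon actually uses):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver used only so we can ping
)

// waitForPostgres pings the keeper's Postgres until it answers or we give up.
// While this loop is still failing, a supervisor would log warnings much like
// the "can't connect to the keeper" messages seen during boot.
func waitForPostgres(dsn string, attempts int, delay time.Duration) error {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = db.Ping(); lastErr == nil {
			return nil // Postgres is up; the warnings stop here
		}
		log.Printf("postgres not ready yet (attempt %d/%d): %v", i+1, attempts, lastErr)
		time.Sleep(delay)
	}
	return lastErr
}

func main() {
	// Hypothetical local keeper address, for illustration only.
	dsn := "host=127.0.0.1 port=5432 user=postgres sslmode=disable"
	if err := waitForPostgres(dsn, 30, 2*time.Second); err != nil {
		log.Fatalf("keeper never became reachable: %v", err)
	}
	log.Println("keeper is accepting connections")
}
```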

OK, thanks. I’m still seeing this message right now despite the two instances passing health checks. Should that be a concern?

I haven’t seen that before, so maybe a concern. We’re looking into it.

I don’t see issues with any other VMs from other pg clusters, only that one. Could you try adding a new volume then scaling up by one to see if the new VM joins cleanly?

Still seeing these errors after scaling up by one.

Unfortunately, about 2 hours ago, it looks like my primary is not able to start up, so my app is failing. I’ll try logging in to debug, but would appreciate any help.

c55e9a7 14      ams    run     running (offline) 3 total, 1 passing, 1 critical 0        4h44m ago
9e8af553 14      ams    run     running (leader)  3 total, 2 passing, 1 critical 0        2021-05-02T16:12:38Z
0448f13c 14      ams    run     running (offline) 3 total, 1 passing, 1 critical 0        2021-04-30T21:36:12Z

This is an issue on our end we’re working on. We’ll update once it’s fixed.

OK, thanks. I didn’t see an update on the status page, so is it just affecting my account?

FWIW, I do now see a leader running, after scaling the cluster down and up.

You should be good now.

The leader failed because it couldn’t connect to consul, which was having unrelated intermittent issues. Since it couldn’t connect to consul to lock during an election, pg failed to start. We’re still trying to figure out why consul had issues and will be adding monitoring to prevent this going forward.
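To make the consul dependency concrete: the election works roughly like a distributed lock, and if consul is unreachable the lock can’t be acquired, so the would-be leader refuses to start rather than risk a split brain. A rough sketch of that pattern using the Consul Go API (the key name and the surrounding logic are illustrative, not stolon’s actual code):

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local consul agent (defaults to 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatalf("cannot create consul client: %v", err)
	}

	// Hypothetical lock key used only for this example.
	lock, err := client.LockKey("service/example-pg/leader")
	if err != nil {
		log.Fatalf("cannot prepare lock: %v", err)
	}

	// If consul is down, Lock() fails here; in an HA setup that means no node
	// can prove it is the leader, so Postgres stays stopped.
	lostCh, err := lock.Lock(nil)
	if err != nil {
		log.Fatalf("could not acquire leader lock (is consul reachable?): %v", err)
	}
	defer lock.Unlock()

	log.Println("acquired leadership, safe to start Postgres")

	// If the consul session is lost later, lostCh is closed and the node
	// should step down.
	<-lostCh
	log.Println("lost leadership, stepping down")
}
```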

The other issue you saw, “no keeper info available”, was a scary but inconsequential warning caused by removing a VM & its volume without removing it from the cluster first. We’ll be looking at how to do that automatically for you.

OK - is there some internal monitoring we can set up to detect this sort of issue? And/or be able to run the master even if consul is down?

On a side note, having a cluster down like this could be worked around if a backup were available to restore from. Are you looking into (or interested in) adding WAL archiving or another automated backup solution to the Docker image?
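In case it helps while you’re deciding: WAL archiving in Postgres just invokes `archive_command` with each completed segment, so even a tiny helper that copies the segment somewhere durable is a starting point (real setups would push to object storage with something like wal-g or wal-e). A deliberately simplified Go sketch of such a helper, with a made-up archive directory, just to show the shape of it:

```go
package main

// Toy archive_command helper: copies one WAL segment into an archive directory.
// Postgres would call it roughly as:
//   archive_command = '/usr/local/bin/archive-wal %p %f'

import (
	"io"
	"log"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) != 3 {
		log.Fatal("usage: archive-wal <full-path-to-wal> <wal-filename>")
	}
	src := os.Args[1]
	dst := filepath.Join("/var/lib/postgresql/wal_archive", os.Args[2]) // hypothetical destination

	in, err := os.Open(src)
	if err != nil {
		log.Fatalf("open source: %v", err)
	}
	defer in.Close()

	// Write to a temp file first so a partial copy never looks like a valid segment.
	tmp := dst + ".tmp"
	out, err := os.Create(tmp)
	if err != nil {
		log.Fatalf("create temp: %v", err)
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		log.Fatalf("copy: %v", err)
	}
	if err := out.Close(); err != nil {
		log.Fatalf("close: %v", err)
	}
	if err := os.Rename(tmp, dst); err != nil {
		log.Fatalf("rename: %v", err)
	}
	// Exiting 0 tells Postgres the segment is safely archived and can be recycled.
}
```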


Just checking on this - is there a way to clear out the ‘no keeper info available’ error now on an existing cluster without taking it down?

We’re working on this and are close to a solution! There are some stolon settings we needed to tweak.

We’ll be able to release an updated config soon.
