Leader node has issues passing the CPU health checks even after crazy scale up

Our cluster database is having a weekly issue :sweat: it is failing CPU health checks even after we upgraded to a dedicated-cpu-4x.
Another strange thing we've noticed: we are trying to stop the leader so that a new leader gets elected, but that doesn't seem to work either.
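For reference, the operations described above correspond roughly to the following flyctl commands (just a sketch; the app name and VM ID are placeholders, not our real ones):

```
# Scale the database VMs up to a dedicated-cpu-4x size
fly scale vm dedicated-cpu-4x --app my-db-app

# List the VMs and their health checks to find the current leader
fly status --app my-db-app

# Stop the leader VM, hoping a replica gets promoted
fly vm stop <leader-vm-id> --app my-db-app
```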

@shema Mind sending me the name of the App that’s experiencing issues?

Another strange thing we've noticed: we are trying to stop the leader so that a new leader gets elected, but that doesn't seem to work either.

Stolon will wait up to 20 seconds for the primary to recover before it issues a failover. That being said, the failover likely isn’t happening for one of two reasons:

  1. The VM is being rescheduled and recovering before the failover threshold is hit.
  2. The replica isn’t healthy or is in a state where it’s ineligible to become primary.

It’s likely that it’s due to #1.
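If you want to inspect or tune that threshold, it's the failInterval field in the stolon cluster spec (20s by default). A minimal sketch, assuming you've already exported the env vars from /data/.env so stolonctl can reach the store:

```
# Show the current cluster specification, including failInterval
stolonctl spec

# Example only: raise the fail interval so brief VM restarts don't trigger a failover
stolonctl update --patch '{ "failInterval": "30s" }'
```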

Instead of stopping your VM, I would recommend using stolonctl to perform the failover.

You can achieve this by doing the following:

  1. SSH into one of your VMs.
  2. Run export $(cat /data/.env | xargs) to export the necessary environment variables.
  3. Run stolonctl status to see your cluster status, keeper health, etc. You can also use this to find the keeper ID of your primary.
  4. Run stolonctl failkeeper <primary-keeper-id> to force the failover.
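Put together, a full session looks roughly like this (assuming you reach the VM with fly ssh console; the app name and keeper ID are placeholders):

```
# SSH into one of the Postgres VMs
fly ssh console --app my-db-app

# Export the stolonctl connection settings (cluster name, store backend, etc.)
export $(cat /data/.env | xargs)

# Inspect cluster status and note the keeper ID of the current primary
stolonctl status

# Mark that keeper as failed so the sentinel elects a new primary
stolonctl failkeeper <primary-keeper-id>
```

failkeeper only marks the keeper as temporarily failed; the sentinel then promotes a healthy standby, and the old keeper rejoins as a replica once it recovers.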

Here’s some documentation on failkeeper: https://github.com/sorintlab/stolon/blob/master/doc/commands/stolonctl_failkeeper.md

It looks like we are not the only ones with pg issues this time :smile:. Our cluster has been throwing crazy errors; here is a sample of our logs:

2021-10-03T16:20:27.324398473Z app[92becc29] lhr [info] exporter | INFO[0000] Starting Server: :9187                        source="postgres_exporter.go:1837"
2021-10-03T16:20:27.543741617Z app[92becc29] lhr [info] proxy    | 2021-10-03T16:20:27.542Z	INFO	cmd/proxy.go:419	proxy uid	{"uid": "9b38836a"}
2021-10-03T16:20:27.546088913Z app[92becc29] lhr [info] sentinel | 2021-10-03T16:20:27.542Z	INFO	cmd/sentinel.go:2000	sentinel uid	{"uid": "e4a42daf"}
2021-10-03T16:20:27.546872995Z app[92becc29] lhr [info] keeper   | 2021-10-03T16:20:27.545Z	INFO	cmd/keeper.go:2091	exclusive lock on data dir taken
2021-10-03T16:20:27.728848917Z app[92becc29] lhr [info] exporter | INFO[0000] Established new database connection to "fdaa:0:1af2:a7b:a9a:0:3e2d:2:5433".  source="postgres_exporter.go:970"
2021-10-03T16:20:28.169428456Z app[92becc29] lhr [info] checking stolon status
2021-10-03T16:20:28.729348725Z app[92becc29] lhr [info] exporter | ERRO[0001] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433: connect: connection refused  source="postgres_exporter.go:1658"
2021-10-03T16:20:28.862382934Z app[92becc29] lhr [info] sentinel | 2021-10-03T16:20:28.861Z	INFO	cmd/sentinel.go:82	Trying to acquire sentinels leadership
2021-10-03T16:20:34.846144971Z app[92becc29] lhr [info] exporter | INFO[0007] Established new database connection to "fdaa:0:1af2:a7b:a9a:0:3e2d:2:5433".  source="postgres_exporter.go:970"
2021-10-03T16:20:35.846833408Z app[92becc29] lhr [info] exporter | ERRO[0008] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433: connect: connection refused  source="postgres_exporter.go:1658"
2021-10-03T16:20:37.352814411Z app[cc1b5664] lhr [info] proxy    | 2021-10-03T16:20:37.352Z	INFO	cmd/proxy.go:304	check timeout timer fired
2021-10-03T16:20:42.728154485Z app[92becc29] lhr [info] exporter | INFO[0015] Established new database connection to "fdaa:0:1af2:a7b:a9a:0:3e2d:2:5433".  source="postgres_exporter.go:970"
2021-10-03T16:20:43.728942388Z app[92becc29] lhr [info] exporter | ERRO[0016] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433: connect: connection refused  source="postgres_exporter.go:1658"

Our staging and production databases in ORD are both offline and have brought down our app. More information is posted in the other two threads about the pg issues.