Our cluster database is having a weekly database issue: it is failing CPU health checks even after we upgraded to a dedicated-cpu-4x.
Another strange experience is that we are trying to stop the leader in order to obtain a new leader, but that does not seem to work either.
@shema Mind sending me the name of the App that’s experiencing issues?
Stolon will wait up to 20 seconds for the primary to recover before it will issue a failover. That being said, the failover likely isn’t happening for one of two reasons:
- The VM is being rescheduled and recovering before the failover threshold is hit.
- The replica isn’t healthy or is in a state where it’s ineligible to become primary.
It’s likely due to the first reason.
Instead of stopping your VM, I would recommend using stolonctl to perform the failover.
You can achieve this by doing the following:
- SSH into one of your VMs.
- Run the following to export the necessary env vars:
export $(cat /data/.env | xargs)
- Run the following to see your cluster status, keeper health, etc. You can also use this to obtain the keeper id of your primary:
stolonctl status
- Then run:
stolonctl failkeeper <primary-keeper-id>
Here’s some documentation on failkeeper: https://github.com/sorintlab/stolon/blob/master/doc/commands/stolonctl_failkeeper.md
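If it helps, the steps above could be wrapped into a small shell function along these lines. This is only a sketch: the function name failover_primary and the ENV_FILE variable are made up for illustration, and it assumes you have already SSH'd into one of the VMs and looked up the primary's keeper id with stolonctl status.

```shell
#!/bin/sh
# Sketch of a manual failover, meant to be run from inside one of the VMs.
# ENV_FILE and failover_primary are hypothetical names used for illustration;
# the env file path and the stolonctl subcommands come from the steps above.

ENV_FILE="${ENV_FILE:-/data/.env}"

failover_primary() {
    keeper_id="$1"
    if [ -z "$keeper_id" ]; then
        echo "usage: failover_primary <primary-keeper-id>" >&2
        return 1
    fi

    # Export the env vars stolonctl needs.
    export $(cat "$ENV_FILE" | xargs)

    # Show cluster status and keeper health before the failover.
    stolonctl status || return 1

    # Mark the primary keeper as failed so a new primary gets elected.
    stolonctl failkeeper "$keeper_id" || return 1

    # Show the status again to check which keeper is now the primary.
    stolonctl status
}
```

You would call it as failover_primary <primary-keeper-id>, using the id shown by stolonctl status. If the second status output still lists the same primary, the replica is probably not eligible to take over.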
It looks like we are not the only ones with pg issues this time. Our cluster has been throwing crazy errors; here is a sample of our logs:
2021-10-03T16:20:27.324398473Z app[92becc29] lhr [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
2021-10-03T16:20:27.543741617Z app[92becc29] lhr [info] proxy | 2021-10-03T16:20:27.542Z INFO cmd/proxy.go:419 proxy uid {"uid": "9b38836a"}
2021-10-03T16:20:27.546088913Z app[92becc29] lhr [info] sentinel | 2021-10-03T16:20:27.542Z INFO cmd/sentinel.go:2000 sentinel uid {"uid": "e4a42daf"}
2021-10-03T16:20:27.546872995Z app[92becc29] lhr [info] keeper | 2021-10-03T16:20:27.545Z INFO cmd/keeper.go:2091 exclusive lock on data dir taken
2021-10-03T16:20:27.728848917Z app[92becc29] lhr [info] exporter | INFO[0000] Established new database connection to "fdaa:0:1af2:a7b:a9a:0:3e2d:2:5433". source="postgres_exporter.go:970"
2021-10-03T16:20:28.169428456Z app[92becc29] lhr [info] checking stolon status
2021-10-03T16:20:28.729348725Z app[92becc29] lhr [info] exporter | ERRO[0001] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2021-10-03T16:20:28.862382934Z app[92becc29] lhr [info] sentinel | 2021-10-03T16:20:28.861Z INFO cmd/sentinel.go:82 Trying to acquire sentinels leadership
2021-10-03T16:20:34.846144971Z app[92becc29] lhr [info] exporter | INFO[0007] Established new database connection to "fdaa:0:1af2:a7b:a9a:0:3e2d:2:5433". source="postgres_exporter.go:970"
2021-10-03T16:20:35.846833408Z app[92becc29] lhr [info] exporter | ERRO[0008] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2021-10-03T16:20:37.352814411Z app[cc1b5664] lhr [info] proxy | 2021-10-03T16:20:37.352Z INFO cmd/proxy.go:304 check timeout timer fired
2021-10-03T16:20:42.728154485Z app[92becc29] lhr [info] exporter | INFO[0015] Established new database connection to "fdaa:0:1af2:a7b:a9a:0:3e2d:2:5433". source="postgres_exporter.go:970"
2021-10-03T16:20:43.728942388Z app[92becc29] lhr [info] exporter | ERRO[0016] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:1af2:a7b:a9a:0:3e2d:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
Our staging and production databases in ORD are both offline and have brought down our app. More information is posted in the other two threads about the pg issues.