Last night my 2-node Postgres cluster crashed for some reason and has not recovered since. The status command shows that the secondary node is missing and the primary node has 2 critical health checks.
From the logs:
2022-04-30T08:25:09.247 app[9991bb75] fra [info] exporter | ERRO[0022] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:565c:a7b:23c6:0:cfaf:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:565c:a7b:23c6:0:cfaf:2]:5433: connect: connection refused source="postgres_exporter.go:1658"
2022-04-30T08:25:09.554 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:09.554Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8ebc0edc", "keeper": "23c40cfb12"}
2022-04-30T08:25:14.730 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:14.729Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8ebc0edc", "keeper": "23c40cfb12"}
2022-04-30T08:25:14.731 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:14.731Z INFO cmd/sentinel.go:995 master db is failed {"db": "8ebc0edc", "keeper": "23c40cfb12"}
2022-04-30T08:25:14.731 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:14.731Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master
2022-04-30T08:25:14.731 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:14.731Z ERROR cmd/sentinel.go:1009 no eligible masters
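In case it helps with triage, here is a minimal sketch for condensing a longer log dump into the distinct sentinel messages and how often each occurs. It assumes every sentinel line follows the format shown above (timestamp, level, source file and line, message, optional JSON payload). The names `parse_sentinel_line` and `summarize` are hypothetical helpers, not part of any Fly.io or Stolon tooling.

```python
import re

# Assumed format, taken from the log lines above:
#   "... sentinel | <timestamp> <LEVEL> <file>:<line> <message> [{json payload}]"
SENTINEL_RE = re.compile(
    r"sentinel \| (?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<src>\S+:\d+)\s+(?P<msg>.*)$"
)


def parse_sentinel_line(line):
    """Return a dict with ts, level, src and msg, or None for non-sentinel lines."""
    match = SENTINEL_RE.search(line)
    if match is None:
        return None
    return match.groupdict()


def summarize(lines):
    """Count sentinel messages per (level, message), ignoring the JSON payload."""
    counts = {}
    for line in lines:
        record = parse_sentinel_line(line)
        if record is None:
            continue  # e.g. exporter lines, which use a different format
        # Drop the trailing JSON payload so repeated messages group together.
        message = record["msg"].split(" {", 1)[0].strip()
        key = (record["level"], message)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Passing the saved log lines to `summarize` returns a dictionary keyed by `(level, message)`, which makes it easier to see the order of magnitude of each warning without reading every line.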
Health checks:
role | critical | 9991bb75 | fra | HTTP | 16m3s ago | failed to connect to local node: context deadline exceeded[✓]
pg | critical | 9991bb75 | fra | HTTP | 16m6s ago | HTTP GET http://172.19.1.74:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"[✓]
What is the best way to recover the cluster? I tried resizing it (including scaling to 0 first), but nothing brings the cluster back up.
Thanks