Postgres Cluster Crashed and can't recover

Last night my 2-node Postgres cluster crashed, and it has not recovered since. I checked with the status command and found that the secondary node is missing and the primary node has 2 critical checks.

From the logs:

2022-04-30T08:25:09.247 app[9991bb75] fra [info] exporter | ERRO[0022] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:565c:a7b:23c6:0:cfaf:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:565c:a7b:23c6:0:cfaf:2]:5433: connect: connection refused  source="postgres_exporter.go:1658"

2022-04-30T08:25:09.554 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:09.554Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8ebc0edc", "keeper": "23c40cfb12"}

2022-04-30T08:25:14.730 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:14.729Z WARN cmd/sentinel.go:276 no keeper info available {"db": "8ebc0edc", "keeper": "23c40cfb12"}

2022-04-30T08:25:14.731 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:14.731Z INFO cmd/sentinel.go:995 master db is failed {"db": "8ebc0edc", "keeper": "23c40cfb12"}

2022-04-30T08:25:14.731 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:14.731Z INFO cmd/sentinel.go:1006 trying to find a new master to replace failed master

2022-04-30T08:25:14.731 app[9991bb75] fra [info] sentinel | 2022-04-30T08:25:14.731Z ERROR cmd/sentinel.go:1009 no eligible masters

Health checks:

  role | critical | 9991bb75   | fra    | HTTP | 16m3s ago    | failed to connect to local node: context deadline exceeded[✓]
       |          |            |        |      |              |
       |          |            |        |      |              |
  pg   | critical | 9991bb75   | fra    | HTTP | 16m6s ago    | HTTP GET http://172.19.1.74:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect to proxy: context deadline exceeded"[✓]

What is the best way to recover the cluster? I tried resizing it (including down to 0 first), but I can't get the cluster up again.

Thanks

I’m looking at this! Give me a bit and I’ll let you know.

It looks like there was a new volume from earlier today. Did you add that as part of troubleshooting? I removed it temporarily since it wasn’t valid data for the cluster.

It looks like this database outgrew its disk. I resized the disk to 20GB, but it appears to have corrupted the data files. We’re going to see if we can recover from that, give us a bit.

Yes, I added it as part of troubleshooting, while trying to spin up a new VM.

Thanks a ton for your help, I’m really in love with Fly.
It would be amazing if we could get notifications when a volume is almost full, or see its usage in the dashboard. I wasn’t aware of this!
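
In the meantime, a small check run on the VM itself could approximate such a warning. This is only a minimal sketch, not anything Fly provides: the function names, the `/data` mount path, and the 80% threshold are all assumptions for illustration, so adjust them to match your own volume.

```python
import shutil
from typing import Tuple


def volume_usage_percent(path: str) -> float:
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def check_volume(path: str, warn_at: float = 80.0) -> Tuple[bool, str]:
    """Return (almost_full, message) for the volume mounted at `path`.

    `warn_at` is a hypothetical threshold in percent; 80 is an arbitrary default.
    """
    percent = volume_usage_percent(path)
    almost_full = percent >= warn_at
    state = "WARNING" if almost_full else "OK"
    return almost_full, f"{state}: {path} is {percent:.1f}% full"


# Example (assumed mount path, not confirmed anywhere in this thread):
#   almost_full, message = check_volume("/data")
#   print(message)
```

Run on a schedule, the returned flag could drive whatever alerting you already use, for example by sending the message to a chat webhook or exiting non-zero from a cron job.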