I was performing some maintenance on my DB and one of the queries I ran caused an OOM crash (it's only running on a 256MB node). Since then something (possibly stolon?) seems to have got into a bad state and I am unable to scale up to 2 healthy replicas.
At one point I had to scale down to 0, set FLY_RESTORED_FROM, and scale back up to 1 replica. That worked, but when I scaled back up to 2 replicas the second one continued to error.
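For reference, this is roughly the sequence I ran (from memory, so the exact FLY_RESTORED_FROM value below is a placeholder rather than what I actually set):

fly scale count 0 --app orchestrate-db
# placeholder value; I just needed it to change so stolon would re-initialise
fly secrets set FLY_RESTORED_FROM="$(date +%s)" --app orchestrate-db
fly scale count 1 --app orchestrate-db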
Where I'm at now: I have a single replica running (and working). However, there are continual errors in the console:
2023-05-04T00:12:24.475 app[8618679d] syd [info] sentinel | 2023-05-04T00:12:24.475Z WARN cmd/sentinel.go:276 no keeper info available {"db": "9814a974", "keeper": "298408a112"}
2023-05-04T00:12:34.964 app[8618679d] syd [info] sentinel | 2023-05-04T00:12:34.964Z WARN cmd/sentinel.go:276 no keeper info available {"db": "9814a974", "keeper": "298408a112"}
2023-05-04T00:12:40.279 app[8618679d] syd [info] sentinel | 2023-05-04T00:12:40.278Z WARN cmd/sentinel.go:276 no keeper info available {"db": "9814a974", "keeper": "298408a112"}
When I scale up to a second replica, it gets these continual errors and its health checks never pass (with various repetitions/ordering in the logs):
2023-05-04T00:35:54.206 app[1a17d546] syd [info] checking stolon status
2023-05-04T00:35:54.487 app[1a17d546] syd [info] keeper | 2023-05-04T00:35:54.486Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2023-05-04T00:35:55.086 app[1a17d546] syd [info] keeper | 2023-05-04T00:35:55.085Z ERROR cmd/keeper.go:1572 cannot move from master role to standby role
➜ fly checks list -a orchestrate-db
Health Checks for orchestrate-db
NAME | STATUS | ALLOCATION | REGION | TYPE | LAST UPDATED | OUTPUT
pg | critical | d4d71072 | syd | HTTP | 1s ago | HTTP GET http://172.19.64.194:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect with local node: context deadline exceeded"
role | critical | d4d71072 | syd | HTTP | 10s ago | failed to connect to local node: context deadline exceeded
vm | passing | d4d71072 | syd | HTTP | 10s ago | HTTP GET http://172.19.64.194:5500/flycheck/vm: 200 OK Output: "checkDisk: 8.9 GB (91.0%) free space on /data/ (75.05µs)\ncheckLoad: load averages: 0.00 0.00 0.00 (83.64µs)\nmemory: system spent 0s of the last 60s waiting on memory (56.94µs)\ncpu: system spent 276ms of the last 60s waiting on cpu (33.62µs)\nio: system spent 234ms of the last 60s waiting on io (22.66µs)"
vm | passing | 8618679d | syd | HTTP | 3m36s ago | HTTP GET http://172.19.64.146:5500/flycheck/vm: 200 OK Output: "checkDisk: 8.58 GB (87.7%) free space on /data/ (48.21µs)\ncheckLoad: load averages: 0.00 0.01 0.01 (52.31µs)\nmemory: system spent 0s of the last 60s waiting on memory (35.81µs)\ncpu: system spent 858ms of the last 60s waiting on cpu (16.14µs)\nio: system spent 0s of the last 60s waiting on io (13.78µs)"
pg | passing | 8618679d | syd | HTTP | 4m1s ago | HTTP GET http://172.19.64.146:5500/flycheck/pg: 200 OK Output: "transactions: read/write (258.42µs)\nconnections: 12 used, 3 reserved, 300 max (5.12ms)"
role | passing | 8618679d | syd | HTTP | 10h29m ago | leader
4f32e2c6c13e374db13c33d9ee3f6781 | passing | 891f1d94 | syd | TCP | 19m55s ago | TCP connect 172.19.71.138:5000: Success
4f32e2c6c13e374db13c33d9ee3f6781 | passing | b066696d | syd | TCP | 16h38m ago | TCP connect 172.19.74.146:5000: Success
I get the feeling that the data on the volume got corrupted or out of sync during the crash, and the replica cannot reconcile with the new master because of it. I guess I could delete the whole volume, recreate it, and scale up again (sketched below), but it would be nice to understand what's going on before doing that.
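If it does come to that, I assume the recreate would look roughly like this (the volume ID is a placeholder, and I'm assuming the volume is named pg_data as it is for a standard Fly Postgres app):

# find the volume attached to the broken replica
fly volumes list --app orchestrate-db
# destroy it and create a fresh one in the same region
fly volumes destroy <volume-id> --app orchestrate-db
fly volumes create pg_data --region syd --size 10 --app orchestrate-db
# then scale back up and let the new replica re-sync from the leader
fly scale count 2 --app orchestrate-db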
Also of note: this is using an old Postgres image version (v0.0.26). I would like to update it, but want to get into a good state first! I'd also like to migrate to Apps V2 at some point, but again, want things in a good state before considering that!
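For the image update, once things are healthy, my understanding is it would just be something like the following, though I haven't tried it on an image this old:

fly image show --app orchestrate-db
fly image update --app orchestrate-db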
Thanks for any help!