I was performing some maintenance on my DB and one of the queries I ran caused an OOM crash (it's only running on a 256MB node). Since then something (possibly stolon?) seems to have got into a bad state and I am unable to scale up to 2 healthy replicas.
At one point I had to scale down to 0, set FLY_RESTORED_FROM, and scale back up to 1 replica. That worked, but when I scaled back up to 2 replicas the second one continued to error.
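For reference, this is roughly the sequence I ran (from memory, so the exact FLY_RESTORED_FROM value below is a placeholder rather than what I actually set):

fly scale count 0 --app orchestrate-db
# placeholder value; I just needed it to change so stolon would re-initialise
fly secrets set FLY_RESTORED_FROM="$(date +%s)" --app orchestrate-db
fly scale count 1 --app orchestrate-db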
Where I'm at now: I have a single replica running (and working). However, there are continual errors in the console:
2023-05-04T00:12:24.475 app[8618679d] syd [info] sentinel | 2023-05-04T00:12:24.475Z WARN cmd/sentinel.go:276 no keeper info available {"db": "9814a974", "keeper": "298408a112"}
2023-05-04T00:12:34.964 app[8618679d] syd [info] sentinel | 2023-05-04T00:12:34.964Z WARN cmd/sentinel.go:276 no keeper info available {"db": "9814a974", "keeper": "298408a112"}
2023-05-04T00:12:40.279 app[8618679d] syd [info] sentinel | 2023-05-04T00:12:40.278Z WARN cmd/sentinel.go:276 no keeper info available {"db": "9814a974", "keeper": "298408a112"}
When I scale up to a second replica, it gets these continual errors and its health checks never pass (with various repetitions/ordering in the logs):
2023-05-04T00:35:54.206 app[1a17d546] syd [info] checking stolon status
2023-05-04T00:35:54.487 app[1a17d546] syd [info] keeper | 2023-05-04T00:35:54.486Z ERROR cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}
2023-05-04T00:35:55.086 app[1a17d546] syd [info] keeper | 2023-05-04T00:35:55.085Z ERROR cmd/keeper.go:1572 cannot move from master role to standby role
➜ fly checks list -a orchestrate-db
Health Checks for orchestrate-db
NAME | STATUS | ALLOCATION | REGION | TYPE | LAST UPDATED | OUTPUT
pg | critical | d4d71072 | syd | HTTP | 1s ago | HTTP GET http://172.19.64.194:5500/flycheck/pg: 500 Internal Server Error Output: "failed to connect with local node: context deadline exceeded"
role | critical | d4d71072 | syd | HTTP | 10s ago | failed to connect to local node: context deadline exceeded
vm | passing | d4d71072 | syd | HTTP | 10s ago | HTTP GET http://172.19.64.194:5500/flycheck/vm: 200 OK Output: "checkDisk: 8.9 GB (91.0%) free space on /data/ (75.05µs)\ncheckLoad: load averages: 0.00 0.00 0.00 (83.64µs)\nmemory: system spent 0s of the last 60s waiting on memory (56.94µs)\ncpu: system spent 276ms of the last 60s waiting on cpu (33.62µs)\nio: system spent 234ms of the last 60s waiting on io (22.66µs)"
vm | passing | 8618679d | syd | HTTP | 3m36s ago | HTTP GET http://172.19.64.146:5500/flycheck/vm: 200 OK Output: "checkDisk: 8.58 GB (87.7%) free space on /data/ (48.21µs)\ncheckLoad: load averages: 0.00 0.01 0.01 (52.31µs)\nmemory: system spent 0s of the last 60s waiting on memory (35.81µs)\ncpu: system spent 858ms of the last 60s waiting on cpu (16.14µs)\nio: system spent 0s of the last 60s waiting on io (13.78µs)"
pg | passing | 8618679d | syd | HTTP | 4m1s ago | HTTP GET http://172.19.64.146:5500/flycheck/pg: 200 OK Output: "transactions: read/write (258.42µs)\nconnections: 12 used, 3 reserved, 300 max (5.12ms)"
role | passing | 8618679d | syd | HTTP | 10h29m ago | leader
4f32e2c6c13e374db13c33d9ee3f6781 | passing | 891f1d94 | syd | TCP | 19m55s ago | TCP connect 172.19.71.138:5000: Success
4f32e2c6c13e374db13c33d9ee3f6781 | passing | b066696d | syd | TCP | 16h38m ago | TCP connect 172.19.74.146:5000: Success
I get the feeling that the data on the volume got corrupted or out of sync during the crash, and the replica cannot reconcile with the new master because of it. I guess I could delete the whole volume, recreate it, and scale up again (sketched below), but it would be nice to understand what's going on before doing that.
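If it does come to that, I assume the recreate would look roughly like this (the volume ID is a placeholder, and I'm assuming the volume is named pg_data as it is for a standard Fly Postgres app):

# find the volume attached to the broken replica
fly volumes list --app orchestrate-db
# destroy it and create a fresh one in the same region
fly volumes destroy <volume-id> --app orchestrate-db
fly volumes create pg_data --region syd --size 10 --app orchestrate-db
# then scale back up and let the new replica re-sync from the leader
fly scale count 2 --app orchestrate-db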
Also of note: this is using an old Postgres image version (v0.0.26). I would like to update it, but want to get into a good state first! I'd also like to migrate to Apps V2 at some point, but again, want things in a good state before considering that!
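For the image update, once things are healthy, my understanding is it would just be something like the following, though I haven't tried it on an image this old:

fly image show --app orchestrate-db
fly image update --app orchestrate-db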
Thanks for any help!