Woke up this morning to a db outage and some pretty exciting log messages from the leader of our multi-region Postgres:
```
2022-07-13T16:21:20.108 app[be3a612f] sea [info] keeper | 2022-07-13T16:21:20.108Z INFO postgresql/postgresql.go:319 starting database
2022-07-13T16:21:20.121 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.120 UTC [844] LOG: starting PostgreSQL 13.4 (Debian 13.4-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2022-07-13T16:21:20.121 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.121 UTC [844] LOG: listening on IPv6 address "fdaa:0:22f0:a7b:2d30:0:9c2e:2", port 5433
2022-07-13T16:21:20.122 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.121 UTC [844] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2022-07-13T16:21:20.128 app[be3a612f] sea [info] [ 114.190858] EXT4-fs (vdc): Delayed block allocation failed for inode 524837 at logical offset 20640 with max blocks 2 with error 117
2022-07-13T16:21:20.128 app[be3a612f] sea [info] [ 114.192047] EXT4-fs (vdc): This should not happen!! Data will be lost
2022-07-13T16:21:20.128 app[be3a612f] sea [info] [ 114.192047]
2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.124 UTC [845] LOG: database system was interrupted while in recovery at log time 2022-07-02 08:09:08 UTC
2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.124 UTC [845] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.128 UTC [845] PANIC: could not flush dirty data: Structure needs cleaning
2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.129 UTC [844] LOG: startup process (PID 845) was terminated by signal 6: Aborted
2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.129 UTC [844] LOG: aborting startup due to startup process failure
2022-07-13T16:21:20.131 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.131 UTC [844] LOG: database system is shut down
2022-07-13T16:21:20.311 app[be3a612f] sea [info] keeper | 2022-07-13T16:21:20.308Z ERROR cmd/keeper.go:1585 failed to start postgres {"error": "postgres exited unexpectedly"}
```
Restarting or replacing the VM doesn't help, I assume because there's an issue with our volume. Anyone have ideas on how to fix this? I'm OK with losing a few hours of data (this started about 6 hours ago) if I can just get back up and running. Our replicas seem fine. Would the best strategy be to promote one of those to leader, or is there a way to repair the volume on my current leader? Also, any ideas how this might have happened in the first place?
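One more data point: "error 117" in the EXT4 line is EUCLEAN, the same "Structure needs cleaning" that Postgres panicked on, so I'm guessing the filesystem on the volume itself is corrupt and needs an fsck. This is the repair workflow I'm considering, rehearsed on a throwaway image first (the scratch path is made up by me; the real device looks like `/dev/vdc` from the kernel log):

```shell
# Rehearse the e2fsck workflow on a scratch ext4 image before
# touching the real volume. /tmp/scratch.img is a made-up path.
truncate -s 16M /tmp/scratch.img    # sparse 16 MB backing file
mkfs.ext4 -q /tmp/scratch.img       # create a fresh ext4 filesystem in it
e2fsck -fn /tmp/scratch.img         # -f: force a full check, -n: read-only dry run
echo "dry-run exit code: $?"        # 0 means clean; nonzero means errors were found

# On the actual machine, with Postgres stopped and the volume
# unmounted, I assume the equivalent would be:
#   e2fsck -fn /dev/vdc   # report-only pass
#   e2fsck -fy /dev/vdc   # actually repair (answers "yes" to fixes)
```

Is that roughly right, or is running e2fsck against a Fly volume a bad idea compared to just promoting a replica?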