Postgres "Delayed block allocation failed"

Woke up this morning to a db outage and some pretty exciting log messages from the leader of our multi-region Postgres:

2022-07-13T16:21:20.108 app[be3a612f] sea [info] keeper | 2022-07-13T16:21:20.108Z INFO postgresql/postgresql.go:319 starting database

2022-07-13T16:21:20.121 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.120 UTC [844] LOG: starting PostgreSQL 13.4 (Debian 13.4-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

2022-07-13T16:21:20.121 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.121 UTC [844] LOG: listening on IPv6 address "fdaa:0:22f0:a7b:2d30:0:9c2e:2", port 5433

2022-07-13T16:21:20.122 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.121 UTC [844] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"

2022-07-13T16:21:20.128 app[be3a612f] sea [info] [ 114.190858] EXT4-fs (vdc): Delayed block allocation failed for inode 524837 at logical offset 20640 with max blocks 2 with error 117

2022-07-13T16:21:20.128 app[be3a612f] sea [info] [ 114.192047] EXT4-fs (vdc): This should not happen!! Data will be lost

2022-07-13T16:21:20.128 app[be3a612f] sea [info] [ 114.192047]

2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.124 UTC [845] LOG: database system was interrupted while in recovery at log time 2022-07-02 08:09:08 UTC

2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.124 UTC [845] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.128 UTC [845] PANIC: could not flush dirty data: Structure needs cleaning

2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.129 UTC [844] LOG: startup process (PID 845) was terminated by signal 6: Aborted

2022-07-13T16:21:20.129 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.129 UTC [844] LOG: aborting startup due to startup process failure

2022-07-13T16:21:20.131 app[be3a612f] sea [info] keeper | 2022-07-13 16:21:20.131 UTC [844] LOG: database system is shut down

2022-07-13T16:21:20.311 app[be3a612f] sea [info] keeper | 2022-07-13T16:21:20.308Z ERROR cmd/keeper.go:1585 failed to start postgres {"error": "postgres exited unexpectedly"}

Restarting/replacing the VM doesn’t help, I assume because there’s an issue with our volume. Anyone have ideas on how to fix this? I’m ok losing a few hours of data (this started 6hrs ago) if I can just get back up and running. Our replicas seem ok - would the best strategy be to just make one of those the leader? Or is there a way I can fix the volume for my leader? Also, any ideas how this might have happened in the first place?

Ah, I think my volume is full :grimacing:

Hey, we have a feature for this now. Give it a try: Volume expansion is now available!
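For reference, extending looks roughly like this from the CLI. The app name and volume ID below are placeholders, and exact subcommands/flags can differ between flyctl versions, so check fly volumes --help first:

    # Find the volume attached to the broken leader (app name is a placeholder)
    fly volumes list -a my-postgres-app

    # Grow it in place; --size is in GB
    fly volumes extend <volume-id> --size 20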


Amazing feature, thank you. Unfortunately that revealed that disk space wasn’t the issue - or that running out of disk space caused an issue that expanding the volume on its own doesn’t fix.

Hrm, you’re right. Looking closer at the error, it seems like you might have a bad block on your disk. We’ll look into it now.

Does this DB have a replica? If so, the simplest thing for you is going to be to delete the volume, then create a new one. We keep deleted volumes around for 24 hours, so we’ll still be able to investigate.
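Roughly, that flow looks like this. The app name is a placeholder, pg_data is the volume name Fly Postgres normally expects, and the exact subcommand names can vary by flyctl version (on some versions delete is called destroy), so double-check with fly volumes --help:

    # Find the failed leader's volume ID
    fly volumes list -a my-postgres-app

    # Remove it (deleted volumes are kept around for 24 hours on our side)
    fly volumes delete <volume-id>

    # Create a fresh volume in the same region for the replacement VM to attach to
    fly volumes create pg_data --region sea --size 10 -a my-postgres-app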

I confirmed that the problem started with our original leader running out of disk space, or at least with it using significantly more space than any of the replicas: the leader filled up a 10GB volume while the replicas were only using 1.3GB. There was very little replication lag right up until the leader failed, and the replicas weren’t missing any significant amount of data, so something other than our DB itself seems to have filled up the leader’s volume. Extending the leader’s volume didn’t help.

We’re unsure whether running out of space caused a problem that persisted on disk, or whether our frantic restarting/replacing of the VM did. We were able to get things running again by making one of the replicas the new leader and deleting the original leader’s volume. In hindsight we definitely should have held on to the original volume for inspection, but we were focused entirely on stemming the flood of errors from our clients and avoiding problems on any other volumes.

We can look at the volume!

The disk space is likely getting used by the write-ahead log (WAL). It’s basically a transaction history the replicas use to stay up to date. This is a problem on active DBs with smallish volumes, and we shipped the volume extend feature as a first step towards fixing it. The real fix here is to let volumes auto-expand when there’s disk pressure.
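If you want to see how much space the WAL is holding on a live leader, something like this works (run it with a superuser or pg_monitor connection; the connection string here is a placeholder):

    # Total WAL currently on disk
    psql "$DATABASE_URL" -c "SELECT pg_size_pretty(sum(size)) AS wal_on_disk FROM pg_ls_waldir();"

    # How much WAL each replication slot is forcing the leader to retain
    psql "$DATABASE_URL" -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"

If a replica disappears but its slot stops advancing, that retained_wal number can keep growing until the disk fills.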

For now, your best bet is to use much larger volumes. I’d set them at like 50GB each just to be safe.

We went overboard and are at 100GB now :wink:

Great, the failing volume was vol_mjn924oqmzxv03lq. Thanks!
