Runaway pg_wal disk usage

Hi @dustinfarris—that definitely seems like too much WAL for a 100 MB database. (Based on what you wrote, it sounds like it’s continually growing?)

It’s possible that something is preventing Postgres from removing old WAL files. As long as checkpoint logging is on (the log_checkpoints setting), Postgres writes a log line at each checkpoint that includes how many WAL files it added, removed, and recycled. You can look for this in the “Monitoring” page for your app or with fly logs. E.g.:

2023-12-28 21:54:29.207 UTC [379] LOG: checkpoint complete: wrote 6 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=0.005 s; sync files=5, longest=0.001 s, average=0.001 s; distance=1 kB, estimate=1 kB

If this indicates that Postgres is adding but never removing or recycling WAL files, then you can at least be pretty confident that there’s a problem.
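If there are a lot of checkpoint lines to eyeball, a small script can add up the counters for you. This is just a rough sketch: it assumes the log lines have the same "checkpoint complete" format as the sample above, and the names (tally_wal_files, CHECKPOINT_RE) are made up for illustration.

```python
import re
import sys

# Matches the WAL counters in a PostgreSQL "checkpoint complete" log line.
CHECKPOINT_RE = re.compile(
    r"checkpoint complete:.*?"
    r"(\d+) WAL file\(s\) added, (\d+) removed, (\d+) recycled"
)


def tally_wal_files(log_lines):
    """Sum the added/removed/recycled WAL counters over all checkpoint lines."""
    totals = {"checkpoints": 0, "added": 0, "removed": 0, "recycled": 0}
    for line in log_lines:
        match = CHECKPOINT_RE.search(line)
        if match is None:
            continue  # not a checkpoint line, skip it
        added, removed, recycled = (int(group) for group in match.groups())
        totals["checkpoints"] += 1
        totals["added"] += added
        totals["removed"] += removed
        totals["recycled"] += recycled
    return totals


if __name__ == "__main__":
    # Reads saved log output from standard input and prints the totals.
    print(tally_wal_files(sys.stdin))
```

Save some log output to a file and feed it to the script on standard input. If "added" keeps climbing over many checkpoints while "removed" and "recycled" stay at 0, that matches the problem pattern described above.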

Here are some queries you could try (e.g. over fly pg connect) to look for potential issues that would cause this (I’m mostly copying from my response to a similar question here):

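  • SELECT * FROM pg_replication_slots to see if any replication slots were accidentally created that might be preventing WAL cleanup. A slot holds on to WAL from its restart_lsn onward, so one whose consumer is gone (active shows f) can keep WAL growing. In particular, if you ever made a replica that you have since destroyed, this might be it.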

  • SHOW max_wal_size to check that the soft limit for WAL size isn’t too high relative to the volume size. (Fly PG should automatically set it to 10% of the disk space.)

  • SHOW wal_keep_size to check that keeping extra WAL files is disabled (should be 0).

  • SHOW archive_command and SHOW archive_library to see if there’s any WAL archiving enabled (archive_library only exists in Postgres 15 and later). If archiving is enabled but failing, then Postgres can’t remove the WAL files that are still waiting to be archived.
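If you would rather run all of these checks in one go, something like the following could work. Treat it as a sketch under a few assumptions: you already have a DB-API 2.0 cursor from a Postgres driver (connection setup is not shown), and the names DIAGNOSTIC_QUERIES and run_diagnostics are hypothetical. It uses current_setting() in place of SHOW so that a setting missing on your Postgres version comes back as NULL rather than raising an error.

```python
# Read-only diagnostic queries, keyed by a short label.
# current_setting(name, true) returns NULL when the setting does not exist,
# e.g. archive_library on Postgres versions before 15.
DIAGNOSTIC_QUERIES = {
    "replication_slots": (
        "SELECT slot_name, slot_type, active, restart_lsn "
        "FROM pg_replication_slots"
    ),
    "max_wal_size": "SELECT current_setting('max_wal_size')",
    "wal_keep_size": "SELECT current_setting('wal_keep_size', true)",
    "archive_command": "SELECT current_setting('archive_command', true)",
    "archive_library": "SELECT current_setting('archive_library', true)",
}


def run_diagnostics(cursor):
    """Run each query on a DB-API 2.0 cursor and return {label: rows}."""
    results = {}
    for label, sql in DIAGNOSTIC_QUERIES.items():
        cursor.execute(sql)
        results[label] = cursor.fetchall()
    return results
```

With a driver such as psycopg you would pass in conn.cursor() and print the returned dict. Any rows under "replication_slots" with active set to false, a non-zero wal_keep_size, or a non-empty archive setting are the things to look at first.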

Hope this helps!