Hi there. I understand that Postgres on Fly is unmanaged, but I could use a pointer from the community here.
I have a single-node postgres fly app that is somewhat write-heavy. The database itself is less than 100MB, but it gets updated every 15 minutes with a bunch of upserts.
The Fly app has been live for just two days and my pg_wal directory is already almost 3GB. I haven’t done any special configuration. It’s a node with 2 shared CPUs and 2048MB of memory. I provisioned a 10GB volume, but at this rate it may not last a week.
Could someone give me a pointer on managing this WAL storage? Is it normal to have this much this soon? And is archiving an option here?
Hi @dustinfarris—that definitely seems like too much WAL for a 100 MB database. (Based on what you wrote, it sounds like it’s continually growing?)
It’s possible that something is preventing Postgres from removing old WAL files. The database will log information when it performs checkpoints, including how many WAL files it has added, removed, and recycled. You can look for this in the “Monitoring” page for your app or with fly logs. E.g.:
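(Illustrative only; the exact numbers will differ, but a healthy checkpoint in recent Postgres versions logs something along these lines, with WAL files being removed or recycled rather than only added:)

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 412 buffers (2.5%); 0 WAL file(s) added, 1 removed, 3 recycled; write=24.3 s, sync=0.1 s, total=24.5 s; sync files=58, longest=0.1 s, average=0.1 s; distance=4096 kB, estimate=4096 kB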
If this indicates that Postgres is adding but never removing or recycling WAL files, then you can at least be pretty confident that there’s a problem.
Here are some queries you could try (e.g. over fly pg connect) to look for potential issues that would cause this (I’m mostly copying from my response to a similar question here):
SELECT * FROM pg_replication_slots to see if any replication slots were accidentally created that might be preventing WAL cleanup. In particular, if you ever made a replica that you have since destroyed, this might be it.
SHOW max_wal_size to check that the soft limit for WAL size isn’t too high relative to the volume size. (Fly PG should automatically set this to 10% of the disk space.)
SHOW wal_keep_size to check that keeping extra WAL files is disabled (should be 0).
SHOW archive_command and SHOW archive_library to see if there’s any WAL archiving enabled. If archiving is enabled but failing, then Postgres can’t remove WAL files.
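If it’s easier, here is the whole set collected into one pasteable block for a psql session over fly pg connect (the slot columns are the standard ones from pg_replication_slots):

SELECT slot_name, slot_type, active, restart_lsn, wal_status FROM pg_replication_slots;
SHOW max_wal_size;
SHOW wal_keep_size;
SHOW archive_command;
SHOW archive_library;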
Regarding max_wal_size: I scaled up my volume from 1GB to 10GB, so this value seems fine. I’ve noticed on my other apps, however, that this value seems pretty static regardless of the volume size. A vanilla fly postgres create seems to set it to 1004MB along with the default 1GB volume it creates alongside it; I’ve not observed anything being automatically set to 10% of disk space.
SHOW wal_keep_size returns 0.
SHOW archive_command returns (disabled).
SHOW archive_library returns nothing (empty).
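Side note: the actual on-disk WAL usage can also be checked from inside Postgres with the standard pg_ls_waldir() admin function:

SELECT count(*) AS wal_files, pg_size_pretty(sum(size)) AS total_size FROM pg_ls_waldir();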
Since I’m not seeing any mention of a checkpoint in the logs, is it possible that checkpoints just aren’t happening?
I tried executing CHECKPOINT; myself, and it seemed to hang.
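Two checks that should show whether the checkpointer is making any progress at all (these are standard Postgres system views, nothing Fly-specific):

SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;
SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE backend_type = 'checkpointer';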
I woke up this morning and checkpoints had stopped working. I saw this in the logs:
2023-12-30 00:45:32.538 UTC [377] LOG: checkpoint starting: time
2023-12-30 00:45:37.336 UTC [17664] ERROR: canceling statement due to user request
and then nothing after that — no further checkpoints.
Also, the server crashed after running out of memory, which is possibly related. I’m not sure why a Postgres server with 2GB of memory would run out of memory on a database that’s less than 100MB.
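For reference, the memory-related settings that would factor into this are all visible with plain SHOW commands:

SHOW shared_buffers;
SHOW work_mem;
SHOW maintenance_work_mem;
SHOW max_connections;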