Runaway pg_wal disk usage

Hi there. I respect that postgres is unmanaged, but I could use a pointer here from the community.

I have a single-node postgres fly app that is somewhat write-heavy. The database itself is less than 100MB, but it gets updated every 15 minutes with a bunch of upserts.

The fly app has been live for just 2 days and my pg_wal is almost 3GB. I haven’t done any special configuration. It’s a 2 shared-CPU, 2048MB-memory node. I provisioned a 10GB volume, but at this rate it may not last a week.

Could someone give me a pointer on managing this WAL storage? Is it normal to have this much this soon? And is archiving an option here?

Hi @dustinfarris—that definitely seems like too much WAL for a 100 MB database. (Based on what you wrote, it sounds like it’s continually growing?)

It’s possible that something is preventing Postgres from removing old WAL files. The database will log information when it performs checkpoints, including how many WAL files it has added, removed, and recycled. You can look for this in the “Monitoring” page for your app or with fly logs. E.g.:

2023-12-28 21:54:29.207 UTC [379] LOG: checkpoint complete: wrote 6 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=0.005 s; sync files=5, longest=0.001 s, average=0.001 s; distance=1 kB, estimate=1 kB

If this indicates that Postgres is adding but never removing or recycling WAL files, then you can at least be pretty confident that there’s a problem.
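Before digging further, it’s also worth confirming that checkpoint logging is actually enabled at all (it defaults to off prior to Postgres 15), otherwise the absence of checkpoint lines doesn’t tell you much. A quick check, assuming you have superuser access over fly pg connect:

```sql
-- Is checkpoint activity being logged? (off by default before Postgres 15)
SHOW log_checkpoints;

-- If it's off, it can be enabled without a restart:
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
```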

Here are some queries you could try (e.g. over fly pg connect) to look for potential issues that would cause this (I’m mostly copying from my response to a similar question here):

  • SELECT * FROM pg_replication_slots to see if any replication slots were accidentally created that might be preventing WAL cleanup. In particular, if you ever made a replica that you have since destroyed, this might be it.

  • SHOW max_wal_size to check that the soft limit for WAL size isn’t too high relative to the volume size. (Fly PG should automatically set this to 10% of the disk space.)

  • SHOW wal_keep_size to check that keeping extra WAL files is disabled (should be 0).

  • SHOW archive_command and SHOW archive_library to see if there’s any WAL archiving enabled. If archiving is enabled but failing, then Postgres can’t remove WAL files.
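As one more sanity check alongside the above, you could measure how much WAL the server is actually holding, to compare against what you’re seeing on disk. A sketch, assuming Postgres 10+ (pg_ls_waldir requires superuser or membership in pg_monitor):

```sql
-- Count and total size of the segment files currently in pg_wal/
SELECT count(*)                  AS wal_files,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();
```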

Hope this helps!

Thanks for the reply. I do not see any logs regarding a checkpoint. All I see are these two logs repeated every 5 minutes:

2023-12-29T04:17:51Z app[9080625dc66398] sjc [info]monitor  | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2023-12-29T04:21:41Z app[9080625dc66398] sjc [info]repmgrd  | [2023-12-29 04:21:41] [INFO] monitoring primary node "fdaa:3:d862:a7b:247:2dd2:f6f6:2" (ID: 1704288533) in normal state
  • SELECT * FROM pg_replication_slots
    no results
  • SHOW max_wal_size
    I scaled my volume up from 1GB to 10GB, so this value seems fine. I’ve noticed on my other apps, however, that this value seems pretty static regardless of volume size: a vanilla fly postgres create seems to set it at 1004MB along with the default 1GB volume it creates. I’ve not observed anything automatically being set to 10% of disk space.
  • SHOW wal_keep_size
  • SHOW archive_command
  • SHOW archive_library

Since I’m not seeing any mention of a checkpoint in the logs, is it possible that checkpoints just aren’t happening?

I tried executing CHECKPOINT; myself, and it seemed to hang.
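One way to see whether the checkpointer process is alive while CHECKPOINT hangs (a sketch, assuming Postgres 10+, where pg_stat_activity exposes backend_type):

```sql
-- Is the checkpointer present, and is it stuck on a wait event?
SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type IN ('checkpointer', 'background writer', 'walwriter');
```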

@MatthewIngwersen some additional info:

postgres=# select now();
 2023-12-29 20:01:09.202012+00
(1 row)

postgres=# SELECT * FROM pg_control_checkpoint() \gx
-[ RECORD 1 ]--------+-------------------------
checkpoint_lsn       | 1/7996D2A0
redo_lsn             | 1/7996D268
redo_wal_file        | 000000010000000100000079
timeline_id          | 1
prev_timeline_id     | 1
full_page_writes     | t
next_xid             | 0:10560
next_oid             | 32911
next_multixact_id    | 1
next_multi_offset    | 0
oldest_xid           | 717
oldest_xid_dbid      | 1
oldest_active_xid    | 10560
oldest_multi_xid     | 1
oldest_multi_dbid    | 1
oldest_commit_ts_xid | 0
newest_commit_ts_xid | 0
checkpoint_time      | 2023-12-28 00:27:53+00

It’s been ~40 hours since the last checkpoint was performed.
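The WAL backlog since that stale checkpoint can also be quantified directly, using the redo_lsn from the pg_control_checkpoint() output above (assuming pg_wal_lsn_diff, available since Postgres 10):

```sql
-- Distance between the current WAL insert point and the last checkpoint's redo LSN
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), '1/7996D268'::pg_lsn)
       ) AS wal_since_checkpoint;
```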

SOLVED! Restarting the postgres server has resolved this issue:

fly pg restart

The restart forced a checkpoint, which immediately cleared 6GB of WAL, and postgres is now executing a checkpoint every 5 minutes as expected:

2023-12-29 20:20:29.027 UTC [377] LOG:  checkpoint starting: time
2023-12-29 20:24:59.064 UTC [377] LOG:  checkpoint complete: wrote 5243 buffers (8.0%); 0 WAL file(s) added, 2 removed, 0 recycled; write=270.032 s, sync=0.001 s, total=270.038 s; sync files=21, longest=0.001 s, average=0.001 s; distance=44688 kB, estimate=56845 kB
2023-12-29 20:25:29.094 UTC [377] LOG:  checkpoint starting: time
2023-12-29 20:25:29.097 UTC [377] LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.003 s; sync files=1, longest=0.001 s, average=0.001 s; distance=2 kB, estimate=51161 kB


Not solved :frowning:

I woke up this morning and checkpoints had stopped working. I saw this in the logs:

2023-12-30 00:45:32.538 UTC [377] LOG: checkpoint starting: time
2023-12-30 00:45:37.336 UTC [17664] ERROR: canceling statement due to user request

and then nothing after that — no further checkpoints.

Also, the server crashed after running out of memory. Possibly related. Not sure why my 2GB-memory postgres server would run out of memory on a database smaller than 100MB.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.