Postgres WAL Overflow Without Replication Volumes

So, to save a little bit of text: I have the same issue described here.

The difference is that while that user has 3 replication instances running, I have none… but Postgres is still creating a never-ending pile of pg_wal files, eventually maxing out the disk.

I’ve just set wal_level to minimal and tripled the volume size to extend the time I have until the machine chokes up again.
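For reference, the change amounts to something like this in postgresql.conf (I’m sketching the stock Postgres settings here, not however Fly/Stolon actually applies them; note that Postgres won’t start with wal_level = minimal unless max_wal_senders is also 0):

wal_level = minimal
max_wal_senders = 0   # required: the server refuses to start with minimal WAL and senders > 0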

Any thoughts as to:

1. Why is WAL generation happening so aggressively?
2. Why is no cleanup occurring?

From my understanding, without a replication instance Postgres shouldn’t be aggressively creating these files. Maybe I’ve somehow triggered it into thinking there are replication machines in the cluster?

Hi @TheFlyingCoder. This is an interesting problem. If you’d be willing to share, I’d be curious to know what version of Fly PG you’re running (fly image show), how large your volume was, and how much was being taken up by WAL.

I’m not a Postgres expert, but AFAIK, the database might produce a significant amount of WAL files even without replication. Since they keep a record of changes before they’re flushed to Postgres’s main table storage, an instance with a write-heavy workload could still create a lot of them even without replication enabled.
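If you want to quantify the rate, one rough approach (standard Postgres functions only, nothing Fly-specific) is to sample the current WAL write position twice and diff the two values; the LSN below is just a made-up placeholder for whatever the first query returns:

-- First sample of the current WAL write position:
SELECT pg_current_wal_lsn();
-- Some time later, diff against the value returned above ('0/1A2B3C4D' is an example):
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/1A2B3C4D') AS wal_bytes_generated;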

A few queries that I can think of to run with fly pg connect (some of which you may have already tried):

  • SELECT * FROM pg_replication_slots to see if any replication slots were accidentally created that might be preventing WAL cleanup.
  • SHOW max_wal_size to check that the soft limit for WAL size isn’t too high relative to the volume size. (The most current version of Fly PG configures this to be 10% of the disk size when the database is created.)
  • SHOW wal_keep_size to check that keeping extra WAL files is disabled (should be 0).
  • SHOW archive_command and SHOW archive_library to see if there’s any WAL archiving enabled that might be preventing WAL cleanup. (Not sure how this would have been turned on, but it’s quick to check anyway.)

(Several of these are described here in the official Postgres docs.)
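One more that might be worth running while you’re in there: you can check how much space WAL is actually consuming from inside the database with pg_ls_waldir() (it needs superuser or the pg_monitor role, which the fly pg connect session should have):

SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS total_wal
FROM pg_ls_waldir();   -- lists the files in pg_wal, one row per segment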

Happy to share everything if it helps solve the issue here!

I’d be curious to know what version of Fly PG you’re running

flyio/postgres 14.6

the database might produce a significant amount of WAL files even without replication

I’d call 5GB of WAL files overnight a little more than significant, for a pretty low-use DB :eyes:

A few queries that I can think of to run with fly pg connect

I’ve actually not managed to run any of the queries… I must be brain-dead and missing a command, because psql doesn’t respond with anything and just immediately breaks to a new line (each line was run separately, one after the other):

postgres=# SELECT * FROM pg_replication_slots
postgres-# SHOW max_wal_size
postgres-# SHOW wal_keep_size
postgres-# SHOW archive_command and SHOW archive_library

Oh, sorry about that! The SQL prompt needs semicolons for each statement:

postgres=# SELECT * FROM pg_replication_slots;
postgres=# SHOW max_wal_size;
postgres=# SHOW wal_keep_size;
postgres=# SHOW archive_command;
postgres=# SHOW archive_library;

:roll_eyes: Gosh I should have seen that haha

Formatting sucks for the first one, but I’ll drop each response:

SELECT * FROM pg_replication_slots;

    slot_name    | plugin | slot_type | datoid | database | temporary | active |
 active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_stat
us | safe_wal_size | two_phase
-----------------+--------+-----------+--------+----------+-----------+--------+
------------+------+--------------+-------------+---------------------+---------
---+---------------+-----------
 stolon_aad02ae2 |        | physical  |        |          | f         | f      |
            |      |              | 9B/B6000060 |                     | extended
   |               | f

SHOW max_wal_size;

 max_wal_size
--------------
 97MB

SHOW wal_keep_size;

 wal_keep_size
---------------
 0

SHOW archive_command;

                         archive_command
------------------------------------------------------------------
 if [ $ENABLE_WALG ]; then /usr/local/bin/wal-g wal-push "%p"; fi
(1 row)

SHOW archive_library;

ERROR:  unrecognized configuration parameter "archive_library"

Thanks @TheFlyingCoder, this is very helpful! (And don’t worry about the archive_library error: that parameter was only added in Postgres 15, so on 14.6 there’s simply nothing to check there.)

Here’s a cleaned-up version of the replication slot output:

    slot_name    | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 stolon_aad02ae2 |        | physical  |        |          | f         | f      |            |      |              | 9B/B6000060 |                     | extended   |               | f

The version of Fly PG that you’re using includes Stolon to manage replication and high availability. It looks like it created a replication slot for some reason, so PG is retaining WAL for a replica that doesn’t exist. (If you created more than one PG instance in this app and subsequently deleted all but one, this might be how it was created.)
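Your wal_status of extended fits this picture: per the Postgres docs, it means WAL beyond max_wal_size is being retained, here because of the slot. You can estimate how much the slot is pinning by diffing the current WAL position against its restart_lsn:

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;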

Below I’ve added a few suggestions to address this, but remember that this isn’t managed Postgres, so it’s ultimately up to you to decide what to do and to handle any issues that arise.

Before doing anything, I’d suggest verifying (e.g. with fly status) that there really are no replicas in your app, and making sure that you have a recent database backup available just in case.

Then, in your logs, do you see “no keeper info available” messages appearing frequently? If so, Stolon still thinks that a replica exists. You might try following the instructions in this thread (just be very careful when identifying the correct keeper to remove!):

There’s additional info about this here:

If you don’t see any “no keeper info available” messages, it may be enough to delete the replication slot by running this SQL query via fly pg connect:

SELECT pg_drop_replication_slot('stolon_aad02ae2');
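Note that dropping the slot doesn’t free disk space instantly; old segments are recycled or removed at the next checkpoint. A quick way to confirm it worked (CHECKPOINT and pg_ls_waldir() both need superuser):

SELECT * FROM pg_replication_slots;   -- should now return zero rows
CHECKPOINT;                           -- force a checkpoint so old WAL segments can be recycled
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();   -- total WAL on disk should shrink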

Hope this helps!


This was EXACTLY the issue!

I’m surprised that topic didn’t come up in any of my searches :eyes:

You’re right: I’m pretty sure a new Postgres machine was initialised by accident and then deleted. I do have a question, though… I thought you had to configure a new instance of Postgres to act as a replica? i.e. attach volumes, handle config, etc.

I’m about to follow the steps now; I’ll let you know how it goes!

Successfully cleared the failed keeper.

My experience suggests that the issue still exists today in v0.0.41, despite this note:

We discovered an issue with Stolon that was preventing failed keepers from getting cleaned up as expected. This issue has been patched and addressed with release v0.0.29.

deadKeeperRemovalInterval

Regardless, I’ll keep an eye on my disk usage for a few days and scale the volume back down to the right size if it doesn’t jump back up again!

Thanks for the help @MatthewIngwersen!

