I’ve got a Postgres riddle for the gurus out there: we have a cluster that recently got stuck in a read-only state for an unknown reason. When I sign into the primary node (via port :5432) with the postgres superuser and try to update a table, create a user, alter a role, etc., I get “cannot execute X in a read-only transaction”.
fly logs look normal and error-free except for a bunch of “cannot execute X in a read-only transaction” errors triggered by a connected app trying to use the database (also using the postgres superuser)
SELECT pg_is_in_recovery(); returns false (but also logs a “cannot execute UPDATE in a read-only transaction” error)
Read-only queries all succeed
Volumes are at ~10% capacity
A single-machine database restored from this one (using the new Tigris-based PITR feature) exhibits the exact same symptoms
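In case it helps, here’s roughly what a failed write looks like from psql on the primary (readonly_test is just a throwaway table name for illustration):

  -- Connected to the primary on :5432 as the postgres superuser
  SELECT pg_is_in_recovery();           -- returns false
  CREATE TABLE readonly_test (id int);
  -- ERROR:  cannot execute CREATE TABLE in a read-only transaction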
Progress: we traced the cause of the problem back to a query to reindex the database. This temporarily put our primary volume over 90% capacity:
Health check for your postgres database has failed. Your database is malfunctioning. [✗] disk-capacity: 90.1% - readonly mode enabled, extend your volume to re-enable writes (173.05ms)
Setting aside the root cause (how did reindexing a small, super low traffic database get us from ~15% utilization of a 10GB volume to > 90%!?), we’re still not sure how to get the db back into a writeable state. Extending volumes per the log message didn’t help.
The re-indexing was correlation, not causation. In the same timeframe we disabled barman and enabled the new PITR. Something went wrong with this, causing disk space on our primary to steadily fill up over a period of a couple days. It appears that this space was not used by pg directly, as our actual database size didn’t change. We still haven’t root-caused this.
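A quick way to see whether the space is going into the database itself or somewhere else under the data directory, such as pg_wal, is to compare the two from psql (pg_ls_waldir() requires superuser or membership in pg_monitor):

  -- Logical size of the current database
  SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size;

  -- Total space held by WAL segments in pg_wal
  SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();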
@Jackson1 discovered that SHOW default_transaction_read_only; was returning on for our cluster. He came up with these steps (consolidated into a single sketch after the list) to make our DB writeable again:
SET default_transaction_read_only TO off; to set default_transaction_read_only to off for your current session. This won’t fix the underlying issue, but it will let you run write queries for as long as you stay connected to the db.
ALTER DATABASE [dbname] SET default_transaction_read_only TO off; to set default_transaction_read_only to off at the database level.
ALTER USER [username] SET default_transaction_read_only TO off; to set default_transaction_read_only to off at the user level.
Use ALTER SYSTEM SET default_transaction_read_only TO off; to set default_transaction_read_only to off at the system level. This will add or update the setting in data/postgresql/postgresql.auto.conf, which you can see if you ssh into the Fly machine in question and run cat data/postgresql/postgresql.auto.conf. The new value takes effect after a configuration reload or restart.
SELECT * FROM pg_settings WHERE name = 'default_transaction_read_only'; to show the row for the setting in the pg_settings view. A value of on in the reset_val column means that when the setting is reset it will go back to on, which could be your problem. As a last resort you could try editing this view directly.
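Putting those together, a minimal sketch of the whole sequence (mydb and myuser are placeholders; the session-level SET has to come first so the ALTER statements aren’t themselves rejected as writes, and pg_reload_conf() picks up the ALTER SYSTEM change):

  SET default_transaction_read_only TO off;                       -- this session only
  ALTER DATABASE mydb SET default_transaction_read_only TO off;   -- new sessions on this database
  ALTER ROLE myuser SET default_transaction_read_only TO off;     -- new sessions for this role
  ALTER SYSTEM SET default_transaction_read_only TO off;          -- written to postgresql.auto.conf
  SELECT pg_reload_conf();                                        -- apply the ALTER SYSTEM change

  -- Confirm where the effective value comes from and what it would reset to
  SELECT name, setting, source, reset_val
  FROM pg_settings
  WHERE name = 'default_transaction_read_only';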
Thanks so much! This got me up and running again, and I’ve now modified my cluster’s auto_extend_size_threshold setting to extend volumes once they hit 50% usage, just to create some breathing room and hopefully prevent this from happening again (previously I had this set to 80%).
About two weeks ago I had a similar situation. I upgraded the cluster to the latest flex image to enable the new PITR, and I had to kill the Barman instance first, as it wasn’t supported by the newer image.
Within a few hours the cluster switched to read-only, the primary instance refused to restart, and the failover procedure to assign another instance as primary also failed. Investigation showed that the cause was WAL, which had grown to 6 GB and was taking 90% of free space, even though the max WAL size is set to 1.5 GB by default.
I assume the reason was the Barman replication slot being stuck as active for some reason. Removing it manually and forcing a restart of the primary finally got the WAL back down to a normal size, unlocking the cluster.
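In case anyone needs to do the same cleanup, this is roughly what it looks like from psql on the primary (the slot name here is just an example; check pg_replication_slots for the actual one):

  -- List slots; a slot that nothing consumes from pins WAL at its restart_lsn
  SELECT slot_name, slot_type, active, active_pid, restart_lsn
  FROM pg_replication_slots;

  -- If the stale slot still shows as active, terminate its walsender first
  SELECT pg_terminate_backend(active_pid)
  FROM pg_replication_slots
  WHERE slot_name = 'barman' AND active;

  -- Drop the slot so Postgres can recycle the retained WAL
  SELECT pg_drop_replication_slot('barman');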
I’m aware that it’s self-managed PG, but I wish Fly provided more detailed and better-tested instructions for migrations like this, as it relies on their setup.
@elliotdickison Were your health checks failing? When your cluster enters read-only mode, it should provide some context within the health checks.
Update: I somehow missed one of your responses.
In the same timeframe we disabled barman and enabled the new PITR. Something went wrong with this, causing disk space on our primary to steadily fill up over a period of a couple days.
The separate Barman machine does use a replication slot, so if this was not cleaned up properly when it was removed it could lead to issues. You should receive warnings about inactive replication slots within your logs. Inactive replication slots should also be removed automatically after 12 hours.
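If it’s useful, a query along these lines shows how much WAL each slot is holding back on the primary (Postgres 10+ function names):

  SELECT slot_name, active,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;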
Setting aside the root cause (how did reindexing a small, super low traffic database get us from ~15% utilization of a 10GB volume to > 90%!?), we’re still not sure how to get the db back into a writeable state. Extending volumes per the log message didn’t help.
How did you run the reindex command? Was it run with the CONCURRENTLY option?
If not, the exclusive lock taken out on the table could certainly impact replication and delay WAL generation.
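For reference, the concurrent form looks like this on Postgres 12+ (index and database names are placeholders). Note that either form builds the replacement index before the old one is dropped, so expect a temporary bump in disk usage roughly the size of the indexes being rebuilt:

  -- Rebuild a single index without blocking writes on the table
  REINDEX INDEX CONCURRENTLY my_index;

  -- Rebuild every index in the database you're connected to
  REINDEX DATABASE CONCURRENTLY my_database;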