I’ve got a Postgres riddle for the gurus out there: we have a cluster that recently got stuck in a read-only state for an unknown reason. When I sign into the primary node (via port :5432) with the postgres superuser and try to update a table, create a user, alter a role, etc., I get “cannot execute X in a read-only transaction”.
fly logs look normal and error-free except for a bunch of “cannot execute X in a read-only transaction” errors triggered by a connected app trying to use the database (also using the postgres superuser)
SELECT pg_is_in_recovery(); returns false (but also logs a “cannot execute UPDATE in a read-only transaction” error)
Read-only queries all succeed
Volumes are at ~10% capacity
A single-machine database restored from this one (using the new Tigris-based PITR feature) exhibits the exact same symptoms
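In case it helps, here’s roughly what a failed write looks like from psql on the primary (readonly_test is just a throwaway table name for illustration):

  -- Connected to the primary on :5432 as the postgres superuser
  SELECT pg_is_in_recovery();           -- returns false
  CREATE TABLE readonly_test (id int);
  -- ERROR:  cannot execute CREATE TABLE in a read-only transaction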
Progress: we traced the cause of the problem back to a query to reindex the database. This temporarily put our primary volume over 90% capacity:
Health check for your postgres database has failed. Your database is malfunctioning. [✗] disk-capacity: 90.1% - readonly mode enabled, extend your volume to re-enable writes (173.05ms)
Setting aside the root cause (how did reindexing a small, super low traffic database get us from ~15% utilization of a 10GB volume to > 90%!?), we’re still not sure how to get the db back into a writeable state. Extending volumes per the log message didn’t help.
The re-indexing was correlation, not causation. In the same timeframe we disabled barman and enabled the new PITR. Something went wrong with this, causing disk space on our primary to steadily fill up over a period of a couple days. It appears that this space was not used by pg directly, as our actual database size didn’t change. We still haven’t root-caused this.
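A quick way to see whether the space is going into the database itself or somewhere else under the data directory, such as pg_wal, is to compare the two from psql (pg_ls_waldir() requires superuser or membership in pg_monitor):

  -- Logical size of the current database
  SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size;

  -- Total space held by WAL segments in pg_wal
  SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();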
@Jackson1 discovered that SHOW default_transaction_read_only; was returning on for our cluster. He came up with these steps (consolidated into a single sketch after the list) to make our DB writeable again:
SET default_transaction_read_only TO off; to set default_transaction_read_only to off for your current session. This won’t fix the underlying issue, but it will let you run write queries for as long as you stay connected to the db.
ALTER DATABASE [dbname] SET default_transaction_read_only TO off; to set default_transaction_read_only to off at the database level.
ALTER USER [username] SET default_transaction_read_only TO off; to set default_transaction_read_only to off at the user level.
Use ALTER SYSTEM SET default_transaction_read_only TO off; to set default_transaction_read_only to off at the system level. This will add or update the setting in data/postgresql/postgresql.auto.conf, which you can see if you ssh into the Fly machine in question and run cat data/postgresql/postgresql.auto.conf. The new value takes effect after a configuration reload or restart.
SELECT * FROM pg_settings WHERE name = 'default_transaction_read_only'; to show the row for the setting in the pg_settings view. A value of on in the reset_val column means that when the setting is reset it will go back to on, which could be your problem. As a last resort you could try editing this view directly.
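Putting those together, a minimal sketch of the whole sequence (mydb and myuser are placeholders; the session-level SET has to come first so the ALTER statements aren’t themselves rejected as writes, and pg_reload_conf() picks up the ALTER SYSTEM change):

  SET default_transaction_read_only TO off;                       -- this session only
  ALTER DATABASE mydb SET default_transaction_read_only TO off;   -- new sessions on this database
  ALTER ROLE myuser SET default_transaction_read_only TO off;     -- new sessions for this role
  ALTER SYSTEM SET default_transaction_read_only TO off;          -- written to postgresql.auto.conf
  SELECT pg_reload_conf();                                        -- apply the ALTER SYSTEM change

  -- Confirm where the effective value comes from and what it would reset to
  SELECT name, setting, source, reset_val
  FROM pg_settings
  WHERE name = 'default_transaction_read_only';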
Thanks so much! This got me up and running again, and I’ve now modified my cluster’s auto_extend_size_threshold setting to extend volumes once they hit 50% usage, just to create some breathing room and hopefully prevent this from happening again (previously I had this set to 80%).
About two weeks ago I had a similar situation. I upgraded the cluster to the latest flex image to enable the new PITR, and I had to kill the Barman instance first, as it wasn’t supported by the newer image.
Within a few hours the cluster switched to read-only, the primary instance refused to restart, and the failover procedure to assign another instance as primary also failed. Investigation showed that the cause was WAL, which had grown to 6 GB and was taking 90% of free space, even though the max WAL size is set to 1.5 GB by default.
I assume the reason was the Barman replication slot being stuck as active for some reason. Removing it manually and forcing a restart of the primary finally got the WAL back down to a normal size, unlocking the cluster.
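In case anyone needs to do the same cleanup, this is roughly what it looks like from psql on the primary (the slot name here is just an example; check pg_replication_slots for the actual one):

  -- List slots; a slot that nothing consumes from pins WAL at its restart_lsn
  SELECT slot_name, slot_type, active, active_pid, restart_lsn
  FROM pg_replication_slots;

  -- If the stale slot still shows as active, terminate its walsender first
  SELECT pg_terminate_backend(active_pid)
  FROM pg_replication_slots
  WHERE slot_name = 'barman' AND active;

  -- Drop the slot so Postgres can recycle the retained WAL
  SELECT pg_drop_replication_slot('barman');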
I’m aware that it’s self-managed PG, but I wish Fly provided more detailed and better-tested instructions for migrations like this, as it relies on their setup.
@elliotdickison Were your health checks failing? When your cluster enters read-only mode, it should provide some context within the health checks.
Update: I somehow missed one of your responses.
In the same timeframe we disabled barman and enabled the new PITR. Something went wrong with this, causing disk space on our primary to steadily fill up over a period of a couple days.
The separate Barman machine does use a replication slot, so if this was not cleaned up properly when it was removed it could lead to issues. You should receive warnings about inactive replication slots within your logs. Inactive replication slots should also be removed automatically after 12 hours.
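If it’s useful, a query along these lines shows how much WAL each slot is holding back on the primary (Postgres 10+ function names):

  SELECT slot_name, active,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;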
Setting aside the root cause (how did reindexing a small, super low traffic database get us from ~15% utilization of a 10GB volume to > 90%!?), we’re still not sure how to get the db back into a writeable state. Extending volumes per the log message didn’t help.
How did you run the reindex command? Was it run with the CONCURRENTLY option?
If not, the exclusive lock taken out on the table could certainly impact replication and delay WAL generation.
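For reference, the concurrent form looks like this on Postgres 12+ (index and database names are placeholders). Note that either form builds the replacement index before the old one is dropped, so expect a temporary bump in disk usage roughly the size of the indexes being rebuilt:

  -- Rebuild a single index without blocking writes on the table
  REINDEX INDEX CONCURRENTLY my_index;

  -- Rebuild every index in the database you're connected to
  REINDEX DATABASE CONCURRENTLY my_database;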