PostgreSQL Operational Updates

ben-io · July 30, 2024, 10:40pm

As we’ve been doing machine migrations, we’ve also had to migrate a bunch of Fly Postgres clusters, which has forced us to get better at keeping clusters running. Before migrating a Postgres primary, we now do a failover first, so that a replica somewhere else becomes the new primary. The old primary then becomes a replica, and we can move it easily without downtime. Along the way we’ve also come across a handful of broken clusters and, in some cases, repaired them.

This exercise in machine migrations has helped us fix bugs in our Postgres implementation. Most recently, we discovered that some DB replicas were stuck in a loop logging the error database "repmgr" does not exist. It turns out this was also happening to replicas that hadn’t been migrated. We traced it to a bug in our restore code. When you create a new database from a volume snapshot (fly pg create --snapshot-id vs_...) or a volume fork (fly pg create --fork-from ...) and restore into a new multi-node cluster, postgres-flex was correctly wiping the metadata but skipping the re-initialization. The fix is here.

If you have a restored database replica that is logging that error message, you’ll need to delete that instance and recreate it. Your primary instance and your raw data should be unaffected.
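
If you want to check several instances at once, here is a rough Python sketch of the idea. It assumes you have already collected each instance's log lines yourself; the function name, the dictionary layout, and the sample log lines are all made up for illustration and are not part of flyctl or postgres-flex. Only the error text itself comes from the post above.

```python
# Error text quoted in the post; everything else here is illustrative.
ERROR_MARKER = 'database "repmgr" does not exist'


def affected_instances(logs_by_instance):
    """Return the sorted names of instances whose logs contain the error.

    logs_by_instance maps a (hypothetical) instance name to a list of
    log lines that you have already fetched for that instance.
    """
    return sorted(
        name
        for name, lines in logs_by_instance.items()
        if any(ERROR_MARKER in line for line in lines)
    )


if __name__ == "__main__":
    # Invented sample data, only to show the expected input shape.
    sample = {
        "replica-a": ['FATAL: database "repmgr" does not exist'],
        "replica-b": ["LOG: database system is ready to accept connections"],
    }
    for name in affected_instances(sample):
        print(f"{name}: delete and recreate this replica")
```

Any instance this flags is a candidate for the delete-and-recreate step described above. It is worth confirming that the instance really is a replica and not the primary before removing anything.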
