As we’ve been doing machine migrations, we’ve also had to migrate a bunch of Fly Postgres clusters, which has forced us to get better at keeping clusters running. Before migrating a Postgres primary, we now do a failover first, so that a replica somewhere else becomes the new primary. The old primary then becomes a replica, which we can move easily without downtime. We’ve also seen a handful of broken clusters and, in some cases, repaired them.
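The pre-migration failover step can be driven from flyctl. A minimal sketch, assuming a Machines-based Fly Postgres cluster; the app name here is hypothetical:

```shell
# Promote a replica to primary before touching the current
# primary's machine. flyctl picks a healthy replica to promote.
fly postgres failover --app my-postgres-cluster

# Check the cluster topology afterwards: the old primary
# should now report the "replica" role and can be migrated.
fly status --app my-postgres-cluster
```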
This exercise in machine migrations has helped us fix bugs in our Postgres implementation. Most recently, we discovered that some database replicas were stuck in a loop logging `database "repmgr" does not exist`. It turned out this was also happening to replicas that hadn’t been migrated. We traced it to a bug in our restore code: when you create a new database from a volume snapshot (`fly pg create --snapshot-id vs_...`) or a volume fork (`fly pg create --fork-from ...`) and restore into a new multi-node cluster, postgres-flex was correctly wiping the repmgr metadata but skipping the re-initialization. The fix is here.
If you have a restored database replica that is logging that error message, you’ll need to delete that instance and recreate it. Your primary instance and your raw data should be unaffected.
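One way to do the delete-and-recreate, again assuming a Machines-based cluster; the machine IDs and app name below are placeholders, not real values:

```shell
# Find the machine ID of the replica stuck logging
# `database "repmgr" does not exist`.
fly status --app my-postgres-cluster

# Destroy the broken replica machine.
fly machine destroy <replica-machine-id> --app my-postgres-cluster --force

# Clone a healthy node to bring the cluster back to size.
# The new replica syncs its data from the current primary.
fly machine clone <healthy-machine-id> --app my-postgres-cluster
```

Since replicas rebuild their state from the primary, this doesn’t touch your underlying data.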