I encountered an issue with postgres-flex where the machine role gets stuck as “Unknown” (instead of “Primary”) in the Fly.io UI after an unplanned machine restart. Here’s what happened:
I had a 2-node PostgreSQL HA cluster running (I later realized 3 nodes are needed for auto-failover). The primary node ran out of RAM due to increased concurrent connections, likely triggered by reindex or autovacuum processes. This caused the machine to become unresponsive.
After stopping and restarting the machine through the Fly.io UI, the database came back up and worked fine. However, the UI still showed the machine role as “Unknown” instead of “Primary”, despite the repmgr tables showing the correct state. You can verify this by checking:
SELECT node_id, node_name, type, active FROM repmgr.nodes;
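For anyone who wants to run the same check, something like this should get you a psql session (the app name is a placeholder; depending on your setup you may need to switch to the database that holds the repmgr schema once connected):

fly postgres connect --app <app-id>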
This suggests there’s a disconnect between repmgr’s internal state and Fly.io’s metadata layer when machines restart unexpectedly. I was able to fix it by manually updating the metadata:
fly machines update <machine-id> --metadata flypg_role=primary --app <app-id>
After this command, the UI correctly showed the machine as Primary again. While this isn’t a critical issue since the database works fine, it could cause confusion during incident response if the UI state doesn’t match the actual database state.
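In case it saves someone a lookup: the machine ID for that update command can be pulled from the machines list, and I believe fly status reads the same role metadata the UI does, so it’s a quick way to confirm the fix from the terminal (app name is a placeholder again):

fly machines list --app <app-id>
fly status --app <app-id>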
To reproduce:
- Have a primary node run out of memory
- Stop the machine via UI
- Restart it
- Check that the UI shows “Unknown” even though repmgr still reports the correct role (a quick way to compare the two is sketched below)
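And here’s a rough, untested sketch of how the two views could be compared directly, in case it helps whoever looks at the sync. It reads the flypg_role metadata through the public Machines API; the app and machine IDs are placeholders, and I’m assuming the metadata endpoint returns a flat JSON object of key/value pairs:

# token for the Machines API (fly auth token prints one)
export FLY_API_TOKEN="$(fly auth token)"
# read the flypg_role metadata that the UI appears to use
curl -s -H "Authorization: Bearer ${FLY_API_TOKEN}" \
  "https://api.machines.dev/v1/apps/<app-id>/machines/<machine-id>/metadata" | jq -r '.flypg_role'

If that value disagrees with what repmgr.nodes reports after a restart, that’s the mismatch I saw.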
Hope this helps identify potential improvements in the metadata sync process between repmgr and Fly.io’s infrastructure layer.