I encountered an issue with postgres-flex where the machine role gets stuck as “Unknown” (instead of “Primary”) in the Fly.io UI after an unplanned machine restart. Here’s what happened:
I had a 2-node PostgreSQL HA cluster running (I later realized 3 nodes are needed for auto-failover). The primary node ran out of RAM due to increased concurrent connections, likely triggered by reindex or autovacuum processes. This caused the machine to become unresponsive.
After stopping and restarting the machine through the Fly.io UI, the database came back up and worked fine. However, the UI still showed the machine role as “Unknown” instead of “Primary”, despite the repmgr tables showing the correct state. You can verify this by checking:
SELECT node_id, node_name, type, active FROM repmgr.nodes;
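For anyone who wants to run the same check, something like this should get you a psql session (the app name is a placeholder; depending on your setup you may need to switch to the database that holds the repmgr schema once connected):

fly postgres connect --app <app-id>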
This suggests there’s a disconnect between repmgr’s internal state and Fly.io’s metadata layer when machines restart unexpectedly. I was able to fix it by manually updating the metadata:
fly machines update <machine-id> --metadata flypg_role=primary --app <app-id>
After this command, the UI correctly showed the machine as Primary again. While this isn’t a critical issue since the database works fine, it could cause confusion during incident response if the UI state doesn’t match the actual database state.
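In case it saves someone a lookup: the machine ID for that update command can be pulled from the machines list, and I believe fly status reads the same role metadata the UI does, so it’s a quick way to confirm the fix from the terminal (app name is a placeholder again):

fly machines list --app <app-id>
fly status --app <app-id>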
To reproduce:
- Have a primary node run out of memory
- Stop the machine via UI
- Restart it
- Check that the UI shows “Unknown” even though repmgr still reports the correct role (a quick way to compare the two is sketched below)
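And here’s a rough, untested sketch of how the two views could be compared directly, in case it helps whoever looks at the sync. It reads the flypg_role metadata through the public Machines API; the app and machine IDs are placeholders, and I’m assuming the metadata endpoint returns a flat JSON object of key/value pairs:

# token for the Machines API (fly auth token prints one)
export FLY_API_TOKEN="$(fly auth token)"
# read the flypg_role metadata that the UI appears to use
curl -s -H "Authorization: Bearer ${FLY_API_TOKEN}" \
  "https://api.machines.dev/v1/apps/<app-id>/machines/<machine-id>/metadata" | jq -r '.flypg_role'

If that value disagrees with what repmgr.nodes reports after a restart, that’s the mismatch I saw.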
Hope this helps identify potential improvements in the metadata sync process between repmgr and Fly.io’s infrastructure layer.