Just waking up to this. Widespread reports of our app being offline.
Notes:
A specific DB machine was created 2 hours ago (automatically), and all the failed connections appear to be to it.
That machine itself doesn’t seem to actually be connected to the DB, which is almost certainly the problem.
The problem seems more widespread: the replica manager seems borked because of the machine change. Working on getting a primary running smoothly to migrate everything over to.
Hi Kevin. I was looking at your unmanaged postgres cluster; it seems that only one node was migrated over to SJC, and that caused all clients (already in SJC) to attempt to connect to it.
The migration of all postgres nodes to SJC has now completed; most importantly, the primary moved over, so clients should be connecting normally.
I see your app’s traffic has recovered, with no more 500 status code responses, although not yet back to normal levels.
Hi there, I’m one of the engineers on the same application; I looked into this issue this morning.
What we ended up doing to resolve it ASAP was to verify that the primary was functional and then point our application directly at that instance - i.e. all our app instances are now using that one machine exclusively.
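For anyone following along, the temporary fix was roughly this shape - a sketch only, assuming the app reads a DATABASE_URL-style setting; the user, password, database name and port below are placeholders, not our real settings. The actual change was just pointing every instance at the primary's direct hostname instead of the cluster URL:
# point every app instance straight at the primary machine (placeholder credentials and port)
DATABASE_URL="postgres://app_user:app_password@<primary-host>.vm.dispatch-db.internal:5432/app_db"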
We are still seeing logs suggesting that our replicas are not working correctly - see the following snippet from one of our replicas:
I looked at that error, but it seems unrelated to today’s events.
That standby node has been lagging behind for 245 days (21171421 seconds).
postgres@2865949f526758:~$ repmgr node check
WARNING: node "2865949f526758" not found in "pg_stat_replication"
Node "568306df6900e8":
Server role: OK (node is standby)
Replication lag: CRITICAL (21171421 seconds, critical threshold: 600)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: CRITICAL (node "568306df6900e8" (ID: 305378526) is not attached to expected upstream node "7811ed4b4e3238" (ID: 1688476492))
Downstream servers: OK (this node has no downstream nodes)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/data/postgresql")
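If you want to cross-check that lag without repmgr, the standard postgres views on the standby tell the same story (a sketch; run as the postgres user on the node, as in the prompt above):
# how far behind the last replayed transaction is
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
# whether the WAL receiver is attached to an upstream at all (no rows means it isn't)
psql -c "SELECT status FROM pg_stat_wal_receiver;"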
Not a big deal because there are two more nodes replicating fine.
postgres@7811ed4b4e3238:~$ repmgr cluster show
WARNING: node "2865949f526758" not found in "pg_stat_replication"
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------------+----------------+---------+-----------+------------------+----------+----------+----------+---------------------------------------------------------------------------------------------------
305378526 | 2865949f526758 | standby | ! running | ! 7811ed4b4e3238 | sjc | 100 | 3 | host=2865949f526758.vm.dispatch-db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5
1132989670 | 683dd95f19d478 | standby | running | 7811ed4b4e3238 | atl | 0 | 3 | host=683dd95f19d478.vm.dispatch-db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5
1284576639 | e825d09a035658 | standby | running | 7811ed4b4e3238 | sjc | 100 | 3 | host=e825d09a035658.vm.dispatch-db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5
1688476492 | 7811ed4b4e3238 | primary | * running | | sjc | 100 | 3 | host=7811ed4b4e3238.vm.dispatch-db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5
WARNING: following issues were detected
- node "2865949f526758" (ID: 305378526) is running but the repmgr node record is inactive
- node "2865949f526758" (ID: 305378526) is not attached to its upstream node "7811ed4b4e3238" (ID: 1688476492)
btw, 568306df6900e8 was a machine destroyed on 2025-02-27T16:30:03, and for some reason its data was reattached to 2865949f526758; that is where the node confusion comes from.
I think you can clear up the mess by destroying that machine and its volume, and then cloning one of the healthy standbys (e.g. e825d09a035658).
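If you end up rebuilding it by hand, the generic repmgr flow for a fresh standby looks roughly like this - a sketch using the port, user and data directory shown in the output above; your platform may do the clone step for you when the replacement machine is created:
# on the new, empty machine, with postgres stopped:
repmgr -h 7811ed4b4e3238.vm.dispatch-db.internal -p 5433 -U repmgr -d repmgr -D /data/postgresql standby clone
# start postgres on the new machine, then register it with the cluster:
repmgr standby register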
btw, this was a good idea, but now that all nodes are in the same region it is better to point back to the old URL, so that if the primary fails over to any standby the app can continue working.
Thanks @dangra - you were correct that the error I was noticing had been going on longer than today’s incident.
I destroyed the messed-up replica and created a new one, also removing the old node from the cluster and clearing its replication slot. All seems to be working well now, and the app has been moved back over to the appropriate URL.
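For future readers, the cleanup was essentially the following (the slot name is illustrative only - check pg_replication_slots on the primary for the real one):
# on the primary: remove the stale node record for the destroyed replica
repmgr standby unregister --node-id=305378526
# on the primary: list any leftover replication slot for it, then drop it
psql -c "SELECT slot_name, active FROM pg_replication_slots;"
psql -c "SELECT pg_drop_replication_slot('repmgr_slot_305378526');"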
I am still curious about how we could have avoided the issue we had with the region migration. Messaging around the region consolidation explicitly indicated that no action should be required on our part, and yet it took down our DB. I am assuming that indicates a brittleness in our config that ought to be addressed, and would like to make sure we’re not in a position where this could happen again.
There are a few things combined here. As weird as it sounds, the postgres cluster was healthy the whole time, even after one of its nodes moved to the SJC region while the others stayed in SEA. The problem was that the clients (your app) were also migrated and attempted to connect to the closest node, which was the broken replica.
All nodes had to be migrated eventually anyway; completing that migration made the downtime much shorter. The key difference between that node and the others is the used volume space: the broken node had 1.7GB of used space while the healthy ones have 10GB. As a note, the volumes are 40GB in size, but used space is what determines when a node is migrated.
Honestly, looking back, to prevent this I would go for one of these options:
The simplest: monitor the cluster and make sure all nodes are healthy (see the sketch below).
A more foresighted approach: handle the migration yourself by adding a replica in the destination region, forcing a primary failover, and then rescaling the client apps in the target region.
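For the first option, something as small as this on each node, run from cron, would have flagged the lagging replica long before the migration (a sketch; send_alert is a placeholder for whatever alerting hook you already have):
#!/bin/sh
# alert if repmgr reports any CRITICAL condition on this node
output=$(repmgr node check 2>&1)
if echo "$output" | grep -q CRITICAL; then
    send_alert "repmgr node check CRITICAL on $(hostname): $output"
fi
For the second option, repmgr's standby switchover command is the usual way to do the controlled primary move once a standby exists in the destination region.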