Just waking up to this. Widespread reports of our app being offline.
Notes:
A specific DB machine was created 2 hours ago (automatically), and all the failed connections appear to be to it.
That machine itself doesn’t seem to actually be connected to the DB, which is almost certainly the problem.
The problem seems more widespread: the replica manager seems borked because of the machine change. Working on getting a primary running smoothly to migrate everything over to.
Hi Kevin. I was looking at your unmanaged postgres cluster; it seems that only one node was migrated over to SJC, and that caused all clients (already in SJC) to attempt to connect to it.
The migration of all postgres nodes to SJC has now completed; most importantly, the primary moved over, so clients should be connecting normally.
I see your app’s traffic has recovered, with no more 500 status code responses, although not yet back to normal levels.
Hi there, I’m one of the engineers on the same application; I looked into this issue this morning.
What we ended up doing to resolve it ASAP was to verify that the primary was functional and then point our application directly at that instance - i.e. all our app instances are now using that one machine exclusively.
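For anyone following along, the temporary fix was roughly this shape - a sketch only, assuming the app reads a DATABASE_URL-style setting; the user, password, database name and port below are placeholders, not our real settings. The actual change was just pointing every instance at the primary's direct hostname instead of the cluster URL:
# point every app instance straight at the primary machine (placeholder credentials and port)
DATABASE_URL="postgres://app_user:app_password@<primary-host>.vm.dispatch-db.internal:5432/app_db"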
We are still seeing logs suggesting that our replicas are not working correctly - see the following snippet from one of our replicas:
I looked at that error, but it seems unrelated to today’s events.
That standby node has been lagging behind for 245 days (21171421 seconds).
postgres@2865949f526758:~$ repmgr node check
WARNING: node "2865949f526758" not found in "pg_stat_replication"
Node "568306df6900e8":
Server role: OK (node is standby)
Replication lag: CRITICAL (21171421 seconds, critical threshold: 600)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: CRITICAL (node "568306df6900e8" (ID: 305378526) is not attached to expected upstream node "7811ed4b4e3238" (ID: 1688476492))
Downstream servers: OK (this node has no downstream nodes)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/data/postgresql")
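If you want to cross-check that lag without repmgr, the standard postgres views on the standby tell the same story (a sketch; run as the postgres user on the node, as in the prompt above):
# how far behind the last replayed transaction is
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
# whether the WAL receiver is attached to an upstream at all (no rows means it isn't)
psql -c "SELECT status FROM pg_stat_wal_receiver;"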
Not a big deal because there are two more nodes replicating fine.
postgres@7811ed4b4e3238:~$ repmgr cluster show
WARNING: node "2865949f526758" not found in "pg_stat_replication"
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------------+----------------+---------+-----------+------------------+----------+----------+----------+---------------------------------------------------------------------------------------------------
305378526 | 2865949f526758 | standby | ! running | ! 7811ed4b4e3238 | sjc | 100 | 3 | host=2865949f526758.vm.dispatch-db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5
1132989670 | 683dd95f19d478 | standby | running | 7811ed4b4e3238 | atl | 0 | 3 | host=683dd95f19d478.vm.dispatch-db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5
1284576639 | e825d09a035658 | standby | running | 7811ed4b4e3238 | sjc | 100 | 3 | host=e825d09a035658.vm.dispatch-db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5
1688476492 | 7811ed4b4e3238 | primary | * running | | sjc | 100 | 3 | host=7811ed4b4e3238.vm.dispatch-db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5
WARNING: following issues were detected
- node "2865949f526758" (ID: 305378526) is running but the repmgr node record is inactive
- node "2865949f526758" (ID: 305378526) is not attached to its upstream node "7811ed4b4e3238" (ID: 1688476492)
btw, 568306df6900e8 was a machine destroyed on 2025-02-27T16:30:03, and for some reason its data was reattached to 2865949f526758; that is where the node confusion comes from.
I think you can clear up the mess by destroying that machine and its volume, and then cloning one of the healthy standbys (e.g. e825d09a035658).
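If you end up rebuilding it by hand, the generic repmgr flow for a fresh standby looks roughly like this - a sketch using the port, user and data directory shown in the output above; your platform may do the clone step for you when the replacement machine is created:
# on the new, empty machine, with postgres stopped:
repmgr -h 7811ed4b4e3238.vm.dispatch-db.internal -p 5433 -U repmgr -d repmgr -D /data/postgresql standby clone
# start postgres on the new machine, then register it with the cluster:
repmgr standby register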
btw, this was a good idea, but now that all nodes are in the same region it is better to point back to the old URL, so that if the primary fails over to any standby the app can continue working.
Thanks @dangra - you were correct that the error I was noticing had been going on longer than today’s incident.
I destroyed the messed-up replica and created a new one, also removing the old node from the cluster and clearing its replication slot. All seems to be working well now, and the app has been moved back over to the appropriate URL.
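For future readers, the cleanup was essentially the following (the slot name is illustrative only - check pg_replication_slots on the primary for the real one):
# on the primary: remove the stale node record for the destroyed replica
repmgr standby unregister --node-id=305378526
# on the primary: list any leftover replication slot for it, then drop it
psql -c "SELECT slot_name, active FROM pg_replication_slots;"
psql -c "SELECT pg_drop_replication_slot('repmgr_slot_305378526');"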
I am still curious about how we could have avoided the issue we had with the region migration. Messaging around the region consolidation explicitly indicated that no action should be required on our part, and yet it took down our DB. I am assuming that indicates a brittleness in our config that ought to be addressed, and would like to make sure we’re not in a position where this could happen again.
There are a few things combined here. As weird as it sounds, the postgres cluster was healthy the whole time, even after one of its nodes moved to the SJC region while the others stayed in SEA. The problem was that the clients (your app) were also migrated and attempted to connect to the closest node, which was the broken replica.
All nodes had to be migrated eventually anyway; completing that migration made the downtime much shorter. The key difference between that node and the others is the used volume space: the broken node had 1.7GB of used space while the healthy ones have 10GB. As a note, the volumes are 40GB in size, but used space is what determines when a node is migrated.
Honestly, looking back, to prevent this I would go for one of these options:
The simplest: monitor the cluster and make sure all nodes are healthy (see the sketch below).
A more foresighted approach: handle the migration yourself by adding a replica in the destination region, forcing a primary failover, and then rescaling the client apps in the target region.
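For the first option, something as small as this on each node, run from cron, would have flagged the lagging replica long before the migration (a sketch; send_alert is a placeholder for whatever alerting hook you already have):
#!/bin/sh
# alert if repmgr reports any CRITICAL condition on this node
output=$(repmgr node check 2>&1)
if echo "$output" | grep -q CRITICAL; then
    send_alert "repmgr node check CRITICAL on $(hostname): $output"
fi
For the second option, repmgr's standby switchover command is the usual way to do the controlled primary move once a standby exists in the destination region.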