I’m following the instructions in “Performing a regional failover” to change the primary region of my flyio/postgres-flex:15.3 cluster from DFW to IAD.
I have 3 replicas in IAD along with the 2 existing instances in DFW, and I’ve updated the app config to use the environment variable in IAD, so everything looks poised to fail over. When I run the command in step 8, fly pg failover --app my-db-app, it fails.
Error output from this command:
Performing a failover
Connecting to [old-leader-IPv6]… complete
Connecting to [old-leader-IPv6]… complete
Connecting to [old-leader-IPv6]… complete
Error promoting new leader, restarting existing leader
Waiting for old leader to finish stopping
Clearing existing machine lease…
Trying to start old leader
Old leader started succesfully
Error: Failed to run failover: no leader could be chosen. Here are the reasons why:
NEW-REGION-INSTANCE-ID-1: Running a dry run of repmgr standby switchover failed. Try running fly ssh console -u postgres -C 'repmgr standby switchover -f /data/repmgr.conf --dry-run' -s -a my-db-app for more information. This was most likely due to the requirements for quorum not being met.
NEW-REGION-INSTANCE-ID-2: Running a dry run of repmgr standby switchover failed. Try running fly ssh console -u postgres -C 'repmgr standby switchover -f /data/repmgr.conf --dry-run' -s -a my-db-app for more information. This was most likely due to the requirements for quorum not being met.
NEW-REGION-INSTANCE-ID-3: Running a dry run of repmgr standby switchover failed. Try running fly ssh console -u postgres -C 'repmgr standby switchover -f /data/repmgr.conf --dry-run' -s -a my-db-app for more information. This was most likely due to the requirements for quorum not being met.
I see errors in the logs like this:
2024-02-15T08:25:30.100 app[NEW-REGION-INSTANCE-ID] iad [info] 2024/02/15 08:25:30 New SSH Session - my@emailcom
2024-02-15T08:25:31.727 app[OLD-REGION-PRIMARY-INSTANCE-ID] dfw [info] 2024/02/15 08:25:31 New SSH Session - Email Not Found
2024-02-15T08:25:31.755 app[OLD-REGION-PRIMARY-INSTANCE-ID] dfw [info] 2024/02/15 08:25:31 unexpected error: [ssh: no auth passed yet, not a cert, no fly email extension]
When I run the suggested command (repmgr standby switchover with --dry-run), the output is:
NOTICE: checking switchover on node "[old-leader-IPv6]" (ID: 1570098223) in --dry-run mode
WARNING: unable to connect to remote host "[old-leader-IPv6]" via SSH
ERROR: unable to connect via SSH to host "[old-leader-IPv6]", user ""
Error: ssh shell: Process exited with status 1
What could be going wrong here? It looks like there’s a communication failure from the new region’s replicas to the existing region’s primary.