Unable to perform postgres regional failover

I’m following the instructions for Performing a regional failover to change my flyio/postgres-flex:15.3 primary region from DFW to IAD.

I have 3 replicas in IAD along with the 2 existing instances in DFW, I’ve updated the app config to use the environment variable in IAD, and everything looks poised to fail over. When I run the command in step 8, fly pg failover --app my-db-app, it fails.

Error output from this command:

Performing a failover
Connecting to [old-leader-IPv6]… complete
Connecting to [old-leader-IPv6]… complete
Connecting to [old-leader-IPv6]… complete
Error promoting new leader, restarting existing leader
Waiting for old leader to finish stopping
Clearing existing machine lease…
Trying to start old leader
Old leader started succesfully
Error: Failed to run failover: no leader could be chosen. Here are the reasons why:
NEW-REGION-INSTANCE-ID-1: Running a dry run of repmgr standby switchover failed. Try running fly ssh console -u postgres -C 'repmgr standby switchover -f /data/repmgr.conf --dry-run' -s -a my-db-app for more information. This was most likely due to the requirements for quorum not being met.
NEW-REGION-INSTANCE-ID-2: Running a dry run of repmgr standby switchover failed. Try running fly ssh console -u postgres -C 'repmgr standby switchover -f /data/repmgr.conf --dry-run' -s -a my-db-app for more information. This was most likely due to the requirements for quorum not being met.
NEW-REGION-INSTANCE-ID-3: Running a dry run of repmgr standby switchover failed. Try running fly ssh console -u postgres -C 'repmgr standby switchover -f /data/repmgr.conf --dry-run' -s -a my-db-app for more information. This was most likely due to the requirements for quorum not being met.

I see errors in the logs like this:

2024-02-15T08:25:30.100 app[NEW-REGION-INSTANCE-ID] iad [info] 2024/02/15 08:25:30 New SSH Session - my@emailcom
2024-02-15T08:25:31.727 app[OLD-REGION-PRIMARY-INSTANCE-ID] dfw [info] 2024/02/15 08:25:31 New SSH Session - Email Not Found
2024-02-15T08:25:31.755 app[OLD-REGION-PRIMARY-INSTANCE-ID] dfw [info] 2024/02/15 08:25:31 unexpected error: [ssh: no auth passed yet, not a cert, no fly email extension]

When I run the suggested repmgr standby switchover --dry-run command, the output is:

NOTICE: checking switchover on node "[old-leader-IPv6]" (ID: 1570098223) in --dry-run mode
WARNING: unable to connect to remote host "[old-leader-IPv6]" via SSH
ERROR: unable to connect via SSH to host "[old-leader-IPv6]", user ""
Error: ssh shell: Process exited with status 1

What could be going wrong here? It looks like there’s a communication failure from the new region’s replicas to the existing region’s primary.

Hi… The above suggests that replicas perhaps didn’t get their SSH certificates at boot time.

What kinds of things do you see in the NAME column of the secrets list?

$ fly secrets list -a db-app-name
NAME               DIGEST            CREATED
FLY_CONSUL_URL     xxxxxxxxxxxxxxxx  May 17 2023
OPERATOR_PASSWORD  xxxxxxxxxxxxxxxx  May 17 2023
REPL_PASSWORD      xxxxxxxxxxxxxxxx  May 17 2023
SU_PASSWORD        xxxxxxxxxxxxxxxx  May 17 2023
SSH_KEY            xxxxxxxxxxxxxxxx  May 17 2023
SSH_CERT           xxxxxxxxxxxxxxxx  May 17 2023  ←
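
If you would rather check this from a script than by eye, here is a minimal sketch. The missing_secrets function is a hypothetical helper, not part of flyctl; it only assumes that the secret name is the first column of the listing, as in the output above.

```shell
# Hypothetical helper (not part of flyctl): print each required secret name
# that does not appear in the first column of a secrets listing.
missing_secrets() {
  # $1: text of the listing; remaining arguments: required secret names
  listing="$1"
  shift
  for name in "$@"; do
    # awk exits 0 only if some line has the name as its first field
    if ! printf '%s\n' "$listing" |
      awk -v n="$name" '$1 == n { found = 1 } END { exit !found }'; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage (assumes flyctl is installed and logged in):
#   missing_secrets "$(fly secrets list -a db-app-name)" SSH_KEY SSH_CERT
```

Empty output means both names are present in the listing. It says nothing about whether the replicas actually picked the certificate up at boot.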


I’m seeing the exact same problem trying to fail over from a single instance in ams to 3 instances in cdg.

NAME                    DIGEST                  CREATED AT        
FLY_CONSUL_URL          3d44e7b51490ee4e        Sep 26 2023 15:56
OPERATOR_PASSWORD       0edbf5643307f98c        Sep 26 2023 15:56
REPL_PASSWORD           cb84e1d86f42a760        Sep 26 2023 15:56
SSH_CERT                766a71d34605eabf        Sep 26 2023 15:56
SSH_KEY                 40e228bfa695b798        Sep 26 2023 15:56
SU_PASSWORD             d1a333adee0aca57        Sep 26 2023 15:56

Hi… You have the right secret. Can you SSH from a replica to the AMS primary if you try it manually?

$ fly ssh console -a db-app-name --region ams
# echo "$FLY_REGION"  #double-check
# echo "$FLY_PRIVATE_IP"
# ssh-keyscan "$FLY_PRIVATE_IP" | ssh-keygen -l -f -
# exit

$ fly ssh console -a db-app-name --region cdg
# su postgres
> echo "$FLY_REGION"  #double-check
> ls -l ~/
> ls -l ~/.ssh/
> ssh <ams-ipv6-from-above>
Are you sure you want to continue connecting (yes/no/[fingerprint])?

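At the final prompt, paste the value from the SHA256: column of the ssh-keyscan output above.

(This is the host key fingerprint. ssh accepts it in place of yes and continues only if it matches the key the host presents.)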
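
If ssh-keyscan prints several keys, a small sketch like the following can tell you whether a given fingerprint is among them. The fingerprint_matches function is a hypothetical helper; it assumes the usual ssh-keygen -l line layout, where the fingerprint is the second whitespace-separated field.

```shell
# Hypothetical helper: succeed if the given fingerprint appears in
# `ssh-keygen -l` output read from stdin.
# Assumed line layout: <bits> <fingerprint> <comment> (<key type>)
fingerprint_matches() {
  # $1: fingerprint to look for, e.g. the one shown in the ssh prompt
  awk -v want="$1" '$2 == want { ok = 1 } END { exit !ok }'
}

# Usage on the primary (the fingerprint argument is a placeholder):
#   ssh-keyscan "$FLY_PRIVATE_IP" | ssh-keygen -l -f - |
#     fingerprint_matches 'SHA256:<fingerprint-from-ssh-prompt>' && echo match
```

A match only confirms that the replica is talking to the host you scanned; it does not by itself show that the certificate-based login repmgr relies on is working.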
