Postgres failover fails. Unable to connect via SSH

I have a Postgres cluster with 3 machines. When I run fly postgres failover, the failover fails. The same failover test worked fine 4 days ago, just after I created the cluster.

Here are the details. I have replaced my postgres app name with my-db-app:

% fly postgres failover --app my-db-app
Performing a failover
Connecting to fdaa:5:8e54:a7b:15d:7860:8567:2... complete
Connecting to fdaa:5:8e54:a7b:2809:662f:81c8:2... complete
Error promoting new leader, restarting existing leader
Waiting for old leader to finish stopping
Clearing existing machine lease...
Trying to start old leader
Old leader started succesfully
Error: Failed to run failover: no leader could be chosen. Here are the reasons why:
3d8ddedb0edd98: Running a dry run of `repmgr standby switchover` failed. Try running `fly ssh console -u postgres -C 'repmgr standby switchover -f /data/repmgr.conf --dry-run' -s -a my-db-app` for more information. This was most likely due to the requirements for quorum not being met.
9185e76df14528: Running a dry run of `repmgr standby switchover` failed. Try running `fly ssh console -u postgres -C 'repmgr standby switchover -f /data/repmgr.conf --dry-run' -s -a my-db-app` for more information. This was most likely due to the requirements for quorum not being met.

please fix one or more of the above issues, and try again

When I run the suggested command from either of the replica machines I get this:

% fly ssh console -u postgres -C 'repmgr standby switchover -f /data/repmgr.conf --dry-run' -s -a my-db-app
? Select VM: lhr: 3d8ddedb0edd98 fdaa:5:8e54:a7b:2809:662f:81c8:2 polished-morning-557 (replica)
Connecting to fdaa:5:8e54:a7b:2809:662f:81c8:2... complete
NOTICE: checking switchover on node "fdaa:5:8e54:a7b:2809:662f:81c8:2" (ID: 322771910) in --dry-run mode
WARNING: unable to connect to remote host "fdaa:5:8e54:a7b:18:96db:5d92:2" via SSH
ERROR: unable to connect via SSH to host "fdaa:5:8e54:a7b:18:96db:5d92:2", user ""
Error: ssh shell: Process exited with status 1

From this, it appears that the replica machines cannot connect to the primary machine via SSH. When I SSH into one of the replicas, I cannot then SSH from it into the primary or the other replica. The error I see is this:

# ssh fdaa:5:8e54:a7b:18:96db:5d92:2
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
[REDACTED].
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending ED25519 key in /root/.ssh/known_hosts:2
  remove with:
  ssh-keygen -f "/root/.ssh/known_hosts" -R "fdaa:5:8e54:a7b:18:96db:5d92:2"
Host key for fdaa:5:8e54:a7b:18:96db:5d92:2 has changed and you have requested strict checking.
Host key verification failed.

Trying to SSH from the second replica machine to the primary machine gives me the same error as above. After running the suggested ssh-keygen command, I still cannot SSH into the primary machine, and I get the following error:
Permission denied (publickey)
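
In case it helps with diagnosis, here is a minimal sketch of how the recorded host key could be compared with the key the primary currently offers, before removing anything from known_hosts. The helper names key_of and same_host_key are my own, not part of fly or repmgr, and the commented usage assumes ssh-keygen and ssh-keyscan are available on the machine. It also assumes plain known_hosts lines without markers such as @cert-authority.

```shell
# Print "<keytype> <base64-key>" from known_hosts-style input on stdin,
# skipping comment lines such as "# Host ... found: line 2".
key_of() {
  awk '!/^#/ && NF >= 3 { print $2, $3; exit }'
}

# Exit 0 when both arguments carry the same key type and key, 1 otherwise.
# $1: line(s) recorded in known_hosts, $2: line(s) offered by the remote host.
same_host_key() {
  recorded=$(printf '%s\n' "$1" | key_of)
  offered=$(printf '%s\n' "$2" | key_of)
  [ -n "$recorded" ] && [ "$recorded" = "$offered" ]
}

# Hypothetical usage on a replica (address and path taken from the warning above):
#   host=fdaa:5:8e54:a7b:18:96db:5d92:2
#   recorded=$(ssh-keygen -F "$host" -f /root/.ssh/known_hosts)
#   offered=$(ssh-keyscan -t ed25519 "$host" 2>/dev/null)
#   same_host_key "$recorded" "$offered" && echo "key unchanged" || echo "key changed"
```

This only checks whether the stored ED25519 key differs from the one the primary presents now. It says nothing about the Permission denied (publickey) error, which concerns the client key rather than the host key.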

Could someone please help me resolve this? Failover works fine on a newly created Postgres cluster, and I have not modified the machines in any way, yet failover has stopped working on this one.

Many thanks
