More `not a cert` errors when SSHing within Postgres cluster

I recently added a barman node to an existing Postgres cluster and am attempting a point-in-time recovery from the barman machine. I’m running into SSH problems between the barman machine and the primary machine, both when I run the barman recover command and when I try to SSH from one machine to another inside the cluster.

This appears to be a similar issue to what was happening here: Unable to perform postgres regional failover

Some context:

  • I have one primary postgres machine with no replicas, running flyio/postgres-flex:15.6
  • I followed the directions in the above linked article to create my barman machine, create the first backup, show backups, and attempt to restore from backup. Everything works until the recover command.
  • I did not create the primary and the barman machines at the same time, but they are on the same image.

Thanks for the context @stephentgrammer !

I’m very curious about the SSH issue. For reference, if I run this on my barman machine: ssh root@5683d269f75498.vm.pg-with-barman-config.internal -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -Tv

I get these logs:

root@9080e661b14348:/# ssh root@5683d269f75498.vm.pg-with-barman-config.internal -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -Tv
OpenSSH_9.2p1 Debian-2, OpenSSL 3.0.9 30 May 2023
debug1: Reading configuration data /root/.ssh/config
debug1: Executing command: 'nslookup '5683d269f75498.vm.pg-with-barman-config.internal.vm.pg-with-barman-config.internal' | awk '/^Address: / { print $2 }' | grep .'
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug1: Connecting to 5683d269f75498.vm.pg-with-barman-config.internal [fdaa:0:3335:a7b:135:2ead:721b:2] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/id_rsa type 3
debug1: identity file /root/.ssh/id_rsa-cert type 7
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa_sk type -1
debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: identity file /root/.ssh/id_ed25519_sk type -1
debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /root/.ssh/id_xmss type -1
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_9.2p1 Debian-2
debug1: Remote protocol version 2.0, remote software version Go
debug1: compat_banner: no match: Go
debug1: Authenticating to 5683d269f75498.vm.pg-with-barman-config.internal:22 as 'root'
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ssh-ed25519
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: SSH2_MSG_KEX_ECDH_REPLY received
debug1: Server host key: ssh-ed25519 SHA256:P5ja8m42HoVF7isdG0QNEvLdz+1ubcoLFDVM9ka4g4o
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
Warning: Permanently added '5683d269f75498.vm.pg-with-barman-config.internal' (ED25519) to the list of known hosts.
debug1: rekey out after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey in after 134217728 blocks
debug1: Will attempt key: /root/.ssh/id_rsa ED25519 SHA256:4wGwxhXjEnJbcVBAi2EmT6/4n7U7lz0GFVZWNHTbGzA
debug1: Will attempt key: /root/.ssh/id_rsa ED25519-CERT SHA256:4wGwxhXjEnJbcVBAi2EmT6/4n7U7lz0GFVZWNHTbGzA
debug1: Will attempt key: /root/.ssh/id_ecdsa
debug1: Will attempt key: /root/.ssh/id_ecdsa_sk
debug1: Will attempt key: /root/.ssh/id_ed25519
debug1: Will attempt key: /root/.ssh/id_ed25519_sk
debug1: Will attempt key: /root/.ssh/id_xmss
debug1: Will attempt key: /root/.ssh/id_dsa
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,sk-ssh-ed25519@openssh.com,sk-ecdsa-sha2-nistp256@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,rsa-sha2-256,rsa-sha2-512,ssh-rsa,ssh-dss>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: Offering public key: /root/.ssh/id_rsa ED25519 SHA256:4wGwxhXjEnJbcVBAi2EmT6/4n7U7lz0GFVZWNHTbGzA
debug1: Authentications that can continue: publickey
debug1: Offering public key: /root/.ssh/id_rsa ED25519-CERT SHA256:4wGwxhXjEnJbcVBAi2EmT6/4n7U7lz0GFVZWNHTbGzA
debug1: Server accepts key: /root/.ssh/id_rsa ED25519-CERT SHA256:4wGwxhXjEnJbcVBAi2EmT6/4n7U7lz0GFVZWNHTbGzA
Authenticated to 5683d269f75498.vm.pg-with-barman-config.internal ([fdaa:0:3335:a7b:135:2ead:721b:2]:22) using "publickey".
debug1: channel 0: new session [client-session] (inactive timeout: 0)
debug1: Entering interactive session.
debug1: pledge: network
debug1: Sending environment.
debug1: channel 0: setting env LANG = "en_US.utf8"
debug1: pledge: fork

Also for reference my barman machine does have a /root/.ssh folder but my primary node does not. Can you verify this on your primary postgres node?

root@5683d269f75498:/data/.ssh# cat /data/.ssh/config
Match exec "nslookup '%h.vm.pg-with-barman-config.internal' | awk '/^Address: / { print $2 }' | grep ."
	HostName %h.vm.pg-with-barman-config.internal
	StrictHostKeyChecking no
	UserKnownHostsFile=/dev/null

If there’s no /data/.ssh/config there, can you try creating it with the contents above and see if it helps? I assume your barman node already has it. The primary needs this to talk to the barman node properly when fetching data.
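A minimal sketch of creating that file when it is missing, assuming the config belongs at /data/.ssh/config as shown above. The `write_ssh_config` helper and its two arguments are hypothetical names used only for illustration, and the app name must be replaced with your own. File ownership is not handled here, so adjust it to match the other files in that directory.

```shell
#!/bin/sh
# Hypothetical helper: write the Match/HostName block shown above to an
# SSH config file, but only if that file does not exist yet.
write_ssh_config() {
    app_name="$1"      # assumption: your Fly app name
    config_path="$2"   # e.g. /data/.ssh/config

    if [ -e "$config_path" ]; then
        echo "$config_path already exists, leaving it untouched"
        return 0
    fi

    mkdir -p "$(dirname "$config_path")"
    # Unquoted heredoc so ${app_name} expands; \$2 keeps awk's field literal.
    cat > "$config_path" <<EOF
Match exec "nslookup '%h.vm.${app_name}.internal' | awk '/^Address: / { print \$2 }' | grep ."
	HostName %h.vm.${app_name}.internal
	StrictHostKeyChecking no
	UserKnownHostsFile=/dev/null
EOF
    chmod 600 "$config_path"
}

# Example (run on the primary):
#   write_ssh_config pg-with-barman-config /data/.ssh/config
```

The helper refuses to overwrite an existing file, so it is safe to run on a node that already has the config.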

Let me know if this helps your debugging.

LOG OUTPUT
My logs look very similar to yours apart from ip addresses and app names. A couple notable differences:

Mine: Local version string SSH-2.0-OpenSSH_9.2p1 Debian-2+deb12u2
Yours: Local version string SSH-2.0-OpenSSH_9.2p1 Debian-2
Mine has a couple extra lines:

ssh_packet_read_poll2: resetting read seqnr 3
...
kex_input_ext_info: ping@openssh.com (unrecognised)

Then after Authentications that can continue: publickey, mine says:

debug1: Offering public key: /root/.ssh/id_rsa ED25519-CERT SHA256:bnaHzR3wCorGpl0ShYs6RZIvOevLfkoxm5YGujqfh10
debug1: Authentications that can continue: publickey
debug1: Trying private key: /root/.ssh/id_ecdsa
debug1: Trying private key: /root/.ssh/id_ecdsa_sk
debug1: Trying private key: /root/.ssh/id_ed25519
debug1: Trying private key: /root/.ssh/id_ed25519_sk
debug1: Trying private key: /root/.ssh/id_xmss
debug1: Trying private key: /root/.ssh/id_dsa
debug1: No more authentication methods to try.
root@<primary ip>.vm.<app name>.internal: Permission denied (publickey).

SSH CONFIG
/data/.ssh/config and /root/.ssh/config are identical on both my primary and my barman nodes.

Match exec "nslookup '%h.vm.<app name>.internal' | awk '/^Address: / { print $2 }' | grep ."
	HostName %h.vm.<app name>.internal
	StrictHostKeyChecking no
	UserKnownHostsFile=/dev/null

Similarly, id_rsa and id_rsa-cert.pub are identical on both nodes.
My primary node does have two files that the barman node does not have:

  • /root/.ssh/authorized_keys (I may have created this one when I was debugging)
  • /data/.ssh/known_hosts (not sure about this one)
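Since the files match on both nodes, one further check is whether the certificate file itself still parses as a certificate. A minimal sketch, assuming the certificate lives at /root/.ssh/id_rsa-cert.pub as mentioned above; the `inspect_cert` helper name is hypothetical. It relies on `ssh-keygen -L -f`, which prints a certificate's type, signing CA, validity window and principals.

```shell
#!/bin/sh
# Hypothetical helper: report whether a file parses as an OpenSSH certificate.
# `ssh-keygen -L -f FILE` prints the certificate details and exits non-zero
# when FILE does not contain a certificate.
inspect_cert() {
    cert_path="$1"
    if ssh-keygen -L -f "$cert_path"; then
        echo "OK: $cert_path parsed as a certificate"
    else
        echo "FAIL: $cert_path did not parse as a certificate" >&2
        return 1
    fi
}

# Example:
#   inspect_cert /root/.ssh/id_rsa-cert.pub
```

If the file parses, the validity window and principals in the output are worth comparing against the current date and the user being logged in as (root here). This is only a diagnostic and does not by itself explain the failure.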

OK @lubien and @mayailurus, the plot is thickening. We cannot even do a normal failover in our qa db cluster. No sshing is working between our qa pg machines.

If we create a brand new db app, everything works great! Failovers, barman recovery, machine-to-machine ssh, etc.

So it appears that something got borked on our cluster and invalidated the ssh cert or something. Our next idea was to update the app secrets and restart the machines, but I’m not sure what needs to change. We tried generating new SSH_KEY and SSH_CERT secrets and running fly secrets set but we were not confident in the parameters we used to create the key/cert and so were not surprised when that didn’t work.

We obviously could fork the volume and create a new app, but this seems like something we need to get to the bottom of. Any ideas?

It is! This is good additional information…

I’m surprised to see this line in particular, given the not a cert complaints from the SSH daemon.

Did you still get that message in its logs, immediately after the attempt above?

(When I try it, there is only New SSH Session - mayailurus@example.org.)

@lubien To wrap up this thread – we solved this by forking our db to a brand new fly app, and everything is working great! It’s still a mystery to me how and why our postgres cluster got corrupted enough that sshing between machines was impossible.
