Intermittent ssh failures

stevenwenxu · December 2, 2024, 12:46am

I can’t reproduce it 100% of the time, but it has happened quite frequently. This is the result of the fly ssh console command:

Connecting to fdaa:0:efa6:a7b:9d34:9e7f:4e1b:2...

Error: error connecting to SSH server: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

And then in the machine logs, this is what I see:

2024-12-01 19:00:40.580  2024/12/01 18:05:04 ERROR unexpected error error="[ssh: no auth passed yet, ssh: cert is not yet valid]"
2024-12-01 19:00:32.421  Machine started in 1.157s
2024-12-01 19:00:31.343  2024-12-02T00:00:31.343281955 [01JBSG8B86M2BBXA3F36H9C3F6:fc_api] The API server received a Put request on "/logger" with body "{\"log_path\":\"logs.fifo\",\"level\":\"info\"}".
2024-12-01 19:00:31.342  2024-12-02T00:00:31.341949698 [01JBSG8B86M2BBXA3F36H9C3F6:main] Running Firecracker v1.7.0

What’s going on here?

mayailurus · December 2, 2024, 11:13pm

Hi… If I understand correctly, the SSH certificates have fairly tight validity windows, and it looks like either you got one whose start time was in the future or your Machine has an incorrect clock (somehow).

Can you confirm that the "cert is not yet valid" line has a mismatched timestamp at the beginning (19:00:40 versus 18:05:04)? (It might just be a copy-and-paste glitch.)

(In contrast, the two corresponding timestamps match when I SSH into one of my own machines…)
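
To illustrate why a wrong clock would produce "cert is not yet valid", here is a minimal Python sketch of a validity-window check. The function name `check_validity`, the one-minute backdating allowance, and the five-minute lifetime are all assumptions made up for the example; they are not flyctl's real values, and this is not the actual server code.

```python
from datetime import datetime, timedelta, timezone


def check_validity(now, valid_after, valid_before):
    """Classify a certificate window against the verifier's own clock."""
    if now < valid_after:
        return "not yet valid"
    if now >= valid_before:
        return "expired"
    return "ok"


# Hypothetical certificate issued at the real time of the failed login.
issued = datetime(2024, 12, 2, 0, 0, 40, tzinfo=timezone.utc)
valid_after = issued - timedelta(minutes=1)   # assumed backdating allowance
valid_before = issued + timedelta(minutes=5)  # assumed lifetime

# A verifier with a correct clock accepts the certificate.
print(check_validity(issued, valid_after, valid_before))

# A verifier whose clock reads 18:05:04 sees the same certificate
# as starting in the future and rejects it.
stale_clock = datetime(2024, 12, 1, 18, 5, 4, tzinfo=timezone.utc)
print(check_validity(stale_clock, valid_after, valid_before))
```

The point is only that the comparison uses the clock of whoever verifies the certificate, so a Machine clock that lags behind can reject a certificate that was issued correctly.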

stevenwenxu · December 3, 2024, 1:11am

Good catch! I didn’t notice that. I can confirm I pasted it correctly.

The first timestamp is from Fly’s Grafana logs and it’s in my local time. The second timestamp is printed by whichever process logged the error, and when it works, it logs UTC time for me, which is consistent with the result from the date command inside the machine. I’m assuming this log line just prints the machine time.

So this 18:05 timestamp is clearly not UTC. Could the fly machine clock/timezone be set to a different one every time it’s restarted from the suspended state?
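
For reference, the gap between the two timestamps on that line can be worked out with a few lines of Python. This is only a sketch under two assumptions: that my local time is UTC-5 (inferred from the Firecracker lines, where 19:00:31 local sits next to 2024-12-02T00:00:31 UTC), and that the inner timestamp is the machine clock read as UTC.

```python
from datetime import datetime, timedelta, timezone

# Grafana timestamp of the error line, in local time (assumed UTC-5).
local_tz = timezone(timedelta(hours=-5))
grafana_time = datetime(2024, 12, 1, 19, 0, 40, 580000, tzinfo=local_tz)

# Timestamp printed inside the error line, assumed to be the machine clock in UTC.
machine_time = datetime(2024, 12, 1, 18, 5, 4, tzinfo=timezone.utc)

# How far the machine clock appears to lag behind the real time.
skew = grafana_time - machine_time
print(skew)  # 5:55:36.580000
```

Under those assumptions the gap is not a whole number of hours, which would fit a clock that is simply behind better than a changed timezone, though I can't tell for sure from this one line.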

mayailurus · December 3, 2024, 3:44am

Hm… Initial clock anomalies are one of the known glitches of the new (and still fairly experimental, last I heard) suspend feature:

I was able to trigger a clock mismatch once, when SSHing into a suspended Machine right after fly m start, but it took several tries…

Anyway, it looks like it should be adequate to just wait a few more seconds after bumping into this—and then try SSH again (?).
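
If you would rather script the "wait and try again" part, a rough Python sketch could look like the following. The helper name `run_with_retry` and the defaults of 5 attempts and 5 seconds are arbitrary choices for illustration, not anything recommended by Fly.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay=5.0, run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with status 0, pausing between failed tries.

    Returns the number of the attempt that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give the Machine's clock a little time to settle before retrying.
            sleep(delay)
    return None
```

Calling `run_with_retry(["fly", "ssh", "console"])` would keep retrying the command from this thread. Note that it treats any non-zero exit as a failure, so it would also retry if an interactive session itself ended with an error status.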

stevenwenxu · December 3, 2024, 3:54pm

I see, that’s probably it! I do wait for like 5 seconds but maybe it’s not enough. Thank you!

system · December 10, 2024, 3:54pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.