`flyctl ssh console` intermittently times out

About half the time I run flyctl ssh console, it times out:

$ flyctl ssh console
Connecting to tunnel ⣽ Error tunnel unavailable: failed probing "personal": read tcp [fdaa:0:bff:a7b:1221:0:a:0]:43484->[fdaa:0:bff::3]:53: i/o timeout

It’s also slow in general – flyctl ssh console -c 'ls' takes about 5 seconds.

I have a workaround so this isn’t a big issue – just using plain ssh (ssh -o "StrictHostKeyChecking=no" root@mess-with-dns.internal) works every time and is a lot faster (maybe 500ms instead of 5s).

I’m curious, how fast is fly ssh console -s, where you have to select the actual instance?

it’s hard to tell because the variance is really high, it takes between 1.8 and 10 seconds.
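
With variance this high, a single run doesn't say much. Here is a minimal sketch for timing several runs of the same command and summarizing the spread. The helper names time_command and summarize are made up for illustration and are not part of flyctl; the command is passed in as a list so that no particular flags are assumed.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=5):
    """Run `cmd` (a list of arguments) `runs` times; return wall-clock seconds per run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; a failing run is still timed (check=False).
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the min, median and max of a list of durations."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

For example, summarize(time_command(["fly", "ssh", "console", "-s"], runs=10)) would give a rough range, although -s prompts for an instance, so a non-interactive invocation is easier to time.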

Just to see where this extra time is going, can you run LOG_LEVEL=debug fly ssh console and paste the logs that you see?

Here’s the output:

DEBUG Loaded flyctl config from /home/bork/.fly/config.yml
DEBUG determined hostname: "kiwi"
DEBUG determined working directory: "/home/bork/work/mess-with-dns"
DEBUG determined user home directory: "/home/bork"
DEBUG determined config directory: "/home/bork/.fly"
DEBUG ensured config directory exists.
DEBUG ensured config directory perms.
DEBUG cache loaded.
DEBUG config initialized.
DEBUG initialized task manager.
DEBUG skipped querying for new release
DEBUG client initialized.
DEBUG --> POST https://api.fly.io/graphql

{
  "query": "query ($appName: String!) { appbasic:app(name: $appName) { id name platformVersion organization { id slug } } }",
  "variables": {
    "appName": "mess-with-dns"
  }
}

DEBUG {}
DEBUG <-- 200 https://api.fly.io/graphql (2.78s)

{
  "data": {
    "appbasic": {
      "id": "mess-with-dns",
      "name": "mess-with-dns",
      "platformVersion": "nomad",
      "organization": {
        "id": "aaV5JD7y9pVvoTGeGQLvZ4RLvqiOee",
        "slug": "personal"
      }
    }
  }
}
DEBUG app config loaded from /home/bork/work/mess-with-dns/fly.toml
DEBUG Retrieving app info for mess-with-dns
DEBUG --> POST https://api.fly.io/graphql

{
  "query": "query ($appName: String!) { appcompact:app(name: $appName) { id name hostname deployed status appUrl platformVersion organization { id slug } } }",
  "variables": {
    "appName": "mess-with-dns"
  }
}

DEBUG {}
DEBUG <-- 200 https://api.fly.io/graphql (83.96ms)

{
  "data": {
    "appcompact": {
      "id": "mess-with-dns",
      "name": "mess-with-dns",
      "hostname": "mess-with-dns.fly.dev",
      "deployed": true,
      "appUrl": "https://213.188.214.254",
      "platformVersion": "nomad",
      "organization": {
        "id": "aaV5JD7y9pVvoTGeGQLvZ4RLvqiOee",
        "slug": "personal"
      },
      "status": "running"
    }
  }
}
DEBUG --> POST https://api.fly.io/graphql

{
  "query": "mutation($input: ValidateWireGuardPeersInput!) { validateWireGuardPeers(input: $input) { invalidPeerIps } }",
  "variables": {
    "input": {
      "peerIps": [
        "fdaa:0:bff:a7b:1221:0:a:2"
      ]
    }
  }
}

DEBUG {}
DEBUG <-- 200 https://api.fly.io/graphql (62.09ms)

{
  "data": {
    "validateWireGuardPeers": {
      "invalidPeerIps": []
    }
  }
}
Connecting to tunnel ⣽ Error tunnel unavailable: failed probing "personal": read tcp [fdaa:0:bff:a7b:1221:0:a:0]:43626->[fdaa:0:bff::3]:53: i/o timeout
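
To see where the time goes in a dump like the one above, the per-request durations can be pulled out of the response lines. This is a rough sketch that assumes the response-line format shown in this log (DEBUG <-- 200 URL (2.78s) or (83.96ms)); response_times is a hypothetical helper, not a flyctl feature.

```python
import re

# Matches response lines such as: DEBUG <-- 200 https://api.fly.io/graphql (2.78s)
RESPONSE_RE = re.compile(r"<-- (\d{3}) (\S+) \(([\d.]+)(ms|s)\)")


def response_times(log_text):
    """Return a list of (status, url, seconds) for every response line in the log."""
    results = []
    for match in RESPONSE_RE.finditer(log_text):
        status, url, value, unit = match.groups()
        # Normalize milliseconds to seconds so the durations can be compared and summed.
        seconds = float(value) / 1000 if unit == "ms" else float(value)
        results.append((int(status), url, seconds))
    return results
```

On the log above, this gives 2.78s for the first GraphQL request and under 100ms for each of the other two. The tunnel probe that times out is not covered, because it does not produce a timed response line.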

Thanks @julia, this is looking like an issue related to your wireguard peer.

Can you run fly doctor and paste the results?

This will help pinpoint it. You might need to create a new wireguard peer connection with flyctl wireguard create.

$ fly doctor
Testing authentication token... PASSED
Testing flyctl agent... PASSED
Testing local Docker instance... PASSED
Pinging WireGuard gateway (give us a sec)... PASSED

Um that’s interesting.

Can you try creating a new wireguard peer and then rerun the fly ssh console command

How do I do that?

This is definitely the problem, and it’s presumably something on our side.

You can force us to create a new peer for you by running flyctl wireguard reset. I’m poking around now.

Sorry didn’t realise some of my message was missing, it was supposed to say:

Can you try creating a new wireguard peer with flyctl wireguard reset and then rerun the fly ssh console command

Resetting it seems to have fixed the problem, thanks!

Hrm. Curious. I’ve got enough info from your debug dump to do some hunting, but yeah, for future reference: your “interactive” WireGuard peers (the ones flyctl makes for you; they all have interactive in the name) are effectively disposable; if you delete them, flyctl (and flyctl agent) will notice and just make a new one for you. So if you’re seeing WireGuard-related wonkiness, you can always just flyctl wireguard reset to shake off the misbehaving peer connection.

But of course, this shouldn’t be happening in the first place!

For my mental model: are the WireGuard peers not used when I do ssh root@mess-with-dns.internal? (I’m a bit confused about why one way of sshing worked and the other way didn’t)

It’s a good question. If you can use native ssh, a la ssh root@mess-with-dns.internal, you’ve got a “static” WireGuard peer set up that you created explicitly with flyctl wireguard create, added to your host WireGuard, and set up the DNS for. Presumably, you either have that WireGuard connection always-on, or explicitly turn it on before working with stuff in your organization.

When you use flyctl ssh console, we run WireGuard for you, in userland, behind the scenes (along with a complete TCP/IP stack). We keep those WireGuard connections in the flyctl agent, which is just a program that runs in the background that tries to keep WireGuard peers available and shareable across different invocations of flyctl.

So when you’re using flyctl ssh console, you’re asking the flyctl agent to enable the WireGuard peer (creating it if it isn’t already there), then probe it to see if it’s live (we do a trial DNS query across it to make sure it’s working), and only then make the actual 22/tcp SSH connection.
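
As a rough illustration of what a trial DNS query over TCP looks like (the error at the top of the thread is a read tcp ... :53 timeout), here is a minimal sketch. It is not flyctl's actual probe code: the function names, the query type and the name being queried are all assumptions. The only protocol details relied on are the standard DNS message layout and the two-byte length prefix used for DNS over TCP.

```python
import socket
import struct


def build_dns_query(name, query_id=0x1234, qtype=28):
    """Build a DNS query for `name` (qtype 28 = AAAA, 1 = A, 16 = TXT)."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def frame_for_tcp(message):
    """DNS over TCP prefixes each message with its length as two big-endian bytes."""
    return struct.pack("!H", len(message)) + message


def recv_exact(sock, n):
    """Read exactly n bytes, or raise ConnectionError if the peer closes early."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("connection closed before full reply")
        data += chunk
    return data


def probe(server, name, timeout=2.0):
    """Return True if `server` answers a DNS-over-TCP query within `timeout` seconds."""
    query = build_dns_query(name)
    try:
        with socket.create_connection((server, 53), timeout=timeout) as sock:
            sock.sendall(frame_for_tcp(query))
            (length,) = struct.unpack("!H", recv_exact(sock, 2))
            reply = recv_exact(sock, length)
    except OSError:  # covers timeouts, refused connections and early closes
        return False
    # A valid reply echoes the query ID in its first two bytes.
    return len(reply) >= 2 and reply[:2] == query[:2]
```

A call such as probe("fdaa:0:bff::3", "mess-with-dns.internal") would only succeed with the tunnel up, since that address is reachable only across the WireGuard connection.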

If flyctl agent’s WireGuard probe fails, we start over from the top, creating a new WireGuard peer (which will add a couple seconds of latency as we orchestrate the peer in our backend), probing it, and only then make the new connection.
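
The probe-then-retry flow described above can be sketched roughly as follows. This is only a model of the control flow: connect_via_agent and the callables passed into it are hypothetical stand-ins, and the real flyctl agent is a separate background program with its own logic.

```python
def connect_via_agent(existing_peer, create_peer, probe, open_ssh, max_attempts=2):
    """Probe a WireGuard peer before opening SSH; on failure, retry with a new peer.

    existing_peer: the current peer, or None if there isn't one yet
    create_peer:   callable that creates and returns a fresh peer (the slow step)
    probe:         callable(peer) -> bool, e.g. a trial DNS query across the tunnel
    open_ssh:      callable(peer) -> connection, the actual 22/tcp connection
    """
    peer = existing_peer if existing_peer is not None else create_peer()
    for attempt in range(1, max_attempts + 1):
        if probe(peer):
            # Only connect once the tunnel is known to be live.
            return open_ssh(peer)
        if attempt < max_attempts:
            # Probe failed: start over from the top with a new peer.
            peer = create_peer()
    raise TimeoutError("tunnel unavailable: probe failed on every peer tried")
```

In this model, a stale peer costs one failed probe plus one peer creation before SSH connects, which fits the extra seconds of latency described above.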

The advantage to flyctl ssh console is that you don’t need root to set it up, and don’t have to change any configuration on your dev machine. But native WireGuard will always be faster, and probably? more reliable, though flyctl ssh should always eventually work.

that mostly makes sense, thanks!

this has also worked for me. cheers.

edit: worked, yes! however now it appears i need to reset wireguard every time i access the console.
edit edit: maybe not??? idk, ignore me!!