wireguard is ... directional somehow?

I just had a weird issue with the wireguard / private networking setup in my app. The application itself is a small nginx that reverse-proxies to my home network server, which dials into fly.io via wireguard.

So, for a few days now, connections couldn’t be established from my fly.io pod to the backend; I’d earmarked it for debugging later, but looked at it in detail only today. The issue is that the wireguard connection after I boot up my server doesn’t seem to be active:

:;    sudo wg show flytunnel0
interface: flytunnel0
  public key: [redacted]
  private key: (hidden)
  listening port: 37606

peer: [redacted]
  endpoint: 104.225.8.204:51820
  allowed ips: fdaa:0:704::/48
  persistent keepalive: every 15 seconds

And so, connections made from the frontend nginx to the backend timed out.

I decided to take a look at fly ips private, which lists a private IPv6 address, and pinged that address from the backend:

:;    ping fdaa:0:704:a7b:ab9:dd0e:d61:2
PING fdaa:0:704:a7b:ab9:dd0e:d61:2(fdaa:0:704:a7b:ab9:dd0e:d61:2) 56 data bytes
64 bytes from fdaa:0:704:a7b:ab9:dd0e:d61:2: icmp_seq=1 ttl=62 time=36.7 ms
64 bytes from fdaa:0:704:a7b:ab9:dd0e:d61:2: icmp_seq=2 ttl=62 time=15.9 ms
^C

Huh, the connection seems to work?! What?

And lo:

:;    sudo wg show flytunnel0
interface: flytunnel0
  public key: [redacted]
  private key: (hidden)
  listening port: 37606

peer: [redacted]
  endpoint: 104.225.8.204:51820
  allowed ips: fdaa:0:704::/48
  latest handshake: 14 seconds ago
  transfer: 25.01 KiB received, 18.39 KiB sent
  persistent keepalive: every 15 seconds

Now my incoming connections from the frontend to the backend no longer fail either.

This all adds up to the question in the subject here: Do I have to do anything to ensure that the wg vpn connection gets established? Is there some NAT-boring magic I can/have to invoke? Some additional setting I can adjust in wireguard on the backend host to ensure that it’s reachable to the fly pods?

Thanks in advance!

Just to make sure I’m following this:

On your Linux home-net box, you have a native Fly.io WireGuard connection set up, and on Fly.io you have an instance that runs nginx and proxies incoming requests from our Anycast edge to your home-net over that WireGuard connection.

When you look at your local home-net wg interface, it reports that it’s never handshaked before. Connections from Fly.io to your home-net time out, because there’s no live WireGuard connection to route packets over (Fly.io can’t initiate WireGuard connections to your home-net).

You ping across it, from home-net to Fly.io, and WireGuard is forced to handshake; there is now a live connection, and things work.

We’re still noodling a bit about this internally, but one thing to consider is trying a PersistentKeepalive setting on your home-net wg configuration.

Thanks for the ultra-quick reply, Thomas! You got it, that’s exactly what seems to be happening.

I think I am running with persistent-keepalive, at least that’s what wg show seems to indicate above… that’s why I’m so confused - I was expecting NAT traversal to be potentially wonky & made sure that something is in place to help with that.

I have one outlandish theory on what could be happening: I’m on an infrequently-changing dynamic IP address. Given that the backend only ever receives connections, could a renumbering have caused the fly wg peer to attempt to send its packets to my old address, with no clue what the new address is?

You might try dropping the interval on the PersistentKeepalive, which we set pretty long by default. TBH: if this was me I’d probably just set up a ping. :)

We’re still talking about this here though.

:D, yeah I added fping fdaa:0:704::3 to my wg-quick post-up script, and now the interface comes up and is ready for serving immediately.
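A minimal sketch of what that post-up hook could look like in the wg-quick config file for flytunnel0. The endpoint, allowed IPs, keepalive interval and ping target are the ones shown earlier in this thread; the keys and address are left out, and the `|| true` is an addition of this sketch so that a lost ping does not make the hook itself fail:

```
[Interface]
# PrivateKey and Address omitted here
# Send one ping across the tunnel right after it comes up, forcing a handshake.
PostUp = fping fdaa:0:704::3 || true

[Peer]
# PublicKey omitted here
Endpoint = 104.225.8.204:51820
AllowedIPs = fdaa:0:704::/48
PersistentKeepalive = 15
```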

I guess I should use something continuous, like the prometheus blackbox prober, to ensure the network connection remains up… but then I’m in “persistent-keepalive” territory again. I thought 1 packet every 15s would be enough to keep connections alive, but maybe it just can’t help the connection come to life initially.
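A lighter-weight alternative to a full prober could be a small script run from cron or a systemd timer. This is only a sketch: the function names are made up for illustration, the 180-second threshold is an arbitrary choice, and it assumes Linux `ping` plus the `wg show <interface> latest-handshakes` subcommand, which prints one "public key, tab, epoch seconds" line per peer (0 meaning no handshake yet).

```shell
#!/bin/sh
# Sketch: ping the Fly.io side whenever the last WireGuard handshake looks stale.
# handshake_is_stale, newest_handshake and check_tunnel are hypothetical names.

# Usage: handshake_is_stale LAST_EPOCH NOW_EPOCH MAX_AGE_SECONDS
# Succeeds when there has been no handshake (0) or it is older than MAX_AGE.
handshake_is_stale() {
    [ "$1" -eq 0 ] || [ $(( $2 - $1 )) -gt "$3" ]
}

# Reads "pubkey<TAB>epoch" lines on stdin and prints the newest epoch (0 if none).
newest_handshake() {
    max=0
    while read -r _key epoch; do
        if [ "$epoch" -gt "$max" ]; then
            max=$epoch
        fi
    done
    echo "$max"
}

# Usage: check_tunnel INTERFACE PING_TARGET [MAX_AGE_SECONDS]
# Needs enough privileges to run `wg show` (root, in the thread above).
check_tunnel() {
    iface=$1
    target=$2
    max_age=${3:-180}   # assumed threshold, pick whatever suits your setup
    last=$(wg show "$iface" latest-handshakes | newest_handshake)
    if handshake_is_stale "$last" "$(date +%s)" "$max_age"; then
        # One packet from the home side is enough to trigger a new handshake.
        ping -c 1 -W 2 "$target" >/dev/null 2>&1
    fi
}
```

With the interface and address from this thread, the periodic job would source the file and run `check_tunnel flytunnel0 fdaa:0:704::3`. It only sends traffic when the handshake looks stale, so it stays out of the way while the tunnel is healthy.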

Anyway - everything seems to work now that I’m sending traffic towards fly.io over the wg connection. Thanks!