WireGuard and Remote Builder Fixes in flyctl

TL;DR: we fixed a bunch of private network bugs in a prerelease you can try:

curl -L https://fly.io/install.sh | sh -s pre

A lot of people have been having issues connecting to VMs over the private network lately, particularly with remote builders. We had a heck of a time reproducing some of them, but after digging in for the past two weeks we made progress. Here are some of the noteworthy problems and fixes.

Agent state

Last month we released the new flyctl agent, which is a daemon to multiplex multiple connections through a WireGuard tunnel. The agent worked great under normal use, but edge cases we hadn’t considered let it get into a bad state that was hard to recover from.

The most common issue was the agent continuing to use a WireGuard peer after it had been removed from the config file (e.g. with flyctl wg remove). The agent would also use peers in the config file without first validating that they existed in the API. We’re now performing cleanup and validation of peers in both the config and active tunnels.
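To make the validation step concrete, here is a minimal sketch in Go. It is not the actual flyctl code: the Peer type, the prunePeers function, and the shape of the API response (a set of peer names) are all assumptions made for illustration.

```go
package main

import "fmt"

// Peer is a hypothetical, simplified view of a WireGuard peer entry
// in the local config file.
type Peer struct {
	Org  string
	Name string
}

// prunePeers splits the locally configured peers into those the API still
// knows about (valid) and those it does not (stale). The remote set is an
// assumed stand-in for whatever the API returns.
func prunePeers(local []Peer, remote map[string]bool) (valid, stale []Peer) {
	for _, p := range local {
		if remote[p.Name] {
			valid = append(valid, p)
		} else {
			stale = append(stale, p)
		}
	}
	return valid, stale
}

func main() {
	local := []Peer{
		{Org: "personal", Name: "laptop"},
		{Org: "personal", Name: "old-desktop"},
	}
	remote := map[string]bool{"laptop": true}

	valid, stale := prunePeers(local, remote)
	fmt.Println("keep:", valid)
	fmt.Println("remove from config and close tunnels:", stale)
}
```

The stale list is what a cleanup pass would act on: drop the entry from the config and tear down any tunnel still bound to it.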

Verify WireGuard tunnel before using the dialer

Propagating new WireGuard peers to gateways sometimes takes a few minutes, especially for faraway regions. If we attempted to use a tunnel before the peer was ready, the DNS resolver threw a misleading i/o timeout error. We now detect this and test that the tunnel is available before using the connection.
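One way to picture "test the tunnel before using it" is a retry loop around a readiness probe. The sketch below is an illustration under assumptions, not flyctl’s implementation: probeFunc, waitForTunnel, and simulate are hypothetical names, and the probe here is a stub rather than a real DNS query through a tunnel.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probeFunc is a hypothetical readiness check, for example a DNS query
// sent through the tunnel.
type probeFunc func(ctx context.Context) error

// waitForTunnel calls probe until it succeeds or ctx expires, sleeping
// interval between attempts. On timeout it reports the last probe error
// instead of a bare timeout.
func waitForTunnel(ctx context.Context, probe probeFunc, interval time.Duration) error {
	for {
		err := probe(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("tunnel not ready (last probe error: %v): %w", err, ctx.Err())
		case <-time.After(interval):
		}
	}
}

// simulate runs waitForTunnel against a stub probe that fails the given
// number of times before succeeding, and returns how many attempts were made.
func simulate(failures int) (attempts int, err error) {
	probe := func(context.Context) error {
		attempts++
		if attempts <= failures {
			return errors.New("i/o timeout")
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	err = waitForTunnel(ctx, probe, 5*time.Millisecond)
	return attempts, err
}

func main() {
	attempts, err := simulate(2)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Wrapping the last probe error into the timeout message is the part that addresses the "misleading error" problem: the caller learns the peer was not ready yet, not just that something timed out.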

Use TCP instead of UDP for DNS

We test WireGuard tunnels by resolving the TXT record for _apps.internal. If the organization has a lot of apps, the response would sometimes exceed the 512-byte limit for DNS over UDP and throw an error that looked like the connection had failed. We switched the tunnel’s resolver to TCP, which fixed that issue and also produces better error messages when resolution actually fails.
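With Go’s standard library, one common way to force DNS over TCP is to set a custom Dial hook on net.Resolver that ignores the requested network. The sketch below shows that general technique; it is not taken from flyctl. The names forceTCP, newTCPResolver, and dialedWith, plus the server address, are placeholders for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

type dialFunc func(ctx context.Context, network, address string) (net.Conn, error)

// forceTCP wraps dial so every DNS connection uses TCP and goes to server,
// whatever network and address the resolver asked for.
func forceTCP(dial dialFunc, server string) dialFunc {
	return func(ctx context.Context, _, _ string) (net.Conn, error) {
		return dial(ctx, "tcp", server)
	}
}

// newTCPResolver builds a stdlib resolver that sends its queries over TCP.
func newTCPResolver(dial dialFunc, server string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // use Go's built-in resolver so the Dial hook applies
		Dial:     forceTCP(dial, server),
	}
}

// dialedWith reports which network and address the wrapped dialer ends up
// using. It uses a fake dialer, so no real connection is made.
func dialedWith(network, server string) (gotNetwork, gotAddr string) {
	fake := func(_ context.Context, n, a string) (net.Conn, error) {
		gotNetwork, gotAddr = n, a
		return nil, errors.New("fake dialer: no connection made")
	}
	_, _ = forceTCP(fake, server)(context.Background(), network, "192.0.2.1:53")
	return gotNetwork, gotAddr
}

func main() {
	// Placeholder DNS server address (documentation prefix), not a real endpoint.
	const server = "[2001:db8::53]:53"

	fmt.Println(dialedWith("udp", server))

	// In real use the dialer would go through the tunnel; a plain net.Dialer
	// stands in for it here.
	d := &net.Dialer{Timeout: 5 * time.Second}
	r := newTCPResolver(d.DialContext, server)
	_ = r // r.LookupTXT(ctx, "_apps.internal") would run the query over TCP
}
```

Because TCP has no 512-byte message ceiling, a long TXT answer arrives intact, and a genuine connection failure surfaces as a dial error rather than a truncated response.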

Connect to remote builders through the agent

For whatever reason, we hadn’t updated the remote builder session code to use the agent. As a result, remote builder sessions would kill agent connections, and the agent reconnecting would in turn disconnect remote builder sessions. We’re now using the agent for everything, so connections aren’t interrupted.


If you’re dying to see the actual fixes, check out release v0.0.233-pre-1 of superfly/flyctl on GitHub.

Just a quick note on the state update lags Michael mentions in the “Verify WireGuard tunnel” section:

A few important bits of state for your organizations — most importantly, DNS entries and WireGuard peer information — are synchronized through HashiCorp Consul.

Consul is great, but we are pushing our deployment of it to some limits, and as a result there are some updates that should happen very fast that instead take dozens of seconds. The flyctl changes here patch around that lag (@rugwiro and @michael made my very bad error handling much more resilient).

But the lag is itself bad! We’re working on that too. DNS, in particular, is painful; your instance can be up and responsive to traffic and working for customers, but flyctl can’t talk to it directly until DNS propagates, which sometimes makes things feel like they aren’t working as well as they are. Not ok!

We’ll have updates there as well, a bunch of different things, so that state propagation will hopefully soon stop being a thing we have to think about much.

Thanks for bearing with us!

While testing out this prerelease we added a few more bug fixes and improvements:

  • Buildpack builds lasting longer than ~30s were hanging once we started to put remote builds through the agent. The agent’s TCP proxy was waiting for both the client and server to close their connections, but dockerd kept its side open until we explicitly called CloseWrite after we finished reading from the client.
  • We use a custom DNS resolver to forward private DNS queries through the WireGuard tunnel. Custom DNS resolvers aren’t supported on Windows builds. This caused readiness checks against the tunnel and host to fail, and also caused errors when fetching the list of available hosts with the -s flag of flyctl ssh console. We fixed these issues by swapping stdlib’s net.Resolver for new resolver functions based on the fantastic miekg/dns package.
  • flyctl ssh console now provides feedback as it tests the tunnel and host before starting the SSH session.
  • SIGINT now works on Windows when an SSH session or remote build is in progress.
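
The CloseWrite fix in the first bullet is easier to see in code. Below is a minimal sketch of a TCP proxy that half-closes each direction once it has finished copying, so a server that waits for EOF before replying (as dockerd did in our case) is not left hanging. This is an illustration only, not the agent’s actual proxy; pipe, proxy, and demo are hypothetical names, and the backend is a toy server.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync"
)

// halfCloser is satisfied by *net.TCPConn, which can close only its write side.
type halfCloser interface {
	CloseWrite() error
}

// pipe copies src to dst, then half-closes dst so the peer sees EOF while
// it can still send its remaining data back.
func pipe(dst, src net.Conn) {
	_, _ = io.Copy(dst, src)
	if hc, ok := dst.(halfCloser); ok {
		_ = hc.CloseWrite()
	} else {
		_ = dst.Close()
	}
}

// proxy shuttles bytes both ways and returns once both directions are done.
func proxy(client, server net.Conn) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); pipe(server, client) }()
	go func() { defer wg.Done(); pipe(client, server) }()
	wg.Wait()
	client.Close()
	server.Close()
}

// demo wires a client, the proxy, and a toy backend together on loopback.
// The backend replies only after it has read everything up to EOF.
func demo() (string, error) {
	backend, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer backend.Close()
	go func() {
		c, err := backend.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		body, _ := io.ReadAll(c) // returns only once the sender half-closes
		fmt.Fprintf(c, "got %d bytes", len(body))
	}()

	front, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer front.Close()
	go func() {
		c, err := front.Accept()
		if err != nil {
			return
		}
		s, err := net.Dial("tcp", backend.Addr().String())
		if err != nil {
			c.Close()
			return
		}
		proxy(c, s)
	}()

	conn, err := net.Dial("tcp", front.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte("hello")); err != nil {
		return "", err
	}
	if err := conn.(*net.TCPConn).CloseWrite(); err != nil {
		return "", err
	}
	reply, err := io.ReadAll(conn)
	return string(reply), err
}

func main() {
	reply, err := demo()
	fmt.Println(reply, err)
}
```

If the CloseWrite call in pipe is removed, the backend’s read never sees EOF and both sides wait on each other, which is the same shape as the hang described above.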

Unless we find any major issues we’ll be releasing this tomorrow morning.