tldr: we fixed a bunch of private network bugs in a pre-release you can try:
curl -L https://fly.io/install.sh | sh -s pre
A lot of people have been having issues connecting to VMs over the private network lately, particularly with remote builders. We had a heck of a time reproducing some of them, but after digging in for the past two weeks we’ve made progress. Here are some of the noteworthy problems and fixes…
Agent state
Last month we released the new flyctl agent, which is a daemon to multiplex multiple connections through a WireGuard tunnel. The agent worked great under normal use, but edge cases we hadn’t considered let it get into a bad state that was hard to recover from.
The most common issue was a WireGuard peer being removed from the config file (e.g. by flyctl wg remove) while the agent continued to use it. The agent would also use peers in the config file without validating that they existed in the API first. We’re now performing cleanup and validation of peers in the config and active tunnels.
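To make the cleanup rule concrete, here is a minimal sketch of that kind of reconciliation. It is not the actual flyctl code: reconcilePeers and its inputs (peer names from the config file, names of active tunnels, and the set of peers the API knows about) are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"sort"
)

// reconcilePeers is a hypothetical helper, not flyctl's real implementation.
// It keeps only config peers that the API still knows about, and reports
// active tunnels whose peer is no longer a valid config entry.
func reconcilePeers(configPeers, activeTunnels []string, apiPeers map[string]bool) (keep, closeTunnels []string) {
	valid := make(map[string]bool)
	for _, p := range configPeers {
		if apiPeers[p] {
			keep = append(keep, p)
			valid[p] = true
		}
	}
	for _, t := range activeTunnels {
		if !valid[t] {
			closeTunnels = append(closeTunnels, t)
		}
	}
	sort.Strings(keep)
	sort.Strings(closeTunnels)
	return keep, closeTunnels
}

func main() {
	keep, stale := reconcilePeers(
		[]string{"laptop", "ci"},          // peers in the local config file
		[]string{"laptop", "old-desktop"}, // tunnels the agent currently holds
		map[string]bool{"laptop": true},   // peers the API reports (assumed input)
	)
	fmt.Println("keep:", keep)
	fmt.Println("close:", stale)
}
```

In this example "ci" is dropped because the API does not know it, and the "old-desktop" tunnel is closed because its peer is no longer in the config.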
Verify WireGuard tunnel before using the dialer
Propagating new WireGuard peers to gateways sometimes takes a few minutes, especially for faraway regions. If we attempted to use a tunnel before the peer was ready, the DNS resolver threw a misleading i/o timeout error. We’re now detecting this and testing that the tunnel is available before using the connection.
Use TCP instead of UDP for DNS
We test WireGuard tunnels by resolving TXT _apps.internal. If the organization had a lot of apps, the response would sometimes be larger than the 512-byte limit for DNS over UDP and throw an error that looked like the connection had failed. We switched the tunnel’s resolver to TCP, which fixed that issue and also produces better error messages when resolution actually fails.
Connect to remote builders through the agent
For whatever reason, we didn’t update the remote builder session code to use the agent. The result was that remote builder sessions would kill agent connections, and the agent reconnecting would in turn disconnect remote builder sessions. We’re now using the agent for everything, so connections aren’t interrupted.
If you’re dying to see the actual fixes, check out Release v0.0.233-pre-1 · superfly/flyctl · GitHub.