flyctl agent

A quick heads up about how I’m restructuring flyctl.

We currently have a big UX problem with user-mode WireGuard, which is that you can only have one WireGuard connection at a time (each connection is separately IPv6-addressed), and we can’t reasonably make new WireGuard connections for every run of flyctl. What happens now is, if you flyctl ssh into an instance in one window, and then do the same thing in another window, you’ll kill the first session when your second steals its IPv6 address.

It’s not the end of the world right now (we could just “lock” and only allow one SSH session at a time, and most people wouldn’t care) but it’s a real buzzkill if we want to use flyctl’s WireGuard for other stuff.

What I’m doing now is:

  • Adding flyctl agent daemon-start and flyctl agent start: a long-lived Unix domain socket (UDS) server, and a command that forks off that server as a background process.

  • Adding pkg/agent to flyctl with a simple text RPC protocol for the agent.

  • Moving most WireGuard logic into the agent, and driving it from normal flyctl with a client of that agent.
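For a rough idea of what a "simple text RPC" over a UDS can look like in Go, here is a minimal sketch. It is not the actual pkg/agent code: the one-line-per-request framing, the ping command, the ok/err reply prefixes, and every function name here are assumptions made up for illustration. (The error output further down the thread does hint at space-separated commands like instances followed by an org and an app, but the real wire format isn't spelled out here.)

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// handle answers a single newline-terminated request on conn.
func handle(conn net.Conn) {
	defer conn.Close()
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return
	}
	fields := strings.Fields(line)
	if len(fields) == 0 {
		fmt.Fprintln(conn, "err empty command")
		return
	}
	switch fields[0] {
	case "ping": // hypothetical health-check command
		fmt.Fprintln(conn, "ok pong")
	default:
		fmt.Fprintf(conn, "err bad command: %s\n", fields[0])
	}
}

// serve accepts connections until the listener is closed.
func serve(l net.Listener) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		go handle(conn)
	}
}

// call sends one command to the socket and returns the one-line reply.
func call(sock, cmd string) (string, error) {
	conn, err := net.Dial("unix", sock)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := fmt.Fprintln(conn, cmd); err != nil {
		return "", err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(reply), nil
}

// startDemo runs the server on a socket in a fresh temp directory and
// returns the socket path plus a cleanup function.
func startDemo() (string, func(), error) {
	dir, err := os.MkdirTemp("", "agent-demo")
	if err != nil {
		return "", nil, err
	}
	sock := filepath.Join(dir, "agent.sock")
	l, err := net.Listen("unix", sock)
	if err != nil {
		os.RemoveAll(dir)
		return "", nil, err
	}
	go serve(l)
	return sock, func() { l.Close(); os.RemoveAll(dir) }, nil
}

func main() {
	sock, stop, err := startDemo()
	if err != nil {
		fmt.Println("start:", err)
		return
	}
	defer stop()
	reply, err := call(sock, "ping")
	fmt.Println(reply, err)
}
```

One request per connection keeps the client trivial; a real agent would presumably want longer-lived connections for anything that streams data.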

The agent handles an arbitrary number of different orgs. The UDS is $HOME/.fly/fly-agent.sock, protected by Unix filesystem permissions. The client exports a Dialer that you should ideally be able to drop into anything that accepts a DialContext.
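To make the "drop into anything that accepts a DialContext" part concrete, here is a sketch of the shape such a Dialer could take. The Dialer type below is a hypothetical stand-in, not the one in pkg/agent: it just counts calls and dials over the normal network, where the real one would route the connection through the agent's WireGuard tunnel. The point is only that a method with this signature plugs straight into http.Transport.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Dialer is a hypothetical stand-in for the agent client's dialer.
type Dialer struct {
	dials int32 // number of DialContext calls, for the demo only
}

// DialContext has the same signature as (*net.Dialer).DialContext, so it
// fits anywhere that takes a DialContext func. This stub dials directly;
// the real thing would go through the agent instead.
func (d *Dialer) DialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	atomic.AddInt32(&d.dials, 1)
	var nd net.Dialer
	return nd.DialContext(ctx, network, addr)
}

// fetch does an HTTP GET with every connection opened via d.
func fetch(d *Dialer, url string) (string, error) {
	tr := &http.Transport{DialContext: d.DialContext}
	defer tr.CloseIdleConnections()
	client := &http.Client{Transport: tr}
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demo fetches from a local test server and reports the body and how
// many times the custom dialer was invoked.
func demo() (string, int32, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	}))
	defer srv.Close()
	d := &Dialer{}
	body, err := fetch(d, srv.URL)
	return body, atomic.LoadInt32(&d.dials), err
}

func main() {
	body, dials, err := demo()
	fmt.Println(body, dials, err)
}
```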

This is going to be corner-casey AF. Everything I use on my Macbook that spawns a long-lived agent ends up leaking agents or orphaning its socket or whatever. I’m trying to be aggressive about this; there’s a single EstablishFlyAgent fn that will verify it can actually talk to the agent, and, if it can’t, will zap the UDS and fork off a new agent. Spawning a new agent is always preceded by killing the previous one, if possible.
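The verify-or-replace logic could look roughly like the sketch below. To be clear about what is assumed: establish, ping, and spawnInProcess are made-up names, the ping/pong exchange is invented, and the "agent" here is a goroutine rather than a forked process so the example runs on its own. It also leaves out the kill-the-previous-agent step, which would need a pid or similar that this sketch doesn't track.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// ping reports whether something at sock answers "ping" with "pong".
func ping(sock string) error {
	conn, err := net.DialTimeout("unix", sock, 500*time.Millisecond)
	if err != nil {
		return err
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(500 * time.Millisecond))
	if _, err := fmt.Fprintln(conn, "ping"); err != nil {
		return err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return err
	}
	if strings.TrimSpace(reply) != "pong" {
		return errors.New("unexpected reply: " + reply)
	}
	return nil
}

// establish returns nil once a live agent answers on sock. If nothing
// answers, it removes the stale socket path and calls spawn to start one.
func establish(sock string, spawn func(sock string) error) error {
	if ping(sock) == nil {
		return nil // a healthy agent is already there
	}
	if err := os.Remove(sock); err != nil && !os.IsNotExist(err) {
		return err
	}
	if err := spawn(sock); err != nil {
		return err
	}
	// Give the new agent a moment to start listening.
	for i := 0; i < 100; i++ {
		if ping(sock) == nil {
			return nil
		}
		time.Sleep(20 * time.Millisecond)
	}
	return errors.New("agent did not come up")
}

// spawnInProcess stands in for forking a background agent.
func spawnInProcess(sock string) error {
	l, err := net.Listen("unix", sock)
	if err != nil {
		return err
	}
	go func() {
		for {
			conn, err := l.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				defer c.Close()
				line, _ := bufio.NewReader(c).ReadString('\n')
				if strings.TrimSpace(line) == "ping" {
					fmt.Fprintln(c, "pong")
				}
			}(conn)
		}
	}()
	return nil
}

// demo leaves a dead file at the socket path, then calls establish twice
// and returns how many times an agent had to be spawned.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "agent")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "fly-agent.sock")
	if err := os.WriteFile(sock, nil, 0o600); err != nil {
		return 0, err
	}
	spawns := 0
	spawn := func(s string) error { spawns++; return spawnInProcess(s) }
	if err := establish(sock, spawn); err != nil {
		return spawns, err
	}
	err = establish(sock, spawn) // agent is alive now, so no second spawn
	return spawns, err
}

func main() {
	spawns, err := demo()
	fmt.Println("spawns:", spawns, "err:", err)
}
```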

What I’m hoping we can do is get this stabilized in a branch, a branch we all use for a couple weeks, before we inflict it on the world.

The other huge problem we’re going to have is Windows. What I expect to do here is stub out the client/agent stuff so all this stuff happens in-process on Windows — you won’t be able to run multiple WireGuard connections in different sessions on Windows, though.

Thoughts welcome.

I think more than 1 connection at a time is necessary. People are going to be deploying multiple apps that work together and probably will want to debug them at the same time.

Windows support is ok to forego for now. I expect this won’t be a huge % of our users and it might work under WSL?

Random: when I saw the title I first thought it was going to be about a program people can run on their 3rd-party infrastructure to connect to their private network via WireGuard. “Agent” is a fairly common name for “things you run on your own servers to expose something to a given service”. I know, in this case, that it’s more like ssh-agent though! This is a nit, but maybe it could be flyctl ssh agent? It’s all good either way. Maybe the “connect to your private network from anywhere” could be named flyctl tunnel or something.

So flyctl agent will mostly be transparent to the user? Nice.

This also reminds me: Why do we need to manually call flyctl ssh establish? :slight_smile: That could just be called automatically if we detect there isn’t any key for the user in the org.

I for one welcome this agent!

What happened here is I wrote an internal comment about flyctl and Kurt dEcLasSiFiEd it, and I will be careful about making the mistake of writing internal comments in complete sentences from now on, lest this befall me again.

In the meantime: this sort of works now. If you’re the kind of person who gets a kick out of playing around with Fly stuff for fun: you can check out the wireguard-agent branch of flyctl, and then run make in the repo to build a flyctl that can… SSH in multiple different windows or tmux sessions or whatever.

The agent commands are under flyctl agent: daemon-start runs the agent on its own, without detaching, and with a stdout so you can see logs or whatever; stop will try (:grimacing:) to kill the agent (but also you can just kill it like a normal process). The idea, though, is that you’re not supposed to have to know it even exists, so you can just ssh like normal, except hopefully better.

My feeling is that the agent should also die after N minutes of inactivity, since starting a new agent is basically free.
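One way the idle shutdown could work, sketched in Go: keep a timer that gets pushed back on every request, and exit when it finally fires. Nothing like this exists in the branch as far as this post says; idleWatch and Touch are invented names, and the demo uses milliseconds where the real agent would use minutes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// idleWatch calls onIdle once, after timeout has passed with no Touch.
type idleWatch struct {
	timeout time.Duration
	timer   *time.Timer
}

func newIdleWatch(timeout time.Duration, onIdle func()) *idleWatch {
	var once sync.Once // guards against a fire racing with a late Touch
	return &idleWatch{
		timeout: timeout,
		timer:   time.AfterFunc(timeout, func() { once.Do(onIdle) }),
	}
}

// Touch records activity and pushes the shutdown deadline out again.
func (w *idleWatch) Touch() { w.timer.Reset(w.timeout) }

// demo touches the watch a few times, then goes quiet. It reports whether
// the watch fired while still active (it should not) and whether it fired
// once activity stopped (it should).
func demo(timeoutMs, everyMs, touches int) (firedEarly, firedLate bool) {
	timeout := time.Duration(timeoutMs) * time.Millisecond
	done := make(chan struct{})
	w := newIdleWatch(timeout, func() { close(done) })
	for i := 0; i < touches; i++ {
		time.Sleep(time.Duration(everyMs) * time.Millisecond)
		w.Touch() // in an agent, this would happen on each request
	}
	select {
	case <-done:
		firedEarly = true
	default:
	}
	select {
	case <-done:
		firedLate = true
	case <-time.After(10 * timeout):
	}
	return firedEarly, firedLate
}

func main() {
	early, late := demo(500, 50, 4)
	fmt.Println("fired while active:", early, "fired after going idle:", late)
}
```

In an agent, onIdle would close the listener and exit, and the next flyctl run would just start a fresh one.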

Let me know if this infuriates anybody here, or if you have suggestions.

Appears to be working!

I found 2 issues from preliminary testing:

  • I don’t think connecting to apps from different orgs works with a single agent?
    err handling establish: can't get wireguard state for personal: garbage stored in wireguard_state in config

  • --select doesn’t appear to work:
    $ bin/flyctl ssh console --select -a myapp
    Error look up myapp: failed to retrieve instances: err bad command: [instances%!(EXTRA string=myorg, string= myapp)]

Good catch; --select works now.

I just tested multiple organizations and it worked fine? We should debug on Slack. :slight_smile: