Remote builders on macOS not working (reproducible)

Starting a new topic, as the existing one wasn’t macOS-specific.

fly deploy --remote-only does not seem to work on Apple Silicon (M1) machines, no matter what I’ve tried.

Summary
With every combination tried below, running fly deploy --remote-only gets stuck connecting to the remote builder. Error below:

Waiting for remote builder fly-builder-dawn-wood-4441... connecting ⡿ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded

This only happens on macOS or on a VM hosted inside macOS. A clean Linux VM hosted on GCP works fine, but a clean Mac hosted in a datacenter (MacStadium) does not. I tried multiple accounts, so it’s not specific to me.

Fly CLI versions tested

  • From nixpkgs: flyctl v0.0.0-1643132251+dev darwin/arm64 Commit: BuildDate: 2022-01-25T12:37:31-05:00
  • From brew latest: fly v0.0.286 darwin/arm64 Commit: cd174ea BuildDate: 2022-01-23T12:22:32Z

OS tested

  • Monterey

Connections tested

  • Residential FIOS wifi access
  • T-Mobile hotspot
  • MacStadium data center

Devices tested

  • M1 MacBook Pro at home
  • Clean MacStadium bare metal M1 in Atlanta.
  • Linux VM hosted on GCP (only thing that worked).

Accounts tested

  • Personal account created a couple of weeks ago. Validated credit card.
  • Company account created today. Validated credit card.

Let me know what more information I can provide, happy to help.

OK, this is super helpful. It’s not Mac-specific, but the details might’ve helped us track down a bug. Give us a few hours and we may have a fix for you.

@Silvio_Gutierrez give it another try? We fixed a sync issue that was preventing many newly made keys from working in Virginia. I’ve confirmed most of the peer keys I see for you now.

Just tried it on the MacStadium and it’s working.

Still not working on my local Mac. Though I had to re-sign in, so maybe a new “stuck” peer key was created.

So is this to say it wasn’t Mac-related at all, but rather that because the MacStadium machine was in Atlanta and I’m in NYC, we both got Virginia peer keys? Why does MacStadium work but not my local machine?

The Linux VM on GCP worked the whole time, but it was in us-central1, which is in Iowa.

Let me know what more info I can provide.

I’m guessing the Linux VM connected to Chicago. You can look at ~/.fly/config.yml to check; the region is in the endpoint hostname.
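
If it helps, a quick way to surface that (a rough sketch; the exact key name in config.yml may vary):

grep -i endpoint ~/.fly/config.yml   # the region code (e.g. ord, iad) shows up in the endpoint hostname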

Try running fly agent stop on your local Mac and see if that helps. If that doesn’t work, try going through the fly wg create process to set up a manual WireGuard connection, then see if you can ping your app instance IPs.

I can confirm Linux was ORD, from doing fly wg list. Trying other things now.

Alas, fly agent stop (and start) did not work.

For WireGuard, I’m not quite sure how to test this, but I did:

flyctl wg create

and saved it to testing.conf. I imported that into WireGuard, activated it, and ran:

ping 213.188.208.103 (worked)
dig 213.188.208.103 (worked)
dig 2a09:8280:1::3:1c52 (worked)

But then I deactivated WireGuard and all of the commands above still worked fine, so I’m not sure that’s a real test. Is there a way to ping an IP that’s not public to the internet?

This is all for one of the app instances. The builder has no IP exposed in the UI, so I’m not sure how to ping it.

Is there a way to force flyctl deploy to use the ORD peer?

Actually, I kept running deploy and started getting:

Error not possible to validate configuration: server returned Could not resolve App

For an app that clearly existed in the dashboard. So I deleted ~/.fly, re-ran auth, redeployed… and now it’s working.

Ah! You’ll need to run fly ips private to get a list of the private IPs. These are per VM, and different from the public anycast IPs.
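
Putting that together, the check looks roughly like this (the app name and private address below are placeholders):

fly ips private -a my-app   # lists the per-VM private IPv6 addresses
ping6 fdaa:0:1234::3        # placeholder private address; it should only answer while the tunnel is up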

You can (temporarily) copy the peer block from ~/.fly/config.yml between hosts. Then run fly agent stop, and when it comes back up it should use the ORD peer. I’m curious what happens; I’m fairly sure the IAD peers are in a good state, so you might find that the ORD peer also hangs.

Well isn’t that interesting.

Deleting ~/.fly forced your client to create a new peer, which is active now. I think the Fly agent got itself in a bad state with the old ones. fly agent stop should have worked, though.

If you run ps aux | grep "fly agent" do you see more than one running?

I do, and as I suspected, it’s because of my brew/Nix experimentation:

[nix-shell:~/Sites/silviogutierrez/reactivated/genesis/testproject]$  ps aux | grep "fly"
silviogutierrez  23747   0.0  0.1 409581872  80288 s004  S     3:47PM   0:12.01 fly agent daemon-start
silviogutierrez   9553   0.0  0.1 409262640  38272 s000  S    Mon06PM   1:43.53 flyctl agent daemon-start
silviogutierrez  25449   0.0  0.0 407965088    720 s004  R+    6:51PM   0:00.00 grep fly
silviogutierrez  25399   0.0  0.1 409254064  41248 s004  S     6:46PM   0:00.67 flyctl agent daemon-start
silviogutierrez  25169   0.0  0.1 409268608  48816 s004  S     6:37PM   0:01.05 flyctl agent daemon-start

Note the mix of flyctl and fly. I realize there may be deeply complex technical reasons to need an agent/daemon, but Nix tends to promote completely hermetic environments (I’ve done this even with PostgreSQL daemons). It would be cool to keep that in mind for the roadmap: multiple instances of the fly CLI coexisting, maybe with $FLYHOME (though that would require re-auth per instance).

There’s another popular thread out there about Rust/Nix/flyctl usage, so I’m not the only one pushing this workflow. It’s amazing to embed flyctl into the project and make launching your code trivial.

There is a deeply complex reason! Our agent is basically a userland network interface. The $FLYHOME change would apparently not help; we also have some fixes coming that will more reliably prevent multiple agents.

Go ahead and kill all those and see if it comes back?
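
One blunt way to do that, in case it helps (matches both binary names):

pkill -f "agent daemon-start"   # kills fly and flyctl agents alike; the next flyctl command will start a fresh one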

It’s important that, for a given Fly.io user, irrespective of how many hermetically sealed home directories you’ve got on your machine, there only be one running agent.

Right now, we rely on $HOME/.fly/fly-agent.sock being a static path, so that when we start a new agent (for any reason), we can kill off the old one.

The sole purpose for the agent is to handle multiple WireGuard connections — without the agent, if you run flyctl ssh console in one window, and then again in another window (or, for that matter, flyctl dig or flyctl proxy), the more recent flyctl will kill the session of the previous one.
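
For what it’s worth, a quick way to sanity-check that invariant locally:

ls -l "$HOME/.fly/fly-agent.sock"      # the single well-known socket path
ps aux | grep "[a]gent daemon-start"   # should show at most one agent, regardless of which binary launched it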

We’re discussing this thread internally! Thank you for sharing your thoughts. From what I understand, I think a $FLYHOME variable would actually make this worse?

Good as dead; they did not come back.

Then running flyctl deploy --remote-only once more gets stuck (ps shows only one daemon running).

Then running flyctl agent stop and retrying still doesn’t work.

Then running flyctl agent stop, deleting ~/.fly, re-running auth, and redeploying does work.

So it’s a little funky, but for now, just ensuring a single fly daemon, deleting ~/.fly, and re-authing seems somewhat reliable.
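
For reference, the reset boils down to something like this (it wipes local config, so you have to log in again):

flyctl agent stop
pkill -f "agent daemon-start"   # catch any stray agents from other installs
rm -rf ~/.fly
flyctl auth login
flyctl deploy --remote-only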

Can you try installing the latest prerelease?

curl -L https://fly.io/install.sh | sh -s pre

It has some major fixes to the agent which might help with this.

Also, I took a peek at the Nix package and saw that it’s an old version built slightly differently from our main releases. I would recommend using our builds directly if possible, since we release frequently and can make sure upgrade paths are smooth. If Nix is something folks want, maybe there’s a way to push our builds from our CI.

Here’s a barbaric POC of PostgreSQL instances per project: reactivated/shell.nix at fc7e7896834d5c4a8a1d3d2ebc7781dfe5706b7d · silviogutierrez/reactivated · GitHub

(Unix socket paths are limited to 104 characters, so they can’t be colocated in deeply nested project paths. Fun.)

After running the above shellHook, running psql will connect to that instance and that instance only.
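
The shell side of that idea boils down to something like this (a rough sketch, not the actual shellHook from the repo; the socket directory and port are arbitrary):

export PGHOST=/tmp/myproject-pg                     # short path, since Unix socket paths cap out around 104 chars
mkdir -p "$PGHOST"
[ -d .postgres ] || initdb -D .postgres             # per-project data directory
pg_ctl -D .postgres -o "-k $PGHOST -p 5433" start
psql -h "$PGHOST" -p 5433 postgres                  # connects to this project's instance and nothing else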

I think having a .sock per instance for flyctl wouldn’t make it worse, so long as auth were not based on FLYHOME. Really, you’d basically want FLYSOCKET and FLYHOME: the latter rarely changing, and FLYSOCKET set to an arbitrary folder per project. Of course, auth per project is not the end of the world.

With the above, in theory, fly ssh would work the same way psql does: a client that connects to its own Postgres instance. But I don’t know the WireGuard internals well enough to know whether, even beyond fly, there’s some WireGuard black magic with global state that can’t be isolated.
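
To make that concrete, the usage I’m imagining (purely hypothetical; FLYSOCKET doesn’t exist, and FLYHOME here is just the proposal from above):

export FLYHOME="$HOME/.fly"                    # global auth/config, rarely changes
export FLYSOCKET="$PWD/.fly/fly-agent.sock"    # hypothetical per-project agent socket
flyctl ssh console                             # would talk to (or start) the project-local agent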

Lots of random thoughts; hopefully some of that is helpful for your internal discussions. Feel free to ping me for any more background.

I’ll definitely keep this in mind. I’m biased, but I think Nix is the future of… everything. In any case, so long as the upstream Nix package is kept up to date, I can use unstable nixpkgs to always get the latest published version without needing to upgrade the rest of the project (Nix lets you mix and match).

And if you can’t figure out a workaround for publishing quickly, one could always have a Nix package that just downloads the latest release, so long as you’re able to provide a “stable” URL that always serves the latest binary. Per platform is fine.

Rough, untested example:

{ pkgs }:

with pkgs;

let
  # Hypothetical "latest" URLs; the sha256 hashes would have to be bumped on every release.
  download = fetchurl (if stdenv.isDarwin then {
    url = "https://fly.io/downloads/mac-latest";
    sha256 = "17sfsc7by94xn9728x21zfh74xiw0davg6sz9bpq8dd7rvcas7is";
  } else {
    url = "https://fly.io/downloads/linux-latest";
    sha256 = "00qbi9aqvpmspgrav6sk6ck1v91x2x8sm418kagdzwbaw70i7jrr";
  });

in derivation {
  name = "flyctl";
  inherit coreutils download;
  builder = "${bash}/bin/bash";
  args = [
    "-c"
    ''
      unset PATH
      export PATH=$coreutils/bin
      mkdir -p $out/bin
      cp $download $out/bin/fly    # install the downloaded binary as fly
      chmod +x $out/bin/fly
    ''
  ];

  system = builtins.currentSystem;
}

Then again, once the CLI stabilizes a bit, it may not be needed. Just food for thought.

It looks like there’s a bot that scrapes GitHub releases and submits updates, though not that often. What concerns me is that this PR is almost a month old, and therefore many versions behind our current release.

I pinged the Nix community to see what they think about this.