Starting a new topic, as the existing one wasn’t macOS-specific.
fly deploy --remote-only does not seem to work on Apple Silicon (M1) machines, no matter what I’ve tried.
Summary
No matter which of the combinations below I try, running fly deploy --remote-only gets stuck connecting to the remote builder. Error below:
Waiting for remote builder fly-builder-dawn-wood-4441... connecting ⡿ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded
This only happens on macOS or in a VM hosted inside macOS. A clean Linux VM hosted on GCP works fine, but a clean Mac hosted in a data center (MacStadium) does not. I tried multiple accounts, so it’s not specific to mine.
Fly CLI versions tested
From nixpkgs: flyctl v0.0.0-1643132251+dev darwin/arm64 Commit: BuildDate: 2022-01-25T12:37:31-05:00
From brew latest: fly v0.0.286 darwin/arm64 Commit: cd174ea BuildDate: 2022-01-23T12:22:32Z
OS tested
Monterey
Connections tested
Residential FiOS Wi-Fi access
T-Mobile hotspot
MacStadium data center
Devices tested
M1 MacBook Pro at home
Clean MacStadium bare metal M1 in Atlanta.
Linux VM hosted on GCP (only thing that worked).
Accounts tested
Personal account created a couple of weeks ago. Validated credit card.
Company account created today. Validated credit card.
Let me know what more information I can provide, happy to help.
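If more verbose output would help, I believe flyctl honors LOG_LEVEL, so I can rerun the deploy like this (assuming that variable is still the right knob):

```bash
# rerun the failing deploy with verbose client-side logging
LOG_LEVEL=debug fly deploy --remote-only
```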
OK, this is super helpful. It’s not Mac-specific, but the details might’ve helped us track down a bug. Give us a few hours and we may have a fix for you.
@Silvio_Gutierrez give it another try? We fixed a sync issue that was preventing many newly made keys from working in Virginia. I’ve confirmed most of the peer keys I see for you now.
Still not working on my local Mac. Though I had to re-sign in, so maybe a new “stuck” peer key was created.
So is this to say it wasn’t Mac-related at all, but rather that, since the MacStadium machine is in Atlanta and I’m in NYC, we both got Virginia peer keys? Why does the MacStadium machine work but not my local one?
The Linux VM on GCP worked the whole time, but it was in us-central1, which is in Iowa.
I’m guessing the Linux VM connected to Chicago. You can check ~/.fly/config.yml and see; the region is in the endpoint hostname.
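For example, something like this should show the peer entries the agent knows about (the exact key names vary a bit between flyctl versions, but the region code shows up in the endpoint hostname):

```bash
# list the WireGuard endpoint entries in the local flyctl config
grep -i endpoint ~/.fly/config.yml
```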
Try running fly agent stop on your local Mac and see if that helps? If that doesn’t work, try going through the fly wg create process to set up a manual WireGuard connection. Then see if you can ping your app instance IPs.
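Roughly, the manual route looks like this (the exact prompts and flags depend on your flyctl version, and wg-quick needs wireguard-tools installed):

```bash
fly agent stop                 # stop the background agent
fly wg create                  # prompts for org/region and generates a peer config; save it as e.g. fly0.conf
sudo wg-quick up ./fly0.conf   # bring the tunnel up by hand
# then try pinging the app's private IPs over the tunnel
```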
But then I deactivated WireGuard and all of the commands above worked fine too, so I’m not sure that’s a real test. Is there a way to ping an IP that isn’t publicly reachable on the internet?
This is all for one of the app instances. The builder has no IP exposed in the UI, so I’m not sure how to ping it.
Is there a way to force flyctl deploy to use the ORD peer?
Ah! You’ll need to run fly ips private to get a list of the private IPs. These are per VM, different from the public anycast IPs.
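Something like this, where my-app and the address are placeholders:

```bash
fly ips private -a my-app      # per-VM private (fdaa::) addresses, not the public anycast IPs
ping6 fdaa:0:1:a7b::3          # replace with an address from the list above
```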
You can (temporarily) copy the peer block from ~/.fly/config.yml between hosts. Then run fly agent stop and when it comes back up it should use the ORD peer. I’m curious what happens, I’m fairly sure the IAD peers are in a good state so you might find that the ORD peer also hangs.
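Very roughly, with a placeholder hostname (merge carefully so you don’t clobber the local auth token):

```bash
# grab the config from the host with the working ORD peer
scp gcp-vm:~/.fly/config.yml /tmp/ord-config.yml
# hand-merge just the peer block from /tmp/ord-config.yml into the local
# ~/.fly/config.yml, then restart the agent and retry
fly agent stop
fly deploy --remote-only
```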
Deleting ~/.fly forced your client to create a new peer, which is active now. I think the Fly agent got itself in a bad state with the old ones. fly agent stop should have worked, though.
If you run ps aux | grep "fly agent" do you see more than one running?
Note the mix of flyctl and fly. I realize there may be deeply complex technical reasons to need an agent/daemon, but nix tends to promote completely hermetic environments (I’ve done it even with PostgreSQL daemons). It would be cool to keep that in mind for the roadmap: multiple instances of the fly CLI coexisting, maybe with $FLYHOME (though that would require re-auth per instance).
There’s another very popular thread out there about Rust/nix/flyctl usage, so I’m not the only one pushing this. It’s amazing to embed flyctl into the project and make launching your code trivial.
There is a deeply complex reason! Our agent is a userland network interface, basically. The $FLYHOME change would, apparently, not help; we also have some fixes coming that will more reliably prevent multiple agents.
Go ahead and kill all those and see if it comes back?
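Something like:

```bash
ps aux | grep "[f]ly agent"    # the [f] keeps grep itself out of the results
pkill -f "fly agent"           # kill any stray agents
# the next command that needs the tunnel (e.g. fly ssh console) starts a fresh one
```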
It’s important that, for a given Fly.io user, irrespective of how many hermetically sealed home directories you’ve got on your machine, there only be one running agent.
Right now, we rely on $HOME/.fly/fly-agent.sock being a static path, so that when we start a new agent (for any reason), we can kill off the old one.
The sole purpose for the agent is to handle multiple WireGuard connections — without the agent, if you run flyctl ssh console in one window, and then again in another window (or, for that matter, flyctl dig or flyctl proxy), the more recent flyctl will kill the session of the previous one.
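Concretely, every flyctl invocation on the machine dials that one socket instead of standing up its own tunnel, which is what lets things like these run at the same time (app names are placeholders):

```bash
ls -l "$HOME/.fly/fly-agent.sock"   # the shared agent socket

fly ssh console -a app-one          # terminal 1
fly dig app-two.internal            # terminal 2, concurrently
```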
We’re discussing this thread internally! Thank you for sharing your thoughts. From what I understand, I think a $FLYHOME variable would actually make this worse?
The latest flyctl release has some major fixes to the agent, which might help with this.
Also, I took a peek at the nix package and saw that it’s an old version built slightly differently from our main releases. I would recommend using our builds directly if possible, since we release frequently and can make sure upgrade paths are smooth. If Nix is something folks want, maybe there’s a way to push our builds from our CI.
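For context, the kind of shellHook I mean looks roughly like this (a simplified sketch, not my exact setup; paths and options are illustrative):

```bash
# project-local Postgres, started from a nix shellHook
export PGDATA="$PWD/.postgres"                 # data dir lives inside the project
export PGHOST="/tmp/$(basename "$PWD")-pg"     # short socket dir (see the note on path limits below)
mkdir -p "$PGHOST"

# initialize the cluster on first entry into the shell
if [ ! -d "$PGDATA" ]; then
  initdb --no-locale --encoding=UTF8 > /dev/null
fi

# start Postgres bound only to that socket (no TCP), unless it's already up
pg_ctl status > /dev/null 2>&1 || \
  pg_ctl -l "$PGDATA/pg.log" -o "-k $PGHOST -c listen_addresses=''" start
```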
(Unix socket paths are limited to 104 characters, so they can’t be colocated in deeply nested project paths. Fun.)
After running the above shellHook, running psql will connect to that instance and that instance only.
I think having a .sock per instance for flyctl wouldn’t make it worse, so long as auth were not based on FLYHOME. Really, you’d basically want FLYSOCKET and FLYHOME: the latter rarely changing, and FLYSOCKET set to an arbitrary folder per project. Of course, auth per project is not the end of the world.
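Purely hypothetically (neither variable exists today, this is just the proposal), usage might look like:

```bash
export FLYHOME="$HOME/.fly"                  # auth + global config, rarely changes
export FLYSOCKET="/tmp/my-project-fly.sock"  # per-project agent socket, kept short
fly ssh console                              # would talk only to this project's agent
```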
With the above, in theory, fly ssh would be the same as psql: each psql client connects to its own Postgres instance. But I don’t know the WireGuard internals well enough to know whether, even beyond fly, there’s some WireGuard black magic that has global state and can’t be isolated.
Lots of random thoughts; hopefully some of them are helpful for your internal discussions. Feel free to ping me for any more background.
I’ll definitely keep this in mind. I’m biased, but I think nix is the future of… everything. In any case, so long as the flyctl package in nixpkgs is kept up to date, I can use unstable nixpkgs to always get the latest published version without needing to upgrade the rest of the project (nix lets you mix and match).
And if you can’t figure out a way to publish quickly, one can always have a “nix” package that just downloads the latest release, so long as you provide a “stable” URL that always serves the latest binary. Per platform is fine.
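Fly already exposes one such stable URL via its install script, for what it’s worth:

```bash
# always fetches the current release for the current platform
curl -L https://fly.io/install.sh | sh
```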
It looks like there’s a bot that scrapes GitHub releases and submits updates, though not that often. What concerns me here is that this PR is almost a month old, so it’s many versions behind our current release.
I pinged the Nix community to see what they think about this.