I keep getting Error failed to fetch an image or build from source: error connecting to docker: failed building options: Timeout on AddWireGuardPeer.peerip when I run my deploy job on GitHub Actions. It now happens roughly 50% of the time and is sometimes solved by running fly apps destroy on the builder.
What’s actually going on here? Can I make this reliable in any way?
I’m not sure if your GitHub Action has docker available locally, but until someone from Fly responds, one thing you could quickly try is swapping the deploy command to either use or not use the remote builder and seeing which helps.
Like … swap from fly deploy to fly deploy --remote-only (or vice versa).
The Fly CLI defaults to using a local docker if it’s available, and depending on the networking, setup, etc., that can be better than using a remote builder. It’s certainly worth swapping to see which works best for you.
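If you'd rather not swap by hand, the same idea can be sketched as a small shell function that tries the remote builder first and falls back to local docker. This is only a sketch: it assumes docker is available on the CI runner, deploy_with_fallback is a made-up helper name, and FLY_BIN is a hypothetical variable added here so the CLI can be stubbed out. The --remote-only and --local-only flags are the real fly deploy flags.

```shell
#!/bin/sh
# Sketch: try the remote builder first, then fall back to local docker.
# FLY_BIN is a hypothetical override (not a flyctl setting) so the CLI
# can be replaced with a stub when testing this function.
FLY_BIN="${FLY_BIN:-fly}"

deploy_with_fallback() {
    # --remote-only forces the build onto the remote builder.
    if "$FLY_BIN" deploy --remote-only "$@"; then
        return 0
    fi
    echo "remote build failed; trying local docker instead" >&2
    # --local-only forces the build onto the local docker daemon.
    "$FLY_BIN" deploy --local-only "$@"
}

# Example call (extra arguments are passed through to fly deploy):
#   deploy_with_fallback
```

Note the fallback only helps if the local build path avoids whatever is failing remotely, so it's worth checking the logs to see which path actually ran.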
Else this may be a regression, though unlikely. That problem used to happen to me a lot (a random timeout caused by a peer delay). The temporary fix back then was to persist the config.yml file that the Fly CLI makes. That file contains a load of networking stuff, and persisting it presumably results in that being re-used. When deploying from local, of course, that file is always there, but in CI it’s generally not persisted. I say unlikely because Fly have fixed that issue, so persisting that config.yml shouldn’t be needed and deploys should take seconds. But if you want to check that out in the meantime in case it is that, it’s this big long thread:
Just a quick note, so that someone’s here communicating:
GitHub Actions jobs tend to land on our IAD WireGuard gateway. That gateway is lagging when creating new WireGuard peers, for reasons we’re still investigating (the code that syncs peers with the kernel’s wg state normally runs in negligible time, but takes over 10 seconds on IAD right now). There’s a timeout in our API waiting for responses from gateways that goes through that code path, and we’re sporadically exceeding that window.
We’re doing a couple different things right now. In the immediacy, we’ve flagged that gateway, so that future deploys should run through a different gateway (Chicago, I’m guessing?) while we finish working on it.
The gateway itself — if you already have peers there — should be fine; it’s just an issue adding new peers there.
It’s not a small paper cut; it’s a big deal, and we appreciate you calling it out to us. If the workaround we’ve put in place temporarily (routing GitHub Actions tasks, hopefully, to a different gateway) hasn’t improved things for you, please complain!
Run flyctl deploy --remote-only
==> Verifying app config
--> Verified app config
==> Building image
Waiting for remote builder fly-builder-icy-mountain-6499...
Error failed to fetch an image or build from source: failed building options: Timeout on AddWireGuardPeer.peerip
Yeah, there’s something goofy happening: GitHub Actions jobs all tend to land on a specific IAD gateway, and new WireGuard peers are starting to take multiple seconds to create there, which we’re working on. In the immediacy, we’re moving the target gateway for new peers to a different gateway.
What kills us here, for what it’s worth, is a decision we made a year and a half ago for clients to generate their own WireGuard keypairs, so that none of our servers needed to keep them. That means every time a GitHub action runs, it creates a new temporary peer, because there’s no way to “look up” a previously-used peer from our API.
We’re going to do something in flyctl to account for this tomorrow, I think.
Hey! I’m pinging back to say that we did some diagnosing work yesterday and found a change we’d rolled out that had ramped up CPU usage on the gateway machines. It didn’t break the gateways, but it did slow down the provisioning process, and on that one IAD gateway it pushed it past the timeout threshold for our API.
We had this resolved yesterday while I was responding, but I did want to shoot you a line that we’ve also root-caused it, and rolled back the offending change.
One short-term thing likely to happen here is that we’re going to beef up the gateways; they’re weak compared to the rest of our hardware.
You don’t and shouldn’t care about any of this, I’m just overcommunicating.
@thomas Are you still planning to do something about this? flyctl wireguard list lists a ton of peer connections for us since a new one is created every time our GH actions run, which ultimately makes flyctl wireguard list pretty useless for understanding anything. I’ve tried to periodically remove all the GH action peer connections but in recent days even trying to remove a single one of them takes 10+ seconds and often times out – maybe beefing up the gateways will help with that?
Is there any update on this issue? I deploy multiple times per day, and I don’t think a day has ever gone by where this wasn’t an issue. I have simply developed the unfortunate habit of sitting and watching deployments in anticipation of failure so I can retry them. Oftentimes retries fail as well, causing me to regularly waste ~15 minutes per day babysitting these deployments.
Sometimes I just give up on the day, patiently waiting for that magical day this is not an issue. It is unfortunate because I spent a lot of time and resources migrating to fly, believing it was a better and more cost effective option, yet this lingering issue is a constant pain-in-the-side that has now reached a boiling point where I’m highly motivated to make the move back off of fly.io
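For anyone stuck retrying by hand in the meantime, the manual retries described above could be automated with a small wrapper in the deploy job. This is a minimal sketch, not anything built into flyctl: retry is a made-up helper, the attempt count and delay are arbitrary, and it only makes sense for transient failures like the AddWireGuardPeer timeout.

```shell
#!/bin/sh
# Sketch: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling
# the delay between attempts. "retry" is a hypothetical helper name.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Run the command exactly as given; success ends the loop.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
}

# Example call: up to 3 attempts, waiting 10s and then 20s between them.
#   retry 3 10 flyctl deploy --remote-only
```

This retries on any failure, including genuine build errors, so keeping the attempt count low avoids burning CI minutes on a deploy that can never succeed.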