I keep getting “Error failed to fetch an image or build from source: error connecting to docker: failed building options: Timeout on AddWireGuardPeer.peerip” when I run my deploy job on GitHub Actions. It now happens roughly 50% of the time, and is sometimes solved by running fly apps destroy on the builder.
What’s actually going on here? Can I make this reliable in any way?
I’m not sure if your GitHub Action has Docker available locally, but until someone from Fly responds, one thing you could quickly try is to swap the deploy command to either use or not use the remote builder and see which helps.
Like … swap from fly deploy to fly deploy --remote-only (or vice versa).
The Fly CLI defaults to using a local Docker if it’s available, and depending on the networking, setup, etc., that can be better than using a remote builder. It’s certainly worth swapping to see which works best for you.
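If you’d rather not swap by hand, here is a minimal sketch of a wrapper your CI job could call instead of the deploy command directly. It tries fly deploy, retries on failure, then falls back to fly deploy --remote-only. The function name, retry count and delay are my own assumptions for illustration, not anything Fly provides, and it only treats a non-zero exit code as failure, so it won’t distinguish the WireGuard timeout from a genuine build error.

```python
import subprocess
import time

# The two deploy variants mentioned above; both come from this thread.
DEPLOY_VARIANTS = [
    ["fly", "deploy"],
    ["fly", "deploy", "--remote-only"],
]


def deploy_with_fallback(variants=DEPLOY_VARIANTS, attempts_per_variant=2,
                         delay_seconds=10.0, run=subprocess.run,
                         sleep=time.sleep):
    """Try each deploy variant in turn, retrying on a non-zero exit code.

    Returns the command that succeeded, or None if every attempt failed.
    `run` and `sleep` are injectable (hypothetical parameters) so the logic
    can be exercised without actually calling the Fly CLI.
    """
    for command in variants:
        for attempt in range(1, attempts_per_variant + 1):
            result = run(command)
            if result.returncode == 0:
                return command
            print(f"{' '.join(command)} failed (attempt {attempt})")
            # Pause after each failure in case the timeout is transient.
            sleep(delay_seconds)
    return None
```

In a CI step you would call deploy_with_fallback() and exit non-zero if it returns None. Swap the order of DEPLOY_VARIANTS if the remote builder turns out to be the more reliable option for you.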
Else this may be a regression, though that’s unlikely. That problem used to happen to me a lot (a random timeout caused by a peer delay). The temporary fix back then was to persist the config.yml file that the Fly CLI makes. That file contains a load of networking stuff, and persisting it presumably results in that being re-used. When deploying from local, of course, that file is always there, but in CI it’s generally not persisted. I say unlikely because Fly have fixed that issue, so persisting that config.yml shouldn’t be needed and deploys should take seconds. But if you want to check that out in the meantime, in case it’s that too, it’s this big long thread:
Just a quick note, so that someone’s here communicating:
GitHub Actions jobs tend to land on our IAD WireGuard gateway. That gateway is lagging creating new WireGuard peers for reasons we’re still investigating (the code that syncs peers with the kernel’s wg state normally runs in negligible time, but takes over 10 seconds on IAD right now). There’s a timeout in our API waiting for responses from gateways that goes through that code path, and we’re sporadically exceeding that window.
We’re doing a couple of different things right now. In the immediate term, we’ve flagged that gateway, so that future deploys should run through a different gateway (Chicago, I’m guessing?) while we finish working on it.
The gateway itself — if you already have peers there — should be fine; it’s just an issue adding new peers there.
It’s not a small paper cut; it’s a big deal, and we appreciate you calling it out to us. If the workaround we’ve put in place temporarily (routing GitHub Actions tasks, hopefully, to a different gateway) hasn’t improved things for you, please complain!