Unreliable deploys with Github Actions

I keep getting this error when I run my deploy job on GitHub Actions:

Error failed to fetch an image or build from source: error connecting to docker: failed building options: Timeout on AddWireGuardPeer.peerip

It now happens maybe 50% of the time, and is sometimes solved by running fly apps destroy on the builder.

What’s actually going on here? Can I make this reliable in any way?

I’m not sure if your GitHub Action has docker available locally, but until someone from Fly responds, one thing you could quickly try is swapping the deploy command to either use or not use the remote builder, and seeing which helps.

Like … swap from fly deploy to fly deploy --remote-only (or vice versa).

The Fly CLI defaults to using a local docker if it’s available, and depending on the networking, setup, etc., that can be better than using a remote builder. It’s certainly worth swapping to see which works best for you.
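
If you want the workflow to make that choice itself, here is a rough sketch of the idea in shell. The function name pick_builder_flag is made up for illustration, and it assumes that docker info succeeding means a usable local daemon and that your fly version accepts --local-only as the counterpart to --remote-only:

```shell
# Hypothetical helper: choose a builder flag based on whether a local
# Docker daemon answers. "docker info" exits non-zero if no daemon is reachable.
pick_builder_flag() {
  if docker info >/dev/null 2>&1; then
    echo "--local-only"
  else
    echo "--remote-only"
  fi
}

# Example use in a run step (not executed here):
#   fly deploy "$(pick_builder_flag)" --config fly-prod.toml
```

This only automates the swap described above; it doesn’t tell you which builder is actually more reliable for your setup, so it’s still worth trying both by hand first.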

Otherwise this may be a regression, though that’s unlikely. That problem used to happen to me a lot (a random timeout caused by a peer delay). The temporary fix back then was to persist the config.yml file that the Fly CLI makes. That file contains a load of networking stuff, and persisting it presumably results in that being reused. When deploying from local, of course, that file is always there, but in CI it’s generally not persisted. I say unlikely because Fly have fixed that issue, so persisting that config.yml shouldn’t be needed and deploys should take seconds. But if you want to check that out in the meantime, in case it is that, it’s this big long thread:

Thanks for helping!!

Unfortunately we’re already using --remote-only :frowning:

Here’s the command we run:
args: "deploy --remote-only --config fly-prod.toml --dockerfile Dockerfile --build-arg OBAN_KEY_FINGERPRINT=${{secrets.OBAN_KEY_FINGERPRINT}} --build-arg OBAN_LICENSE_KEY=${{secrets.OBAN_LICENSE_KEY}}"

Just a quick note, so that someone’s here communicating:

Github Actions jobs tend to land on our IAD WireGuard gateway. That gateway is lagging creating new WireGuard peers for reasons we’re still investigating (the code that syncs peers with the kernel’s wg state normally runs in negligible time, but takes over 10 seconds on IAD right now). There’s a timeout in our API waiting for responses from gateways that goes through that code path, and we’re sporadically exceeding that window.

We’re doing a couple different things right now. In the immediate term, we’ve flagged that gateway, so that future deploys should run through a different gateway (Chicago, I’m guessing?) while we finish working on it.

The gateway itself — if you already have peers there — should be fine; it’s just an issue adding new peers there.
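
Since the failure described here is a sporadic timeout when adding a new peer, one stopgap on the CI side is simply to retry the deploy a few times. This is a minimal sketch, not anything Fly recommends in this thread: the retry function and the RETRY_DELAY variable are made up for illustration, and it assumes a failed deploy exits non-zero and is safe to run again.

```shell
# Hypothetical helper: run a command up to N times, doubling the wait
# between attempts. RETRY_DELAY (seconds) is an assumed knob for this sketch.
retry() {
  max_attempts=$1
  shift
  attempt=1
  delay=${RETRY_DELAY:-5}
  while true; do
    # Run the command; stop as soon as it succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example use in a run step (not executed here):
#   retry 3 fly deploy --remote-only --config fly-prod.toml
```

Note that this retries on any failure, including a genuinely broken build, so keep the attempt count low. It would also need to live in a run step rather than in the args line of a deploy action.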

Thanks for the update!

It’s just a small paper cut but it gets frustrating when it happens multiple times a day.

Appreciate the work y’all are doing :raised_hands:

It’s not a small paper cut; it’s a big deal, and we appreciate you calling it out to us. If the workaround we’ve put in place temporarily (routing Github Actions tasks, hopefully, to a different gateway) hasn’t improved things for you, please complain!

It hasn’t happened again today!