Unreliable deploys with GitHub Actions

I keep getting

Error failed to fetch an image or build from source: error connecting to docker: failed building options: Timeout on AddWireGuardPeer.peerip

when I run my deploy job on GitHub Actions. It happens maybe 50% of the time now, and is sometimes solved by running fly apps destroy on the builder.

What’s actually going on here? Can I make this reliable in any way?

I’m not sure if your GitHub Actions runner has Docker available locally, but until someone from Fly responds, one thing you could quickly try is swapping the deploy command to either use or not use the remote builder and see which helps.

Like … swap from fly deploy to fly deploy --remote-only (or vice versa).

The Fly CLI defaults to using a local Docker daemon if one is available, and depending on networking, setup, etc., that can be better than using a remote builder. It’s certainly worth swapping to see which works best for you.
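If it helps, here’s a rough sketch of what that swap looks like as GitHub Actions steps (assuming the superfly/flyctl-actions setup action; --local-only forces the local Docker path and needs Docker available on the runner):

    # sketch only: pick one of the two deploy lines depending on which builder you want
    - uses: superfly/flyctl-actions/setup-flyctl@master
    - run: flyctl deploy --remote-only    # build on Fly's remote builder
    # - run: flyctl deploy --local-only   # build with the runner's local Docker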

Else this may be a regression, though that’s unlikely. That problem used to happen to me a lot (a random timeout caused by a peer delay). The temporary fix back then was to persist the config.yml file that the Fly CLI creates. That file contains a load of networking state and presumably gets re-used. When deploying from local, that file is of course always there, but in CI it’s generally not persisted. I say unlikely because Fly have since fixed that issue, so persisting that config.yml shouldn’t be needed and deploys should take seconds. But if you want to check it out in the meantime in case that’s the cause, it’s this big long thread:
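If you do want to try that in the meantime, here’s a rough sketch of persisting it with actions/cache (this assumes flyctl keeps its config at ~/.fly/config.yml on the runner, so check where yours actually lives first):

    # sketch: cache flyctl's config between workflow runs so the networking
    # details it contains can be re-used instead of re-created every time
    - uses: actions/cache@v3
      with:
        path: ~/.fly/config.yml
        key: flyctl-config-${{ runner.os }}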

Thanks for helping!!

Unfortunately we’re already using --remote-only :frowning:

Here’s the command we run:
args: "deploy --remote-only --config fly-prod.toml --dockerfile Dockerfile --build-arg OBAN_KEY_FINGERPRINT=${{secrets.OBAN_KEY_FINGERPRINT}} --build-arg OBAN_LICENSE_KEY=${{secrets.OBAN_LICENSE_KEY}}"

Just a quick note, so that someone’s here communicating:

GitHub Actions jobs tend to land on our IAD WireGuard gateway. That gateway is lagging when creating new WireGuard peers, for reasons we’re still investigating (the code that syncs peers with the kernel’s wg state normally runs in negligible time, but takes over 10 seconds on IAD right now). Our API has a timeout while waiting for gateway responses on that code path, and we’re sporadically exceeding that window.

We’re doing a couple different things right now. In the immediacy, we’ve flagged that gateway, so that future deploys should run through a different gateway (Chicago, I’m guessing?) while we finish working on it.

The gateway itself — if you already have peers there — should be fine; it’s just an issue adding new peers there.


Thanks for the update!

It’s just a small paper cut but it gets frustrating when it happens multiple times a day.

Appreciate the work y’all are doing :raised_hands:

It’s not a small paper cut; it’s a big deal, and we appreciate you calling it out to us. If the workaround we’ve put in place temporarily (routing GitHub Actions tasks, hopefully, to a different gateway) hasn’t improved things for you, please complain!


It hasn’t happened again today!

I’m getting this error on GitHub Actions. My workflow is:

    jobs:
      deploy:
        name: Deploy app
        runs-on: ubuntu-latest
        steps:
          - name: Checkout
            uses: actions/checkout@v2
          - uses: superfly/flyctl-actions/setup-flyctl@master
          - run: flyctl deploy --remote-only

And the error I’m getting is

Run flyctl deploy --remote-only
==> Verifying app config
--> Verified app config
==> Building image
Waiting for remote builder fly-builder-icy-mountain-6499...
Error failed to fetch an image or build from source: failed building options: Timeout on AddWireGuardPeer.peerip

We’re poking at this now.

Sounds good. I saw two failures, then two successes. I’m good for now, I’ll just keep retrying but it does appear to be spotty.

Yeah, there’s something goofy happening — GitHub Actions runners all tend to land on a specific IAD gateway, and new WireGuard peers are starting to take multiple seconds to create there, which we’re working on; in the immediacy we’re moving new peers to a different target gateway.

What kills us here, for what it’s worth, is a decision we made a year and a half ago to have clients generate their own WireGuard keypairs, so none of our servers needed to keep them — which means that every time a GitHub Actions job runs, it creates a new temporary peer, because there’s no way to “look up” a previously-used peer from our API.

We’re going to do something in flyctl to account for this tomorrow, I think.


Hey! I’m pinging back to say that we did some diagnostic work yesterday and found a change we’d rolled out that had ramped up CPU usage on the gateway machines. It didn’t break the gateways, but it did slow down the provisioning process, and on that one IAD gateway it pushed things past the timeout threshold for our API.

We had this resolved yesterday while I was responding, but I did want to shoot you a line that we’ve also root-caused it, and rolled back the offending change.

One short-term thing likely to happen here is that we’re going to beef up the gateways; they’re weak compared to the rest of our hardware.

You don’t and shouldn’t care about any of this, I’m just overcommunicating.


@thomas Are you still planning to do something about this? flyctl wireguard list shows a ton of peer connections for us, since a new one is created every time our GH Actions run, which ultimately makes the output pretty useless for understanding anything. I’ve tried to periodically remove all the GH Actions peer connections, but in recent days even removing a single one takes 10+ seconds and often times out – maybe beefing up the gateways will help with that?
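For reference, this is roughly the cleanup I’ve been attempting. It’s only a sketch: it assumes the org is named personal, that the GH-created peers share a recognizable name prefix (github-actions- here is a placeholder, not a real convention), and that the peer name is the first column of the flyctl wireguard list output.

    # rough sketch: remove WireGuard peers whose names match the prefix our
    # GH Actions runs appear to create -- check `flyctl wireguard list` for
    # the actual naming before running anything like this
    flyctl wireguard list personal \
      | awk '$1 ~ /^github-actions-/ {print $1}' \
      | while read -r peer; do
          flyctl wireguard remove personal "$peer"
        done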


Hey there, I’ve been facing this issue from time to time; usually retrying all jobs solves the problem.

You can check the GitHub Action log here:

Hi all -

Is there any update on this issue? I deploy multiple times per day, and I don’t think a day has ever gone by where this wasn’t an issue. I have simply developed the unfortunate habit of sitting and watching deployments in anticipation of failure so I can retry them. Oftentimes retries fail as well, causing me to regularly waste ~15 minutes per day babysitting these deployments.

Sometimes I just give up on the day, patiently waiting for that magical day when this is not an issue. It’s unfortunate because I spent a lot of time and resources migrating to Fly, believing it was a better and more cost-effective option, yet this lingering issue is a constant pain in the side that has now reached a boiling point where I’m highly motivated to move back off of fly.io.
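In the meantime, the crude stopgap I’ve been considering is retrying the deploy inside the step itself instead of re-running the whole job by hand. A rough sketch (the attempt count and delay are arbitrary, and it only masks the underlying timeout):

    # sketch: retry flyctl deploy a few times before failing the job
    - name: Deploy to Fly (with retries)
      run: |
        for attempt in 1 2 3; do
          flyctl deploy --remote-only && exit 0
          echo "Deploy attempt ${attempt} failed; retrying in 15s..."
          sleep 15
        done
        exit 1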


This WireGuard issue should be almost entirely resolved now. Can you post the results of a failed deploy in a new topic? It sounds like something else might be going wrong.