Deploy error: unable to connect WireGuard tunnel

Just a heads-up about a new error I’ve been getting when deploying:

...
Executing command: flyctl deploy --app NAME --remote-only --strategy bluegreen                                    
WARN no config file found at /root/.fly/config.yml                                                                                                               
==> Verifying app config                                                                                                                                         
--> Verified app config                                                                                                                                          
==> Building image                                                                                                                                               
Waiting for remote builder fly-builder-blue-sun-7870... connecting ⣾ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded

Whenever I get a deploy error I just destroy the builder app, try again, and usually that works. But not this time. Fails again. Hmm.

...
==> Building image                                                                                                                                               
Waiting for remote builder fly-builder-rough-shadow-8960... connecting ⢿ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded 

Any issues at the moment?

Thanks!

Hey, nothing specific going on now. This error means there was a timeout connecting to your Fly.io network. If you have Docker installed, can you try with a local deployment (without --remote-only)?

Strange. Using a local Docker is a bit of a pain on a Mac. That deployment was being done from a server which has not got Docker installed and which usually works (well, after deleting the builder after random failures). Its network etc is fine so can’t think what else it would be.

Ah well I’ll try it again. Thanks.

It would be good to know if remote builds start working again for you. This year we’ll be looking into adding more debugging tools for problems like this.

Meanwhile you could try installing Docker on the server and running a ‘local’ build there.

Yep, remote builds are working again. Connecting to the remote builder is slow, but it does happen eventually and proceeds from there.

I still get a little downtime, even with a bluegreen deploy (I tried default, canary, and now bluegreen to see if it helps) but that is a separate issue (how do fly.io deploys work?). I have seen some bonus errors in the logs which seem new though:

2022-01-01T20:04:44.504 proxy[fde84576] lhr [error]error.code=2 error.message="Internal problem"
2022-01-01T20:04:46.614 proxy[c0d00137] cdg [error]error.code=2 error.message="Internal problem"
2022-01-01T20:04:51.589 proxy[fde84576] lhr [error]error.code=2 error.message="Internal problem"
2022-01-01T20:04:55.427 proxy[c0d00137] cdg [error]error.code=2 error.message="Internal problem"
2022-01-01T20:05:00.324 proxy[fde84576] lhr [error]error.code=2 error.message="Internal problem"

But then they go away. Ah well, it’s working now.

Just tried my first deployment of a Remix app from WSL1, and got the same error:

Waiting for remote builder fly-builder-bitter-haze-3347... connecting ⡿ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded

I’m getting the same thing. My first try with fly. From WSL1, too.

Waiting for remote builder fly-builder-throbbing-frost-7559… connecting ⣟ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded

I got past that problem by upgrading to WSL 2.

(I think WSL 1 was trying to do a local build and docker will not work in WSL 1.)

Getting this again, on Mac. Seems to be a bit random, 50/50.

Waiting for remote builder fly-builder-green-cherry-3653... connecting ⣯ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded

Generally it will fail once or twice, I’ll destroy the builder app, try a deploy again, and usually then it works. Does waste a bit of time doing all that though. Just a heads up.

Update: destroying the builder didn’t work. Ah well.

==> Building image                                                                                                                                               
Waiting for remote builder fly-builder-floral-silence-2084... connecting ⡿ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded

Have there been any thoughts/progress on this issue, team Fly? Other people are getting it so it’s not just me :slight_smile:

I had it again just now when I wanted to push out a fix. Not sure I can install docker on the server, so looks like either doing local deploys only, or rolling the dice on a remote deploy.

Waiting for remote builder fly-builder-silent-butterfly-4748... connecting ⢿ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded

Hey, if you don’t mind helping me debug this: how long has this particular builder (fly-builder-silent-butterfly-4748) been around?

Probably a day or so, as generally what happens is I deploy, it fails, I destroy the builder, and repeat.

So I’ve already destroyed that builder :slight_smile:

Each builder app is under no load as I only do one deploy at a time so the only issue is it seems to not be able to connect to … something at your end. Generally it will sit there for X minutes, waiting, and then time out and fail.


I’ve now got another builder app to work with since I just tried another deploy. So it created another builder app, and waited and waited and waited to connect and … failed.

So this is a brand new builder app, a few minutes old. So it can’t have any junk or left over files taking up space, or be under any load. Just gives up after maybe three minutes trying to connect. I noticed it stuck and started timing and it waited at least two minutes before failing.

It’s on a CI server, so it’s not my home internet at fault or anything like that. And deploys have worked in the past. They just randomly don’t. Like now.

==> Verifying app config                                                                                                                                         
--> Verified app config                                                                                                                                          
==> Building image                                                                                                                                               
Waiting for remote builder fly-builder-solitary-shape-9303... connecting ⡿ Error failed to fetch an image or build from source: error connecting to docker: unable to connect WireGuard tunnel: context deadline exceeded 

Before you destroy more builders, will you try some other commands and see if they work? Also make sure you’re on the latest flyctl version:

fly agent stop
fly dig txt _apps.internal
fly machines list -a fly-builder-solitary-shape-9303

I’m not sure deleting builders is actually helping, exactly. There are a bunch of things between you and the builder that could be going wrong and timing out.

Also, how are you running deploy? Are you doing it locally or hitting issues from a CI system?

With a CI system which makes it trickier to debug.

I got the idea about destroying the builder app to ensure it got a new, latest one from here, as I recall, from one of my prior run-ins with it:

So that’s been my go-to “fix” as it randomly does work afterwards. But it’s not ideal and that may be a coincidence.

You may be on to something with the list command though as first, it seemed the builder app has no machines …

~ $ fly machines list -a fly-builder-solitary-shape-9303
ID	IMAGE	CREATED	STATE	REGION	NAME	IP ADDRESS 
~ $ fly apps list
NAME                           	OWNER 	STATUS   	LATEST DEPLOY        
...	
fly-builder-solitary-shape-9303	name	pending
...

Which would explain why it’s waiting forever for it, if it’s still pending. Interesting. It still is.

So I tried running that command again, and this time it does have a machine listed … but it still shows as pending …

~ $ fly apps list
NAME                           	OWNER 	STATUS   	LATEST DEPLOY        
...
fly-builder-solitary-shape-9303	name	pending  	                    	
...

~ $ fly machines list -a fly-builder-solitary-shape-9303
ID      	IMAGE      	CREATED                      	STATE  	REGION	NAME           	IP ADDRESS                   
89006c93	flyio/rchab	2022-01-17 15:26:37 +0000 UTC	started	iad   	wispy-rain-5546	fdaa:0:5d9:a7b:21e0:0:78bc:2

The latest deploy I just tried failed again, same issue. So maybe it is the ‘pending’ that is the cause.

Or … maybe could it be the region? The builder is in iad, but the app I want to deploy has one instance … in lhr.

As for the CLI, it is the latest version, as I install the CLI fresh each time:

Executing command: /root/.fly/bin/fly version                                                                                                                    
WARN no config file found at /root/.fly/config.yml                                                                                                               
flyctl v0.0.282 linux/amd64 Commit: 02c46ec BuildDate: 2022-01-12T17:44:39Z  

I also tried running those suggested commands from the CI script too:

Executing command: /root/.fly/bin/fly agent stop                                                                                                                 
WARN no config file found at /root/.fly/config.yml                                                                                                               
Error can't connect to agent: dial unix /root/.fly/fly-agent.sock: connect: no such file or directory 

… no agent running, but I guess that’s to be expected there.

Executing command: /root/.fly/bin/fly dig txt _apps.internal                                                                                                     
Reading environment variable exporting file contents.                                                                                                            
Reading environment variable exporting file contents.                                                                                                            
WARN no config file found at /root/.fly/config.yml                                                                                                               
Error get app: Could not resolve App 

Didn’t like that either. Failed with an error.

The same process (remote, deploy, using builder) did work two days ago when I last did a deploy, using the same CI process, server etc, so that’s why it’s strange. If it was a network issue you’d think it would never work or always work. Hmm.

Maybe it is the builder being stuck at pending and/or the region it is in. Maybe deleting builders has just happened to move one into running and/or create one in lhr. I’ve not looked at the builder app status or region before.

Ah, the most likely problem here is just slow propagation of wireguard peers. CI builds create a new wireguard peer on each run and those can take 2-3 minutes to come up. They’re stored in .fly/config.yml, which doesn’t get reused between CI runs.

For CI, you should use the local Docker. Change your command to fly deploy --local-only to force that. This will save you a tremendous number of headaches.

It’s normal for a builder app to show as pending, you can ignore that.


Ah …

So … it’s a race where the propagation starts and the builder waits … and if the propagation happens to complete in, say, two minutes (ie before the builder times out) it proceeds on and deploys … but if not (like it takes four minutes) the builder times out waiting and fails? Interesting. That would indeed explain what’s been happening and why on each CI run it just randomly works or doesn’t with nothing else apparently changing.
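If the failure really is a race against peer propagation, one workaround is to retry the deploy within the same CI job instead of destroying the builder. Below is a minimal sketch of that idea. The `retry_cmd` helper, the attempt count and the delay are all hypothetical and not part of flyctl. It also assumes that a later attempt in the same job can reuse the peer created by the first one, which this thread does not confirm.

```shell
#!/bin/sh
# retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is
# reached, sleeping DELAY_SECONDS between attempts.
retry_cmd() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Run the command exactly as given; stop on the first success.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example CI usage (NAME is a placeholder, as in the logs above):
# retry_cmd 3 60 fly deploy --app NAME --remote-only --strategy bluegreen
```

This only papers over the timeout. It does nothing to speed up the propagation itself.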

You are of course right and using a local docker would indeed avoid the issue entirely :slight_smile: But … it doesn’t work. I’ve been down that road before with this!

As it turns out Codefresh uses docker and all builds take place in docker containers. But you can’t access the docker from inside of that docker container. As shown when trying --local-only. Sigh. Which is why remote builds are great as they solve that issue:

...                                                                                                                            
Validating connection to Docker daemon...                                                                                                                        
Connection to Docker daemon validated 
...
Executing command: /root/.fly/bin/fly deploy --app name --build-arg STAGE=dev --local-only --strategy bluegreen
WARN no config file found at /root/.fly/config.yml                                                                                                               
==> Verifying app config                                                                                                                                         
--> Verified app config                                                                                                                                          
==> Building image                                                                                                                                               
Error failed to fetch an image or build from source: docker is unavailable to build the deployment image 
...
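For a CI script that runs in more than one environment, a rough sketch of picking the build mode automatically is shown here. The `choose_build_flag` helper is hypothetical. It assumes that `docker info` failing is a good enough signal that the daemon cannot be reached from inside the build container. The two flags are the ones already used in this thread.

```shell
#!/bin/sh
# choose_build_flag: hypothetical helper that prints the flyctl build flag
# to use, depending on whether a Docker daemon is reachable from this shell.
choose_build_flag() {
  # "docker info" talks to the daemon, so it fails when only the client
  # binary is present or the socket is not exposed inside the container.
  if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    printf '%s\n' "--local-only"
  else
    printf '%s\n' "--remote-only"
  fi
}

# Example CI usage (app name is a placeholder):
# fly deploy --app name "$(choose_build_flag)" --strategy bluegreen
```

In a setup like the Codefresh one described above, this would fall back to `--remote-only`, so the WireGuard timing issue still applies there.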

Any ideas of timescale for fixing the slow propagation of wireguard peers?

Only I’m thinking that not only would that solve this issue, it would also solve the issue of up to a minute of app downtime on deploys (and so the need for a minimum of 3 instances to kind of bypass that issue).

And then: :rocket:

If you can cache and reuse /root/.fly/config.yml, it might help.

There’s no timeframe for fixing the propagation problems. We’re having to rebuild a tremendous amount of our infrastructure to fix both this and the service staleness problem.
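The suggestion to cache and reuse `/root/.fly/config.yml` could be sketched roughly as follows. The helper names, the `FLY_CONFIG` variable and the cache directory argument are assumptions for illustration. How a directory is persisted between runs depends on the CI system and is not shown.

```shell
#!/bin/sh
# Path of the flyctl config file; defaults to the location seen in the logs
# when HOME is /root. FLY_CONFIG is a hypothetical variable, not a flyctl one.
FLY_CONFIG="${FLY_CONFIG:-$HOME/.fly/config.yml}"

# restore_fly_config CACHE_DIR
# Copy a previously saved config (if any) into place before running fly.
restore_fly_config() {
  cache_dir=$1
  if [ -f "$cache_dir/config.yml" ]; then
    mkdir -p "$(dirname "$FLY_CONFIG")"
    cp "$cache_dir/config.yml" "$FLY_CONFIG"
    chmod 600 "$FLY_CONFIG"
  fi
}

# save_fly_config CACHE_DIR
# Copy the current config into the cache directory after running fly.
save_fly_config() {
  cache_dir=$1
  if [ -f "$FLY_CONFIG" ]; then
    mkdir -p "$cache_dir"
    cp "$FLY_CONFIG" "$cache_dir/config.yml"
    chmod 600 "$cache_dir/config.yml"
  fi
}

# Example CI usage (the cache path is a placeholder):
# restore_fly_config /ci-cache/fly
# fly deploy --app NAME --remote-only --strategy bluegreen
# save_fly_config /ci-cache/fly
```

Since the file holds the WireGuard peer details, treat the cache directory as sensitive and keep it out of build artifacts.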
