Remote Builds Failing

Remote builds seem to be timing out trying to reach the builder.
I tried destroying the app and letting it re-create a new one. Same result.

==> Building with Dockerfile
Using Dockerfile Builder: Containerfile
INFO Remote only, hooking you up with a remote Docker builder...
INFO Waiting for remote builder to become available...
Error Could not ping remote builder within 5 minutes, aborting.

I'm having the same issue with multiple apps. I first noticed it on Saturday.

@bdd looks like your builder did not launch because we lacked capacity in your “home” region. It doesn’t look like capacity is a problem from our metrics, so this might be a bug. I’ll look into it!

Every app in the same organization shares the same builder. If you had trouble with one app, you’d also have trouble with the others.

I looked up your builder and it seems to have worked correctly within the last ~2 hours. Was it broken before? Did you see the same error as the OP? Error Could not ping remote builder within 5 minutes, aborting.


I am deploying with the GitHub Action. It was working fine last week. Here are some logs from today:

Deploying sigle-staging
==> Validating App Configuration
--> Validating App Configuration done
Services
TCP 80/443 ⇢ 3000 

Deploy source directory '/github/workspace'
==> Building with Dockerfile
Using Dockerfile Builder: sigle/Dockerfile
INFO Remote only, hooking you up with a remote Docker builder...
INFO Waiting for remote builder to become available...
Error Could not ping remote builder within 5 minutes, aborting.

The latest flyctl update (v0.0.186) has a fix that should make connections to docker remote daemons more reliable. Could you update and try again and see if it’s fixed?
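
For readers wondering what "Could not ping remote builder within 5 minutes" amounts to: the message describes a poll-until-deadline pattern, where the client keeps pinging the builder until it answers or a time budget runs out. The sketch below illustrates that general pattern in Python. It is not flyctl's actual code; `wait_until_ready`, its parameters, and the choice to treat `OSError` as "not ready yet" are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    ping: Callable[[], bool],
    timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call ping() until it returns True or `timeout` seconds have elapsed.

    Hypothetical helper: returns True as soon as a ping succeeds and
    False once the deadline passes without a successful ping.
    """
    deadline = clock() + timeout
    while True:
        try:
            if ping():
                return True
        except OSError:
            # Assumption: connection errors mean "not ready yet", keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a 5 minute budget this would be called as `wait_until_ready(ping_builder, timeout=300)`, where `ping_builder` is whatever function performs the actual health check. A monotonic clock is used so that wall-clock adjustments do not shorten or lengthen the wait.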

I just got this as well, for the first time today. I did update to v0.0.186. It gets to INFO Remote builder is ready to build! and then stops. Builder logs:

2021-03-08T23:24:04.425Z e9dcf7b2 lax [info] time="2021-03-08T23:24:04.408735079Z" level=info msg="received SIGUSR1, resetting job deadline"
2021-03-08T23:24:05.114Z e9dcf7b2 lax [info] time="2021-03-08T23:24:05.097545990Z" level=debug msg="Calling HEAD /_ping"
2021-03-08T23:24:05.394Z e9dcf7b2 lax [info] time="2021-03-08T23:24:05.377465491Z" level=debug msg="Calling HEAD /_ping"
2021-03-08T23:24:05.641Z e9dcf7b2 lax [info] time="2021-03-08T23:24:05.624383452Z" level=debug msg="Calling HEAD /_ping"
2021-03-08T23:24:05.715Z e9dcf7b2 lax [info] time="2021-03-08T23:24:05.698194066Z" level=debug msg="Calling HEAD /_ping"
2021-03-08T23:24:05.745Z e9dcf7b2 lax [info] time="2021-03-08T23:24:05.729521124Z" level=debug msg="Calling HEAD /_ping"
2021-03-08T23:24:06.208Z e9dcf7b2 lax [info] time="2021-03-08T23:24:06.184536486Z" level=debug msg="Calling POST /v1.40/build?buildargs=%7B%7D&cachefrom=null&cgroupparent=&cpuperiod=0&cpuquota=0&cpusetcpus=&cpusetmems=&cpushares=0&dockerfile=&labels=null&memory=0&memswap=0&networkmode=&platform=linux%2Famd64&rm=0&shmsize=0&t=registry.fly.io%2Fdish-image-quality%3Adeployment-1615245843&target=&ulimits=null&version="

@nate that VM had this in the logs a few times:

Handler for POST /v1.40/build returned error: Error processing tar file(exit status 1): unexpected EOF

I haven’t seen that before but I’ll try to reproduce. How big is the docker context for your build? Also could you try running flyctl deploy like this to see if it’s still failing?

DOCKER_BUILDKIT=0 flyctl deploy ...

That may have been it. It’s largely similar to the Dockerfile in the contrib/tf_serving folder of the idealo/image-quality-assessment repository on GitHub (Convolutional Neural Networks to predict the aesthetic and technical quality of images).

Edit: actually it’s a bit different, basically:

FROM tensorflow/tensorflow:2.0.0-py3

# Install system packages
RUN apt-get update && apt-get install -y --no-install-recommends \
      bzip2 \
      g++ \
      git \
      graphviz \
      libgl1-mesa-glx \
      libhdf5-dev \
      openmpi-bin \
      wget && \
    rm -rf /var/lib/apt/lists/*

Which one may have been it: the large context, or disabling BuildKit? I’m trying to reproduce by sending a few GB to Docker to build, but it seems to be working.

Don’t think it’s buildkit:

Oh ok, after a few minutes it picked it up, perhaps it’s the context! Maybe an upload indicator would help there. I’m on a slow internet connection for another week or two.

Okay great. I just tested with a 3 GB context and it also took several minutes to start with no feedback. We’ll start printing the context size and add a progress indicator.
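
Until the context size is printed by the tool itself, a rough local estimate can help explain a long silent wait on a slow connection. The following is a minimal Python sketch, not part of flyctl. The function names `context_size_bytes` and `human_readable` are made up for illustration, and the sketch assumes the whole directory is sent: it does not honor `.dockerignore`, so it may overestimate what is actually uploaded.

```python
import os


def context_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # count neither the link nor its target
            total += os.path.getsize(path)
    return total


def human_readable(num_bytes: int) -> str:
    """Format a byte count with binary units (B, KiB, MiB, GiB)."""
    units = ["B", "KiB", "MiB", "GiB"]
    size = float(num_bytes)
    index = 0
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.1f} {units[index]}"


if __name__ == "__main__":
    # Estimate the context for the current directory.
    print(human_readable(context_size_bytes(".")))
```

Run it from the directory you deploy from. If the number is in the gigabytes, a `.dockerignore` that excludes build artifacts and dependency folders is probably worth adding.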


I started getting this one just now on Big Sur. I know it’s due to SIP, but I’m not sure why it’s suddenly happening:

Using Dockerfile Builder: /Users/n8/app/Dockerfile
Error lchown /var/folders/ss/ppy_gf99651fh1lcfms7xhrm0000gn/T/799315719/tmp: operation not permitted

I think this is an issue with the archiver that builds the context rather than SIP. I haven’t been able to reproduce it myself, but after debugging with a few other folks it seems like the archiver has issues following symlinks (e.g. node_modules hoisting) or incorrectly copies permissions to a temp dir. This keeps coming up, so I’m going to work on a fix this week.
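
If you want to check whether symlinks are a likely factor in your own project, you can list the ones inside the build context. This is a small Python sketch under that assumption. `find_symlinks` is a hypothetical helper, not something flyctl provides, and finding symlinks does not by itself prove they caused the lchown error.

```python
import os
from typing import List, Tuple


def find_symlinks(root: str) -> List[Tuple[str, str]]:
    """Return sorted (link_path, target) pairs for every symlink under root."""
    links = []
    # followlinks=False: symlinked directories are reported but not entered.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links.append((path, os.readlink(path)))
    return sorted(links)


if __name__ == "__main__":
    for link, target in find_symlinks("."):
        print(f"{link} -> {target}")
```

Run from the directory containing your Dockerfile, it prints each link and where it points, which makes hoisted or workspace-linked packages easy to spot.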


@nate this got held up a bit, but the archiver was updated this week. We’re no longer copying your entire context to a temp directory, so I expect lchown/permission errors to go away and for archiving to be much faster. Give it a try and let me know if you have any issues.